Writing academic articles using R Sweave and LaTeX

One of my favourite activities in R is using Markdown to create business reports. I export most of this work to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX because it produces superior typography. This article explains how to write academic articles and essays by combining R Sweave and LaTeX. The article is formatted in accordance with the requirements of the APA (American Psychological Association).

To illustrate the principles of using R Sweave and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

R Sweave and LaTeX using the APA journal template

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the respondents' current and ideal body shape and the body shape of the other sex that they find most attractive. The results confirm previous research which found that body dissatisfaction is significantly higher for women than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape that men find most attractive. You can read the full paper on my personal website.

Body shape measurement scale.

R Sweave and LaTeX

The R file for this essay uses the Sweave package to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The figure is created within the document, saved as a PDF file and integrated back into the document as an image with an appropriate label and caption.

<<echo=FALSE, results=tex>>=
body <- read.csv("body_image.csv")
# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99), 
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal

library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents, caption = "Age profile of survey participants", 
                    label = "gender-age", digits = 0)
@
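
The figure mentioned above is produced by a similar chunk. The sketch below is not the exact code from the paper, but it shows the pattern: the fig=TRUE option tells Sweave to save the plot as a PDF and include it in the document, while the surrounding LaTeX provides the caption and label. The histogram of the Current_Ideal variable is an assumed example plot.

\begin{figure}
<<echo=FALSE, fig=TRUE, width=5, height=4>>=
# Hypothetical example plot: difference between current and ideal body shape
hist(body$Current_Ideal, main = "",
     xlab = "Current minus ideal body shape")
@
\caption{Distribution of the difference between current and ideal body shape.}
\label{current-ideal}
\end{figure}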

Configuration

I created this file in RStudio, using the Sweave and knitr functionality. To knit the R Sweave file for this paper, you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.
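
For reference, a minimal apa6 preamble could look something like the sketch below. The title, author and affiliation are placeholders, not the details from the actual paper.

\documentclass[man]{apa6}
\title{Body Image Essay}      % placeholder title
\shorttitle{Body Image}
\author{Author Name}          % placeholder author
\affiliation{Institution}     % placeholder affiliation

\begin{document}
\maketitle
% Text and Sweave code chunks go here
\end{document}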

Pandigital Products: Euler Problem 32

Euler Problem 32 returns to pandigital numbers, which are numbers that contain each of the digits exactly once. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base 10 in the Online Encyclopedia of Integer Sequences (A050278).

The Numberphile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.

Euler Problem 32 Definition

We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.

The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.

Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.

HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.

Proposed Solution

The pandigital.9 function tests whether a vector of digits forms a 9-pandigital number. The pandigital.prod vector stores the identified products.

The only way to solve this problem is to brute-force all multiplications, but we can limit the solution space to a manageable size. The multiplicand, multiplier and product together need to have nine digits. For example, when the multiplicand has two digits, the multiplier should have three digits so that the product has four digits, e.g. 39 × 186 = 7254. When the multiplicand has only one digit, the multiplier needs to have four digits.

pandigital.9 <- function(x) # Test if string is 9-pandigital
    (length(x)==9 & sum(duplicated(x))==0 & sum(x==0)==0)

t <- proc.time()
pandigital.prod <- vector()
i <- 1
for (m in 2:100) {
    if (m < 10) n_start <- 1234 else n_start <- 123
    for (n in n_start:round(10000 / m)) {
        # List of digits
        digs <- as.numeric(unlist(strsplit(paste0(m, n, m * n), "")))
        # is Pandigital?
        if (pandigital.9(digs)) {
            pandigital.prod[i] <- m * n
            i <- i + 1
            print(paste(m, "*", n, "=", m * n))
        }
    }
}
answer <- sum(unique(pandigital.prod))
print(answer)

Numbers can also be checked for pandigitality using mathematics instead of strings.
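
As a sketch of that idea, the function below extracts the digits with modulo arithmetic instead of string splitting and checks that the digits 1 to 9 each appear exactly once. The function name pandigital.math is my own addition and is not part of the solution above.

# Digit-based check without string conversion
pandigital.math <- function(x) {
    digits <- numeric(0)
    while (x > 0) {
        digits <- c(digits, x %% 10)  # Extract the last digit
        x <- x %/% 10                 # Drop the last digit
    }
    length(digits) == 9 && setequal(digits, 1:9)
}
pandigital.math(391867254)  # TRUE: the digits of 39, 186 and 7254 combined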

You can view the most recent version of this code on GitHub.

Analysing soil moisture data in NetCDF format with the ncdf4 library

The netCDF format is popular in sciences that analyse sequential spatial data. It is a self-describing, machine-independent data format for creating, accessing and sharing array-oriented information. The netCDF format is well suited to storing spatial time series such as meteorological or environmental data. This article shows how to visualise and analyse this data format by reviewing soil moisture data published by the Australian Bureau of Meteorology.

Soil Moisture data

The Australian Bureau of Meteorology publishes hydrological data in both a simple map grid and in the NetCDF format. The map grid consists of a flat text file that requires a bit of data jujitsu before it can be used. The NetCDF format is much easier to deploy as it provides a three-dimensional matrix of spatial data over time.

We are looking at the possible relationship between sewer main blockages and deep soil moisture levels. You will need to manually download this dataset from the Bureau of Meteorology website. I have not been able to scrape the website automatically. For this analysis, I use the actual deep soil moisture level, aggregated monthly in NetCDF 4 format.

Soil moisture data from the Australian Bureau of Meteorology in netCDF format

Reading, Extracting and Transforming the netCDF format

The ncdf4 library, developed by David W. Pierce, provides the necessary functionality to manage this data. The first step is to load the data, extract the relevant information and transform the data for visualisation and analysis. When the data is read, it essentially forms a complex list that contains the metadata and the measurements.

The ncvar_get function extracts the data from the list. The lon, lat and dates variables are the dimensions of the moisture data. The time data is stored as the number of days since 1 January 1900. The spatial coordinates are stored in decimal degrees with 0.05-degree intervals. The moisture data is a three-dimensional matrix with longitude, latitude and time as dimensions. Storing the data this way makes it easy to subset by location and date.

library(ncdf4)
# Load data
bom <- nc_open("Hydroinformatics/SoilMoisture/sd_pct_Actual_month.nc")
print(bom) # Inspect the data

# Extract data
lon <- ncvar_get(bom, "longitude")
lat <- ncvar_get(bom, "latitude")
dates <- as.Date("1900-01-01") + ncvar_get(bom, "time")
moisture <- ncvar_get(bom, "sd_pct")
dimnames(moisture) <- list(lon, lat, dates)

Visualising the data

The first step is to check the overall data. The code snippet below extracts a matrix from the data cube for 31 July 2017, converts it to a data frame and passes it to ggplot for visualisation. Although I use the Tidyverse, I still need reshape2 because the gather function does not like matrices.

library(tidyverse)
library(RColorBrewer)
library(reshape2)

d <- "2017-07-31"
m <- moisture[, , which(dates == d)] %>%
       melt(varnames = c("lon", "lat")) %>%
       subset(!is.na(value))

ggplot(m, aes(x = lon, y = lat, fill = value)) + borders("world") + 
    geom_tile() + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
    subtitle = format(as.Date(d), "%d %B %Y")) + 
    xlim(range(lon)) + ylim(range(lat)) + coord_fixed()

Deep soil moisture: Source Bureau of Meteorology, Australia

With the ggmap package, we can create a nice map of a local area. The coordinates are rounded to the nearest 0.05 degrees so that they align with the grid of the soil moisture data.

library(ggmap)
loc <- round(geocode("Bendigo") / 0.05) * 0.05 
map_tile <- get_map(loc, zoom = 12, color = "bw") %>% 
    ggmap()

map_tile + 
    geom_tile(data = m, aes(x = lon, y = lat, fill = value), alpha = 0.8) + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
        subtitle = format(as.Date(d), "%d %B %Y"))

Analysing the data

For my analysis, I am interested in the time series of moisture data for a specific point on the map. The previous code slices the data horizontally, at a single point in time. To create a time series, we can instead pierce through the data cube at a specific coordinate. The purpose of this time series is to investigate the relationship between sewer main blockages and deep soil moisture, which can be a topic for a future post.

mt <- data.frame(date = dates, 
                 dp = moisture[as.character(loc$lon), as.character(loc$lat), ])
ggplot(mt, aes(x = date, y = dp)) + geom_line() + 
    labs(x = "Month",
         y = "Moisture",
         title = "Total moisture in deep soil layer (100-500 cm)",
         subtitle = paste(as.character(loc), collapse = ", "))

The latest version of this code is available on my GitHub repository.

Deep soil moisture time series.

Pacific Island Hopping using R and iGraph

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean, visiting as many island nations as possible. The Pacific ocean is a massive body of water between Asia and the Americas, which covers almost half the surface of the earth. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

Pacific Island hopping

The Pacific Ocean.

Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publicly available data set of international flights, so, unfortunately, I created a list manually (if you do know of such a data set, then please leave a comment).

My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
f <- "Geography/airports.csv"
if (file.exists(f)) {
    airports <- read.csv(f)
    } else airports <- data.frame(airport = NA, lat = NA, lon = NA)

# Lookup coordinates for new airports
all_airports <- unique(c(flights$From, flights$To))
new_airports <- all_airports[!(all_airports %in% airports$airport)]
if (length(new_airports) != 0) {
    coords <- geocode(new_airports)
    new_airports <- data.frame(airport = new_airports, coords)
    airports <- rbind(airports, new_airports)
    airports <- subset(airports, !is.na(airport))
    write.csv(airports, "Geography/airports.csv", row.names = FALSE)
}

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="From", by.y="airport")
flights <- merge(flights, airports, by.x="To", by.y="airport")

Create the map

To create a map, I modified the code to create flight maps that I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antimeridian are positive, while longitudes east of it are negative.

The world2 map, available through the borders function of the ggplot2 package, is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

# Plot flight routes
worldmap <- borders("world2", colour="#efede1", fill="#efede1")
ggplot() + worldmap + 
    geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") + 
    geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), 
      col = "black", size = 2, segment.color = NA) + 
    geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, 
      yend = lat.y, col = Airline), size = .4, curvature = .2) + 
    theme(panel.background = element_rect(fill="white"), 
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()
          ) + 
    xlim(100, 300) + ylim(-40,40)

Pacific Island Hopping

This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_paths function can then be used to plan routes. If I wanted to travel from Auckland to Saipan in the Northern Mariana Islands, I would have to go through Port Vila, Honiara, Port Moresby, Chuuk and Guam, and then to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)
g <- graph_from_edgelist(as.matrix(flights[,1:2]), directed = FALSE)
par(mar = rep(0, 4))
plot(g, layout = layout.fruchterman.reingold, vertex.size=0)
V(g)
shortest_paths(g, "Auckland", "Saipan")

View the latest version of this code on GitHub.

Pacific Flight network

Digit fifth powers: Euler Problem 30

Euler problem 30 is another number crunching problem that deals with numbers to the power of five. Two other Euler problems dealt with raising numbers to a power. The previous problem looked at permutations of powers and problem 16 asks for the sum of the digits of 2^{1000}.

Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.

Euler Problem 30 Definition

Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4

8208 = 8^4 + 2^4 + 0^4 + 8^4

9474 = 9^4 + 4^4 + 7^4 + 4^4

As 1 = 1^4 is not a sum, it is not included.

The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.

Proposed Solution

The problem asks for a brute-force solution, but we first need an upper limit for the loop: how far do we need to go before we can be certain there are no further sums of fifth-power digits? The highest digit is 9 and 9^5 = 59049. The sum of the fifth powers of the digits of a six-digit number is at most 6 \times 9^5 = 354294, which itself has six digits, so this is a suitable endpoint for the loop. A seven-digit number can never qualify because 7 \times 9^5 = 413343 has only six digits. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.

largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
    power.sum <- 0
    i <- n
    while (i > 0) {
        d <- i %% 10
        i <- floor(i / 10)
        power.sum <- power.sum + d^5
    }
    if (power.sum == n) {
        print(n)
        answer <- answer + n
    }
}
print(answer)

View the most recent version of this code on GitHub.

Visualising Water Consumption using a Geographic Bubble Chart

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create a map of a sample of their water consumption data. In this post, I share this little ditty to explain how to plot a bubble chart over a map using the ggmap package.

Load and Explore the Data

The sample data contains a list of just over 100 readings from water meters in the city of Việt Trì in Vietnam, plus their geospatial location. This data uses the World Geodetic System of 1984 (WGS84), which is compatible with Google Maps and similar systems.

# Load the data
water <- read.csv("PhuTho/MeterReads.csv")
water$Consumption <- water$read_new - water$read_old

# Summarise the data
head(water)
summary(water$Consumption)

The consumption at each connection is between 0 and 529 cubic metres, with a mean consumption of 23.45 cubic metres.

Visualise the data with a geographic bubble chart

With the ggmap extension of the ggplot package, we can visualise any spatial data set on a map. The only condition is that the spatial coordinates are in the WGS84 datum. The ggmap package adds a geographical layer to ggplot by adding a Google Maps or Open Street Map canvas.

The first step is to download the map canvas. To do this, you need to know the centre coordinates and the zoom factor. Determining the perfect zoom factor requires some trial and error. The ggmap package provides various map types, which are described in detail in the documentation.

# Load map library
library(ggmap)

# Find the middle of the points
centre <- c(mean(range(water$lon)), mean(range(water$lat)))

# Download the satellite image
viettri <- get_map(centre, zoom = 17, maptype = "hybrid")
g <- ggmap(viettri)

The ggmap package follows the same conventions as ggplot. We first call the map layer and then add any required geom. The point geom creates a nice bubble chart when used in combination with the scale_size_area option. This option scales the points to a maximum size so that they are easily visible. The transparency (alpha) minimises problems with overplotting. This last code snippet plots the map with water consumption.

# Add the points
# Add the points
g + geom_point(data = water, aes(x = lon, y = lat, size = Consumption), 
        shape = 21, colour = "dodgerblue4", fill = "dodgerblue", alpha = .5) +
    scale_size_area(max_size = 20) + # Size of the biggest point
    ggtitle("Việt Trì sự tiêu thụ nước")

You can find the code and data for this article on my GitHub repository. With thanks to Ms Quy and Mr Tuyen of Phu Tho water for their permission to use this data.

This map visualises water consumption in the targeted area of Việt Trì. The larger the bubble, the larger the consumption. It is no surprise that two commercial customers used the most water. Ggplot automatically adds the legend for the consumption variable.

Geographic Bubble Chart: Visualising Water Consumption in Vietnam

Data Science for Water Utilities Using R

Data science comes naturally to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications I emphasise the importance of collaboration between utilities and that we should share code because we are all sharing the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so it is time to start another project. My new book will be about Data Science for Water Utilities Using R. This book is currently no more than a collection of existing articles, code snippets and production work from my job. The cover is finished because it motivates me to keep writing.

This article describes my proposed chapter structure with some example code snippets. The most recent version of this code can be found on my GitHub repository. Feel free to leave a comment at the bottom of this article if you would like to see additional problems discussed, or if you want to participate by sharing code.


Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at the data conference in Auckland where I spoke was the problem posed by a heavy reliance on spreadsheets. This chapter will explain why code is superior and how to use R to achieve this advantage.

Reservoirs

This first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from the Prettyboy Reservoir in the state of Maryland. The code below downloads and visualises this data.

# RESERVOIRS
library(tidyverse)
library(RColorBrewer)
library(gridExtra)

# Read data
if (!file.exists("Hydroinformatics/prettyboy.csv")) {
    url <- "http://www.mgs.md.gov/ReservoirDataPoints/PrettyBoy1998.dat"
    prettyboy <- read.csv(url, skip = 2, header = FALSE)
    names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
    write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")
} else prettyboy <- read_csv("Hydroinformatics/prettyboy.csv")
head(prettyboy)

# Remove extremes, duplicates and Anomaly
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)), 
         which(prettyboy$Easting == max(prettyboy$Easting)),
         which(duplicated(prettyboy)))
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3], 
                        brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) + 
    geom_point(size = .1) + coord_equal() + 
    scale_colour_gradientn(colors = bathymetry_colours) 

Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.
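
To illustrate the percentile issue, the snippet below compares the nine quantile algorithms available in base R on a small, hypothetical set of turbidity results. With so few samples, the choice of algorithm visibly changes the reported 95th percentile.

# Hypothetical turbidity sample results (NTU) for one water quality zone
turbidity <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.9, 1.2)
# The nine quantile algorithms in base R give different 95th percentiles
sapply(1:9, function(type) quantile(turbidity, 0.95, type = type))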

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.

Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.
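
As a minimal sketch of that idea, the snippet below defines a hypothetical network with one reservoir feeding two supply zones through a pump station and checks the water balance at that node. The vertex names and flows are invented for illustration.

# Hypothetical water network geometry defined as a directed graph
library(igraph)
network <- graph_from_literal(Reservoir -+ PumpStation,
                              PumpStation -+ ZoneA,
                              PumpStation -+ ZoneB)
E(network)$flow <- c(120, 70, 40)  # Assumed flows in ML/day, in edge order

# Water balance at the pump station: inflow minus outflow
inflow <- sum(incident(network, "PumpStation", mode = "in")$flow)
outflow <- sum(incident(network, "PumpStation", mode = "out")$flow)
inflow - outflow  # 10 ML/day unaccounted for in this example
plot(network, edge.label = E(network)$flow)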

SCADA Data

Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.

Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.

# CUSTOMERS
library(psych)

# Read data
customers <- read_csv("Hydroinformatics/customers.csv")

# Exploratory Analysis
p1 <- customers[,1:10] %>% 
    gather %>% 
    ggplot(aes(x = key, y = value)) + 
        geom_boxplot() + 
        xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")

# Factor analysis
fap <- fa.parallel(customers[,1:10])
grid.arrange(p1, ncol = 2)
customers[,1:10] %>%
    fa(nfactors = fap$nfact, rotate = "promax") %>% 
    fa.diagram(main = "Factor Analysis")

Personal Involvement Index - Data Science for Water Utilities

Customer Complaints

Customer complaints are a gift to the business. Unfortunately, most businesses view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers.

Customer Contacts

One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres.
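
As a taste of that topic, the function below is a minimal sketch of the Erlang-C formula: the probability that a caller has to wait, given the number of agents and the offered traffic. The call volume, handle time and number of agents in the example are hypothetical.

# Minimal Erlang-C sketch: probability that a caller has to wait
# (assumes more agents than Erlangs of offered traffic)
erlang_c <- function(agents, arrival_rate, handle_time) {
    a <- arrival_rate * handle_time  # Offered traffic in Erlangs
    top <- (a^agents / factorial(agents)) * (agents / (agents - a))
    top / (sum(a^(0:(agents - 1)) / factorial(0:(agents - 1))) + top)
}

# Hypothetical contact centre: 100 calls per hour, 6-minute handle time, 14 agents
a <- (100 / 60) * 6                # Offered traffic: 10 Erlangs
p_wait <- erlang_c(14, 100 / 60, 6)
1 - p_wait * exp(-(14 - a) * (20 / 60) / 6)  # Service level for a 20-second target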

Economics

Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability.
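
The snippet below is a minimal sketch of that approach: the cost of each element of a hypothetical project is drawn from an assumed probability distribution and the simulated totals provide P50 and P90 estimates. The cost items and distributions are invented for illustration.

# Monte Carlo cost estimate for a hypothetical pipeline project
set.seed(1234)
n <- 10000
pipes <- runif(n, min = 800000, max = 1200000)                # Uniform uncertainty
pumps <- rnorm(n, mean = 500000, sd = 50000)                  # Normal uncertainty
restoration <- rlnorm(n, meanlog = log(200000), sdlog = 0.2)  # Skewed uncertainty
total <- pipes + pumps + restoration

quantile(total, c(0.5, 0.9))  # P50 and P90 estimates
hist(total / 1e6, breaks = 50, main = "Simulated project cost", xlab = "Total cost ($M)")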

Data Science for Water Utilities Mind Map

This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topics.

Data Science for Water Utilities: The next steps

I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown system in R-Studio. It will take me about a year to realise this vision as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions:

  1. What is missing in this list? Any practical problems I should include?
  2. Would you like to donate some data and code to include in the book?

Feel free to leave a comment below.

Euler Problem 29: Distinct Powers

Euler Problem 29 is another combinatorial problem that is quite easy to solve using brute force. The MathBlog site by Kristian Edlund has a nice solution using only pen and paper.

Raising number to a power can have interesting results. The video below explains why this pandigital formula approximates e to billions of decimals:

(1 + 9^{-4^{6 \times 7}})^{3^{2^{85}}} \approx e

Euler Problem 29 Definition

Consider all integer combinations of a^b for 2 \leq a \leq 5 and 2 \leq b \leq 5 :

2^2=4, \quad 2^3 = 8,\quad 2^4 = 16,\quad 2^5 = 32

3^2 = 9,\quad 3^3 = 27,\quad 3^4 = 81,\quad 3^5 = 243

4^2 = 16,\quad 4^3 = 64,\quad 4^4 = 256, \quad 4^5 = 1024

5^2 = 25,\quad 5^3 = 125,\quad 5^4 = 625,\quad 5^5 = 3125

If they are then placed in numerical order, with any repeats removed, we get the following sequence of 15 distinct terms:

4, \ 8, \ 9, \ 16, \ 25, \ 27, \ 32, \ 64, \ 81, \ 125, \ 243, \ 256,\ 625, \ 1024, \ 3125

How many distinct terms are in the sequence generated by a^b for 2 \leq a \leq 100 and 2 \leq b \leq 100 ?

Brute Force Solution

This code simply calculates all powers from 2^2 up to 100^{100} and determines the number of unique values. Since we are only interested in their uniqueness and not the precise value, there is no need to use Multiple Precision Arithmetic.

# Initialisation
target <- 100
terms <- vector()
i <- 1
# Loop through values of a and b and store powers in vector
for (a in 2:target) {
   for (b in 2:target) {
     terms[i] <- a^b
     i <- i + 1
   }
}
# Determine the number of distinct powers
answer <- length(unique(terms))
print(answer)

View the latest version of this code on GitHub.

How Virtual Tags have transformed SCADA data analysis

Yesterday, I delivered the International Keynote at the Asset Data & Insights Conference in Auckland, New Zealand (the place where R was originally developed). My talk was about how to create value from SCADA data, using a method I developed with my colleagues called Virtual Tags. My talk started with my views on data science strategy, which I also presented to the R User Group in Melbourne. In this post, I would like to explain what Virtual Tags are and how they can be used to improve the value of SCADA data.

Asset Data & Insights Conference

SCADA Systems at Water Treatment Plants

Water treatment plants are mostly fully automated systems that use analysers and the SCADA system to communicate the measurement data. For those of you not familiar with water treatment plants, the video below gives a cute summary of the process.

Water treatment plants need sensors to measure a broad range of parameters. These instruments record data 24 hours per day to control operations. When the process operates effectively, all values fall within a very narrow band. The SCADA system typically stores these values for a year, after which they are destroyed to save storage space.

Water treatment plants measure turbidity (the clarity of the water) to assess the effectiveness of filtration. The code snippet below simulates the measurements from a turbidity instrument at a water treatment plant over a period of five hours. Most water quality data has a log-normal distribution with a narrow standard deviation.

# Simulate measured data
set.seed(1234)
n <- 300
wtp <- data.frame(DateTime = seq.POSIXt(ISOdate(1910, 1, 1), length.out = n, by = 60),
                  WTP = rlnorm(n, log(.1), .01))
library(ggplot2)
p <- ggplot(wtp, aes(x = DateTime, y = WTP)) + geom_line(colour = "grey") + 
    ylim(0.09, 0.11) + ylab("Turbidity") + ggtitle("Turbidity simulation")
p

SCADA data simulation

SCADA Historian

The data generated by the SCADA system is used to make operational decisions. The data is created and structured to make decisions in the present, not to solve problems in the future. SCADA Historian systems archive this information for future analysis. Historian systems only store a new value when the reading deviates from the previously stored value by more than a set percentage. This method saves storage space without sacrificing much accuracy.

For example, when an instrument reads 0.20 and the limit is set at 5%, new values are only recorded when they are below 0.19 or above 0.21. Any further values are stored when they deviate 5% from the new value, and so on. The code snippet below simulates this behaviour, based on the simulated turbidity readings generated earlier. This Historian only stores the data points marked in black.

# Historise data
threshold <- 0.03
h <- 1 # First historised point
# Starting conditions
wtp$historise <- FALSE
wtp$historise[c(1, n)] <- TRUE
# Testing for delta <> threshold
for (i in 2:nrow(wtp)) {
    delta <- wtp$WTP[i] / wtp$WTP[h]
    if (delta > (1 + threshold) | delta < (1 - threshold)) {
        wtp$historise[i] <- TRUE
        h <- i
    }
}
historian <- subset(wtp, historise == TRUE)
historian$Source <- "Historian"
p <- p + geom_point(data = historian, aes(x = DateTime, y = WTP)) + ggtitle("Historised data")
p

SCADA historian simulation

Virtual Tags

This standard method to generate and store SCADA data works fine for operating the system but does not work so well when using the data for post-hoc analysis. The data in Historian is an unequally-spaced time series, which makes it harder to analyse. The Virtual Tag approach expands these unequal time series to an equally-spaced one, using constant interpolation.

The vt function undertakes the constant interpolation using the approx function. This function is applied to all the DateTime values, using the historised data points. The red line shows how the value remains constant until it deviates by more than the threshold. This example demonstrates that we have a steady process with some minor spikes, which is the expected outcome of this simulation.

# Create Virtual Tags
vt <- function(t) approx(historian$DateTime, historian$WTP, xout = t, method = "constant")
turbidity <- lapply(as.data.frame(wtp$DateTime), vt)
wtp$VirtualTag <- turbidity[[1]]$y
p + geom_line(data = wtp, aes(x = DateTime, y = VirtualTag), colour = "red") + ggtitle("Virtual Tags")

Example of Virtual Tags for SCADA data.

You can find the latest version of this code on GitHub.

The next step in Virtual Tags is to combine the tags from different data points. For example, we are only interested in the turbidity readings when the filter was running. We can do this by combining the turbidity data with the valve status or the flow through the filter.
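
The snippet below sketches this idea by adding a hypothetical filter flow signal to the simulated data and keeping only the turbidity readings recorded while the filter was running. The flow value and the one-hour shutdown are assumptions for illustration only.

# Hypothetical filter flow Virtual Tag: the filter is offline for the first hour
wtp$FilterFlow <- ifelse(wtp$DateTime < wtp$DateTime[61], 0, 70)
# Only analyse turbidity readings while the filter was running
running <- subset(wtp, FilterFlow > 0)
summary(running$VirtualTag)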

This approach might seem cumbersome, but it simplifies analysing data from Historian. Virtual Tags enable several analytical processes that would otherwise be hard to undertake. This system also adds context to the SCADA information by linking tags to each other and the processes they describe. If you are interested in more detail, then please download the technical manual for Virtual Tags and how they are implemented in SQL.

The Presentation

And finally, these are the slides of my keynote presentation.

The Ulam Spiral: Euler Problem 28

Euler Problem 28 takes us to the world of the Ulam Spiral, which arranges sequential positive integers in a square spiral and marks the prime numbers. Stanislaw Ulam discovered that many primes are located along the diagonals of this spiral. These diagonals can be described by polynomials, so the Ulam Spiral is a way of generating quadratic primes (Euler Problem 27).

Euler Problem 28: The Ulam Spiral

Ulam Spiral (WikiMedia).

Euler Problem 28 Definition

Starting with the number 1 and moving to the right in a clockwise direction a 5 by 5 spiral is formed as follows:

21 22 23 24 25
20 07 08 09 10
19 06 01 02 11
18 05 04 03 12
17 16 15 14 13

It can be verified that the sum of the numbers on the diagonals is 101. What is the sum of the numbers on the diagonals in a 1001 by 1001 spiral formed in the same way?

Proposed Solution

To solve this problem, we do not need to create a matrix. This code calculates the values of the corners of a matrix of size n. The lowest corner value of the ring of size n is n(n-3)+3 and the corner values increase by n-1.

The code steps through all odd-sized matrices from size 3 to 1001, because only matrices with an uneven size have a centre. The answer to the problem is the sum of all corner values plus the central 1, which is the starting value of the answer variable.

size <- 1001 # Size of matrix
answer <- 1 # Starting number
# Define corners of subsequent matrices
for (n in seq(from = 3, to = size, by = 2)) {
    corners <- seq(from = n * (n - 3) + 3, by = n - 1, length.out = 4)
    answer <- answer + sum(corners)
}
print(answer)

Plotting the Ulam Spiral

We can go beyond Euler Problem 28 and play with the mathematics. This code snippet plots all the prime numbers in the Ulam Spiral. Watch the video for an explanation of the patterns that appear along the diagonals.

Ulam Spiral prime numbers.

The code creates a matrix of the required size and fills it with the Ulam Spiral. The code then identifies all primes using the is.prime function from Euler Problem 7. A heat map visualises the results.
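
The is.prime function from Euler Problem 7 is not repeated in this post. A minimal trial-division version, sufficient to reproduce the plot, could look like this:

# Minimal trial-division primality test (stand-in for the Euler Problem 7 version)
is.prime <- function(n) {
    if (is.na(n) || n < 2) return(FALSE)
    if (n == 2) return(TRUE)
    if (n %% 2 == 0) return(FALSE)
    if (n < 9) return(TRUE)
    all(n %% seq(3, floor(sqrt(n)), 2) != 0)
}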

# Ulam Spiral
size <- 201 # Size of matrix
ulam <- matrix(ncol = size, nrow = size)
mid <- floor(size / 2 + 1)
ulam[mid, mid] <- 1
for (n in seq(from = 3, to = size, by = 2)) {
    numbers <- (n * (n - 4) + 5) : ((n + 2) * ((n + 2) - 4) + 4)
    d <- mid - floor(n / 2)
    l <- length(numbers)
    ulam[d, d:(d + n - 1)] <- numbers[(l - n + 1):l]
    ulam[d + n - 1, (d + n - 1):d] <- numbers[(n - 1):(n - 2 + n)]
    ulam[(d + 1):(d + n - 2), d] <- numbers[(l - n):(l - 2 * n + 3)]
    ulam[(d + 1):(d + n - 2), d + n - 1] <- numbers[1:(n - 2)]
}
ulam.primes <- apply(ulam, c(1, 2), is.prime)

# Visualise
library(ggplot2)
library(reshape2)
ulam.primes <- melt(ulam.primes)
ggplot(ulam.primes, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + 
    scale_fill_manual(values = c("white", "black")) + 
    guides(fill=FALSE) +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank()) + 
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank()
          )

View the latest version of this code on GitHub.