Visualising Water Consumption using a Geographic Bubble Chart

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create a map of a sample of their water consumption data.

In this post, I share this little ditty to explain how to plot a bubble chart over a map using the ggmap package.

Load and Explore the Data

The sample data contains a list of just over 100 readings from water meters in the city of Việt Trì in Vietnam, plus their geospatial location. This data uses the World Geodetic System of 1984 (WGS84), which is compatible with Google Maps and similar systems.

# Load the data
water <- read.csv("PhuTho/MeterReads.csv")
water$Consumption <- water$read_new - water$read_old

# Summarise the data
head(water)
summary(water$Consumption)

The consumption at each connection is between 0 and 529 cubic metres, with a mean consumption of 23.45 cubic metres.

Visualise the data with a geographic bubble chart

With the ggmap extension of the ggplot2 package, we can visualise any spatial data set on a map. The only condition is that the spatial coordinates use the WGS84 datum. The ggmap package adds a geographical canvas from Google Maps or OpenStreetMap as a background layer to ggplot2.
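
The sample data is already in WGS84, but coordinates in a projected system would first need to be transformed. A minimal sketch with the sf package; the column names easting and northing and the EPSG code 28348 are only placeholders, not part of the sample data.

# Hypothetical conversion of projected coordinates to WGS84 (sketch only)
library(sf)
projected <- st_as_sf(water, coords = c("easting", "northing"), crs = 28348)
wgs84 <- st_transform(projected, crs = 4326)
water[, c("lon", "lat")] <- st_coordinates(wgs84)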

The first step is to download the map canvas. To do this, you need to know the centre coordinates and the zoom factor. Determining the ideal zoom factor requires some trial and error. The ggmap package provides various map types, which are described in detail in the documentation.

# Load map library
library(ggmap)

# Find the middle of the points
centre <- c(mean(range(water$lon)), mean(range(water$lat)))

# Download the satellite image
viettri <- get_map(centre, zoom = 17, maptype = "hybrid")
g <- ggmap(viettri)

The ggmap package follows the same conventions as ggplot2: we first call the map layer and then add any required geoms. The point geom creates a nice bubble chart when used in combination with the scale_size_area option. This option scales the points to a maximum size so that they are easily visible. The transparency (alpha) minimises problems with overplotting. This last code snippet plots the map with the water consumption.

# Add the points
g + geom_point(data = water, aes(x = lon, y = lat, size = Consumption), 
        shape = 21, colour = "dodgerblue4", fill = "dodgerblue", alpha = .5) +
    scale_size_area(max_size = 20) + # Size of the biggest point
    ggtitle("Việt Trì sự tiêu thụ nước") # "Việt Trì water consumption"

You can find the code and data for this article on my GitHub repository. With thanks to Ms Quy and Mr Tuyen of Phu Tho water for their permission to use this data.

This map visualises water consumption in the targeted area of Việt Trì. The larger the bubble, the larger the consumption. It is no surprise that two commercial customers used the most water. The ggplot2 package automatically adds the legend for the consumption variable.

Geographic Bubble Chart: Visualising Water Consumption in Vietnam

Data Science for Water Utilities Using R

Data science comes naturally to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications, I emphasise the importance of collaboration between utilities and of sharing code, because we all face the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so it is time to start another project. My new book will be about Data Science for Water Utilities Using R. This book is currently no more than a collection of existing articles, code snippets and production work from my job. The cover is already finished because it motivates me to keep writing.

This article describes my proposed chapter structure with some example code snippets. The most recent version of this code can be found on my GitHub repository. Feel free to leave a comment at the bottom of this article if you would like to see additional problems discussed, or if you want to participate by sharing code.

Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at the data conference in Auckland where I spoke was the problem posed by a heavy reliance on spreadsheets. This chapter will explain why code is superior to spreadsheets and how to use R to gain this advantage.

Reservoirs

This first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from the Prettyboy Reservoir in the state of Maryland. The code below downloads and visualises this data.

# RESERVOIRS
library(tidyverse)
library(RColorBrewer)
library(gridExtra)

# Read data
if (!file.exists("Hydroinformatics/prettyboy.csv")) {
    url <- "http://www.mgs.md.gov/ReservoirDataPoints/PrettyBoy1998.dat"
    prettyboy <- read.csv(url, skip = 2, header = FALSE)
    names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
    write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")
} else prettyboy <- read_csv("Hydroinformatics/prettyboy.csv")
head(prettyboy)

# Remove extremes, duplicates and Anomaly
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)), 
         which(prettyboy$Easting == max(prettyboy$Easting)),
         which(duplicated(prettyboy)))
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3], 
                        brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) + 
    geom_point(size = .1) + coord_equal() + 
    scale_colour_gradientn(colors = bathymetry_colours) 

Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.
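
As an aside to the percentile discussion, base R implements nine different quantile algorithms, and the choice of algorithm matters when only a few samples are available. A quick sketch with made-up test results, not actual laboratory data:

# Compare three of the nine quantile algorithms on a small, invented sample
ecoli <- c(1, 1, 2, 2, 3, 4, 4, 5, 8, 12)
sapply(c(5, 6, 7), function(t) quantile(ecoli, probs = 0.95, type = t))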

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.

Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.
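
As a first impression of the idea, the sketch below defines a trivial water network as a directed igraph object; the node names are invented for illustration and not part of the actual model.

# Define a toy water network as a directed graph
library(igraph)
network <- graph_from_literal(Plant -+ Tank, Tank -+ Zone1, Tank -+ Zone2)
plot(network)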

SCADA Data

Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.

Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.

# CUSTOMERS
library(psych)

# Read data
customers <- read_csv("Hydroinformatics/customers.csv")

# Exploratory Analysis
p1 <- customers[,1:10] %>% 
    gather %>% 
    ggplot(aes(x = key, y = value)) + 
        geom_boxplot() + 
        xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")

# Factor analysis
fap <- fa.parallel(customers[,1:10])
grid.arrange(p1, ncol = 2)
customers[,1:10] %>%
    fa(nfactors = fap$nfact, rotate = "promax") %>% 
    fa.diagram(main = "Factor Analysis")

Personal Involvement Index - Data Science for Water Utilities

Customer Complaints

Customer complaints are a gift to the business. Unfortunately, most businesses view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers.

Customer Contacts

One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres.
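
As a taste of that topic, here is a minimal sketch of the Erlang-C formula, expressed through the Erlang-B blocking probability. This is my own illustration, not code from the book.

# Probability that a caller has to wait, given the number of agents and
# the offered traffic in Erlangs (arrival rate x average handling time)
erlang_c <- function(agents, intensity) {
    erlang_b <- (intensity^agents / factorial(agents)) /
        sum(intensity^(0:agents) / factorial(0:agents))
    erlang_b / (1 - (intensity / agents) * (1 - erlang_b))
}
erlang_c(14, 10) # e.g. 14 agents handling 10 Erlangs of traffic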

Economics

Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability.
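
A minimal Monte Carlo sketch of the idea: simulate each cost item from an assumed distribution and report the P50 and P90 of the total. The figures and distributions are invented for illustration.

# Simulate a simple probabilistic cost estimate
set.seed(42)
n <- 1e4
civil <- rlnorm(n, meanlog = log(5e5), sdlog = 0.2) # civil works
mech  <- rlnorm(n, meanlog = log(2e5), sdlog = 0.3) # mechanical
elec  <- runif(n, 8e4, 1.2e5)                       # electrical
total <- civil + mech + elec
quantile(total, c(0.5, 0.9)) # P50 and P90 estimates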

Data Science for Water Utilities Mind Map

This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topics.

Data Science for Water Utilities: The next steps

I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown system in RStudio. It will take me about a year to realise this vision, as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions:

  1. What is missing in this list? Any practical problems I should include?
  2. Would you like to donate some data and code to include in the book?

Feel free to leave a comment below.

Euler Problem 29: Distinct Powers

Euler Problem 29 is another combinatorics problem that is quite easy to solve using brute force. The MathBlog site by Kristian Edlund has a nice solution using only pen and paper.

Raising a number to a power can have interesting results. The video below explains why this pandigital formula approximates e to billions of decimals:

(1 + 9^{-4^{6 \times 7}})^{3^{2^{85}}} \approx e

Euler Problem 29 Definition

Consider all integer combinations of a^b for 2 \leq a \leq 5 and 2 \leq b \leq 5 .

2^2=4, \quad 2^3 = 8,\quad 2^4 = 16,\quad 2^5 = 32

3^2 = 9,\quad 3^3 = 27,\quad 3^4 = 81,\quad 3^5 = 243

4^2 = 16,\quad 4^3 = 64,\quad 4^4 = 256, \quad 4^5 = 1024

5^2 = 25,\quad 5^3 = 125,\quad 5^4 = 625,\quad 5^5 = 3125

If they are then placed in numerical order, with any repeats removed, we get the following sequence of 15 distinct terms:

4, \ 8, \ 9, \ 16, \ 25, \ 27, \ 32, \ 64, \ 81, \ 125, \ 243, \ 256,\ 625, \ 1024, \ 3125

How many distinct terms are in the sequence generated by a^b for 2 \leq a \leq 100 and 2 \leq b \leq 100 ?

Brute Force Solution

This code simply calculates all powers from 2^2 to 100^{100} and determines the number of unique values. Since we are only interested in their uniqueness and not their precise value, there is no need to use Multiple Precision Arithmetic.

# Initialisation
target <- 100
terms <- vector()
i <- 1
# Loop through values of a and b and store powers in vector
for (a in 2:target) {
   for (b in 2:target) {
     terms[i] <- a^b
     i <- i + 1
   }
}
# Determine the number of distinct powers
answer <- length(unique(terms))
print(answer)

View the latest version of this code on GitHub.

How Virtual Tags have transformed SCADA data analysis

Yesterday, I delivered the International Keynote at the Asset Data & Insights Conference in Auckland, New Zealand (the place where R was originally developed). My talk was about how to create value from SCADA data, using a method I developed with my colleagues called Virtual Tags. My talk started with my views on data science strategy, which I also presented to the R User Group in Melbourne. In this post, I would like to explain what Virtual Tags are and how they can be used to improve the value of SCADA data.

Asset Data & Insights Conference

SCADA Systems at Water Treatment Plants

Water treatment plants are mostly fully automated systems, using analysers and a SCADA system to communicate the measured data. For those of you not familiar with water treatment plants, the video below gives a cute summary of the process.

Water treatment plants need sensors to measure a broad range of parameters. These instruments record data 24 hours per day to control operations. When the process operates effectively, all values fall within a very narrow band. All these values are stored by the SCADA system for typically a year, after which they are destroyed to save storage space.

Water treatment plants measure turbidity (clarity of the water) to assess the effectiveness of filtration. The code snippet below simulates the measurements from a turbidity instrument at a water treatment plant over a period of five hours. Most water quality data has a log-normal distribution with a narrow standard deviation.

# Simulate measured data
set.seed(1234)
n <- 300
wtp <- data.frame(DateTime = seq.POSIXt(ISOdate(1910, 1, 1), length.out = n, by = 60),
                  WTP = rlnorm(n, log(.1), .01))
library(ggplot2)
p <- ggplot(wtp, aes(x = DateTime, y = WTP)) + geom_line(colour = "grey") + 
    ylim(0.09, 0.11) + ylab("Turbidity") + ggtitle("Turbidity simulation")
p

SCADA data simulation

SCADA Historian

The data generated by the SCADA system is used to make operational decisions. It is created and structured to support decisions in the present, not to solve problems in the future. SCADA Historian systems archive this information for later analysis. A Historian only stores a new value when the reading deviates from the previously stored value by more than a set percentage. This method saves storage space without sacrificing much accuracy.

For example, when an instrument reads 0.20 and the limit is set at 5%, new values are only recorded when they are below 0.19 or above 0.21. Any further values are stored when they deviate 5% from the new value, and so on. The code snippet below simulates this behaviour, based on the simulated turbidity readings generated earlier. This Historian only stores the data points marked in black.

# Historise data
threshold <- 0.03
h <- 1 # First historised point
# Starting conditions
wtp$historise <- FALSE
wtp$historise[c(1, n)] <- TRUE
# Testing for delta <> threshold
for (i in 2:nrow(wtp)) {
    delta <- wtp$WTP[i] / wtp$WTP[h]
    if (delta > (1 + threshold) | delta < (1 - threshold)) {
        wtp$historise[i] <- TRUE
        h <- i
    }
}
historian <- subset(wtp, historise == TRUE)
historian$Source <- "Historian"
p <- p + geom_point(data = historian, aes(x = DateTime, y = WTP)) + ggtitle("Historised data")
p

SCADA historian simulation

Virtual Tags

This standard method of generating and storing SCADA data works fine for operating systems, but not so well when using the data for post-hoc analysis. The data in the Historian forms an unequally-spaced time series, which makes it harder to analyse. The Virtual Tag approach expands this unequal time series to an equally-spaced one using constant interpolation.

The vt function undertakes the constant interpolation using the approx function and is applied to all the DateTime values, using the historised data points. The red line shows how the value remains constant until it deviates by more than the threshold. This example demonstrates that we have a steady process with some minor spikes, which is the expected outcome of this simulation.

# Create Virtual Tags
vt <- function(t) approx(historian$DateTime, historian$WTP, xout = t, method = "constant")
turbidity <- lapply(as.data.frame(wtp$DateTime), vt)
wtp$VirtualTag <- turbidity[[1]]$y
p + geom_line(data = wtp, aes(x = DateTime, y = VirtualTag), colour = "red") + ggtitle("Virtual Tags")

Example of Virtual Tags for SCADA

You can find the latest version of this code on GitHub.

The next step in Virtual Tags is to combine the tags from different data points. For example, we are only interested in the turbidity readings when the filter was running. We can do this by combining this data with a valve status or flow in the filter.
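
A possible sketch of this combination, assuming a second historised table flow_historian that holds the filter flow; the table and column names are hypothetical.

# Interpolate the filter flow in the same way as the turbidity
vt_flow <- function(t) approx(flow_historian$DateTime, flow_historian$Flow,
                              xout = t, method = "constant")
wtp$Flow <- vt_flow(wtp$DateTime)$y
# Keep only the turbidity readings taken while the filter was running
running <- subset(wtp, Flow > 0)
summary(running$VirtualTag)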

This approach might seem cumbersome, but it simplifies analysing data from the Historian. Virtual Tags enable several analytical processes that would otherwise be hard to undertake. This system also adds context to the SCADA information by linking tags to each other and to the processes they describe. If you are interested in more detail, then please download the technical manual for Virtual Tags and how they are implemented in SQL.

The Presentation

And finally, these are the slides of my keynote presentation.

The Ulam Spiral: Euler Problem 28

Euler Problem 28 takes us to the world of the Ulam Spiral, which arranges sequential positive integers in a square spiral and marks the prime numbers. Stanislaw Ulam discovered that many primes are located along the diagonals of this spiral. These diagonals can be described by polynomials, so the Ulam Spiral is also a way of generating quadratic primes (Euler Problem 27).

Ulam Spiral (WikiMedia).

Euler Problem 28 Definition

Starting with the number 1 and moving to the right in a clockwise direction a 5 by 5 spiral is formed as follows:

21 22 23 24 25
20 07 08 09 10
19 06 01 02 11
18 05 04 03 12
17 16 15 14 13

It can be verified that the sum of the numbers on the diagonals is 101. What is the sum of the numbers on the diagonals in a 1001 by 1001 spiral formed in the same way?

Proposed Solution

To solve this problem, we do not need to create a matrix. This code calculates the values of the four corners of each ring of size n. The first corner of the ring with size n is n(n-3)+3, and the following corners increase by n-1.

The code steps through all rings from size 3 to 1001. The solution uses only the odd-sized rings because these have a single centre. The answer to the problem is the sum of all corner values plus the central 1.

size <- 1001 # Size of matrix
answer <- 1 # Starting number
# Define corners of subsequent matrices
for (n in seq(from = 3, to = size, by = 2)) {
    corners <- seq(from = n * (n - 3) + 3, by = n - 1, length.out = 4)
    answer <- answer + sum(corners)
}
print(answer)

Plotting the Ulam Spiral

We can go beyond Euler Problem 28 and play with the mathematics. This code snippet plots all the prime numbers in the Ulam Spiral. Watch the video for an explanation of the patterns that appear along the diagonals.

Ulam Spiral prime numbers.

The code creates a matrix of the required size and fills it with the Ulam Spiral. The code then identifies all primes using the is.prime function from Euler Problem 7. A heat map visualises the results.
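
The is.prime function is not repeated in this post; below is a minimal sketch based on trial division, which is an assumption rather than necessarily the implementation from Euler Problem 7.

# Simple primality test by trial division
is.prime <- function(n) {
    if (n < 2) return(FALSE)
    if (n < 4) return(TRUE)        # 2 and 3 are prime
    if (n %% 2 == 0) return(FALSE)
    if (n < 9) return(TRUE)        # 5 and 7 are prime
    all(n %% seq(3, floor(sqrt(n)), 2) != 0)
}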

# Ulam Spiral
size <- 201 # Size of matrix
ulam <- matrix(ncol = size, nrow = size)
mid <- floor(size / 2 + 1)
ulam[mid, mid] <- 1
for (n in seq(from = 3, to = size, by = 2)) {
    numbers <- (n * (n - 4) + 5) : ((n + 2) * ((n + 2) - 4) + 4)
    d <- mid - floor(n / 2)
    l <- length(numbers)
    ulam[d, d:(d + n - 1)] <- numbers[(l - n + 1):l]
    ulam[d + n - 1, (d + n - 1):d] <- numbers[(n - 1):(n - 2 + n)]
    ulam[(d + 1):(d + n - 2), d] <- numbers[(l - n):(l - 2 * n + 3)]
    ulam[(d + 1):(d + n - 2), d + n - 1] <- numbers[1:(n - 2)]
}
ulam.primes <- apply(ulam, c(1, 2), is.prime)

# Visualise
library(ggplot2)
library(reshape2)
ulam.primes <- melt(ulam.primes)
ggplot(ulam.primes, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + 
    scale_fill_manual(values = c("white", "black")) + 
    guides(fill=FALSE) +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank()) + 
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank()
          )

View the latest version of this code on GitHub.

Using the igraph package to Analyse the Enron Corpus

The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting in its own right, but in this post I am more interested in one of the data sets that the subsequent investigation provided.

This blog post analyses an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the igraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump.

The Enron Corpus

As part of its inquiries, the Federal Energy Regulatory Commission collected an extensive set of emails from Enron employees. The Enron corpus is one of the few collections of emails that is publicly available for research. This dataset also provides a fascinating playground for citizen data scientists.

The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content.

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001.

The Enron Corpus is a large database of about half a million emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423 MB, tarred and gzipped) and untars it to your working directory.

# Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz")
untar("enron_mail_20150507.tgz")

Preparing the Data

The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users.

# E-mail corpus consists of nested folders per user with e-mails as text files
# Create list of all available e-mails
emails <- list.files("maildir/", full.names = T, recursive = T)
length(emails)
# Filter by inbox only
emails <- emails[grep("/inbox", emails)]
length(emails)

The bulk of the code creates a list of emails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored; the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus. Only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each email. The data frame contains 2779 emails that meet the criteria.

# Create list of sender and receiver (inbox owner)
inboxes <- data.frame(
    from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}),
    to = emails,
    stringsAsFactors = F
)

# Keep only enron.com and strip all but username
library(stringr) # String manipulation
inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10)
to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3)

# Create list of usernames
users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA
for (i in 1:nrow(users)) {
    sentmail <- dir(paste0(users$user[i], "/sent_items/"))
    name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3]
    name <- str_sub(name, 7, nchar(name) - 10)
    users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, by.x="to", users, by.y="user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove no.address
inboxes <- subset(inboxes, from != "no.address" & to != "no.address")

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)

Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The igraph package is a powerful tool for analysing networks. The graph_from_edgelist function creates a network object from the sender-receiver pairs. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spinglass algorithm finds the community structure within the data. A community is a group of nodes that are more densely connected with each other than with the rest of the network.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never email each other, lonely hangers-on and tightly knit cliques within Enron. In the centre of the graph, we see a few individuals who act as connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only once or twice emailed somebody in the corpus but were still included in the investigation.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g, 
     vertex.label=NA, 
     layout = layout.fruchterman.reingold,
     vertex.size = 3
)

View the most recent version of the code on GitHub.

This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!
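
As one possible next step, not part of the original analysis, standard igraph centrality measures can rank the connectors mentioned above.

# Rank employees by how central they are in the e-mail network
centrality <- data.frame(user        = V(g)$name,
                         degree      = degree(g),
                         betweenness = betweenness(g))
head(centrality[order(-centrality$betweenness), ])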

Enron corpus network with communities.

 

Generating Quadratic Primes: Euler Problem 27

Prime numbers are the engine of the Internet economy. One of the reasons prime numbers are so useful is that there is no known formula that generates all of them. This has not stopped mathematicians from trying to find formulas that generate prime numbers.

Euler problem 27 deals with quadratic formulas that can be used to generate sets of prime numbers. We have already discussed this in the post about the Ulam Spiral. This Numberphile video discusses quadratic primes.

Euler Problem 27 Definition

Euler discovered the remarkable quadratic formula:

n^2+n+41

It turns out that the formula will produce 40 primes for the consecutive integer values 0 \leq n \leq 39. However, when n=40, 40^2+40+41=40(40+1)+41 is divisible by 41 , and certainly when n=41, 41^2+41+41 is clearly divisible by 41.

The incredible formula n^2-79n+1601 was discovered, which produces 80 primes for the consecutive values 0 \leq n \leq 79 . The product of the coefficients, -79 and 1601, is -126479.

Considering quadratics of the form n^2+an+b ,

where |a| < 1000 and |b| \leq 1000 , where |n| is the modulus/absolute value of n , e.g. |11|=11 and |-4|=4.

Find the product of the coefficients, a and b , for the quadratic expression that produces the maximum number of primes for consecutive values of n, starting with n=0 .

Proposed Solution

The only way to solve this problem is through brute force, reducing the solution space to optimise for speed (source: mathblog.dk). Because the outcome of the equation must be prime for n = 0, b itself has to be prime. We can use the prime sieve from Euler Problem 3, which reduces the options for b from 2000 to 168. When we insert n = 1, the result 1 + a + b must also be prime; because b is an odd prime (except when b = 2), a can only be an odd number. When b = 2, a would have to be even, but that case cannot produce a long sequence of primes.
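
The code below relies on the esieve function from Euler Problem 3 and the is.prime function from Euler Problem 7 (sketched in the Ulam Spiral section above), neither of which is repeated in this post. A minimal sketch of the sieve, assuming standard behaviour rather than the original implementation:

# Sieve of Eratosthenes: return all primes up to n
esieve <- function(n) {
    candidates <- 2:n
    primes <- numeric(0)
    while (length(candidates) > 0) {
        primes <- c(primes, candidates[1])
        candidates <- candidates[candidates %% candidates[1] != 0]
    }
    primes
}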

Euler Problem 27 code

seq.a <- seq(-999, 999, 2) # a has to be odd and |a| < 1000
seq.b <- esieve(1000)      # b has to be prime
max.count <- 0
for (a in seq.a) {
    for (b in seq.b) {
        # Find the length of the sequence of primes for a and b
        n <- 0
        while (is.prime(n^2 + a * n + b)) {
            n <- n + 1
        }
        # Store the maximum values
        if (n > max.count) {
            max.count <- n
            max.a <- a
            max.b <- b
        }
    }
}
answer <- max.a * max.b
print(answer)

View the latest version of this code on GitHub.

The Viral Recurring Decimal: Euler Problem 26

A few years ago a fraction broke the internet. What happens when you divide 1 by 998001?

\frac{1}{998001} = 0.000001002003004005006007008009010011012013014015 \ldots

What is special about this fraction is that it lists every three-digit number except for 998. Look carefully at the sequence to see that it runs 000, 001, 002, 003, 004, 005 and so on. After it has reached 999, the sequence continues from the start. This fraction thus has 2997 recurring decimals. James Grime from Numberphile explains this mathematical oddity with his usual enthusiasm.

The decimal fraction of 1/998001 is a recurring decimal. These are decimal numbers with periodic digits (repeating their values at regular intervals). Euler problem 26 asks us to analyse recurring decimals (reciprocal cycles).

Euler Problem 26 Definition

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

1/2 = 0.5
1/3 = 0.(3)
1/4 = 0.25
1/5 = 0.2
1/6 = 0.1(6)
1/7 = 0.(142857)
1/8 = 0.125
1/9 = 0.(1)
1/10 = 0.1

Where 0.1(6) means 0.166666…, and has a 1-digit recurring cycle. It can be seen that 1/7 has a 6-digit recurring cycle. Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

Solution

Sequence A051626 in the On-Line Encyclopaedia of Integer Sequences describes the length of the recurring decimals of 1/n. To solve Euler Problem 26, we need to generate the first 1000 numbers of this sequence and find out which number has the longest recurring cycle.

R can display only up to 22 digits, even when using options(digits = 22). Base R numbers are therefore unsuitable for solving this problem, so I wrote some code to perform long division the old-fashioned way.

The recur function divides 1 by any arbitrary integer. The code continues until the decimal terminates, for example 1/4 = 0.25, or when a recurring pattern emerges, e.g. 1/7 = 0.(142857).

The function has two arguments: x is the input number. The output argument determines the outcome of the function: “len” for the length of the recurring decimals. Any other value shows the result of the calculation. The output of the function is a string. Using the European notation, the recurring part of the decimals is shown between brackets, e.g. 1/14 = 0.0(714285).

recur <- function(x, output = "") {
    # Prepare variable
    if (x == 0) return(NaN)
    if (x == 1) return(0)
    x <- floor(abs(x))
    # Initiate vectors to store decimals and remainders
    dec <- vector()
    rem <- vector()
    # Initiate values
    i <- 1
    r <- 10
    rem <- r
    # Long division
    repeat {
        dec[i] <- floor(r / x)
        r <- 10 * (r %% x)
        # Test whether the number is terminating or repeating
        if (r == 0 | r %in% rem) break
        rem[i + 1] <- r
        i <- i + 1 
    }
    # Determine number of recurring digits
    rep <- ifelse(r != 0, length(rem) - which(r == rem) + 1, 0)
    # Output
    if (output == "len")
        return(rep)
    else {
        if (rep != 0) {
            if (rep == length(dec)) 
                l <- "("
            else
                l <- c(dec[1:(length(dec) - rep)], "(")
            dec <- c(l, dec[(length(dec) - rep + 1):length(dec)], ")")
        }
        return(paste0("0.", paste0(dec, collapse = "", sep = "")))
        }
}

A051626 <- sapply(1:1000, recur, "len")
answer <- which.max(A051626)
print(answer)

recur(998001)

You can view the latest version of this code on GitHub.

Large integers in R: Fibonacci number with 1000 digits, Euler Problem 25

The Fibonacci Sequence occurs in nature: The nautilus shell.

Euler Problem 25 takes us back to the Fibonacci sequence and the problems related to working with very large integers.

The Fibonacci sequence follows a simple mathematical rule, but it can create things of great beauty. This pattern occurs quite often in nature, like the nautilus shell shown in the image. The video by Arthur Benjamin at the end of this post illustrates some of the magic of this sequence.

Large Integers in R

By default, R displays numbers with more than seven digits in scientific notation. You can increase the number of digits shown with the options function, but standard R numbers only hold about 15 significant digits, so R struggles with integers of more than 22 digits. The example below illustrates this issue: the final digits of 24! are incorrect.

> factorial(24)
[1] 6.204484e+23
> options(digits = 22)
> factorial(24)
[1] 620448401733239409999872

However, finding a Fibonacci number with 1000 digits is a problem unless we use special functions or write our own code for large integers.

Euler Problem 25 Definition

The Fibonacci sequence is defined by the recurrence relation:

F_n = F_{n-1} + F_{n-2} , where F_1 = 1 and F_2 = 1 . The 12th term, F_{12} , is the first term to contain three digits.

What is the index of the first term in the Fibonacci sequence to contain 1000 digits?

Proposed Solutions

The first solution uses the GMP library to manage very large integers. This library also contains a function to generate Fibonacci numbers. This solution cycles through the Fibonacci sequence until it finds a number with 1000 digits.

library(gmp) # GNU Multiple Precision Arithmetic Library
n <- 1
fib <- 1
while (nchar(as.character(fib)) < 1000) {
   n <- n + 1
   fib <- fibnum(n) # Determine the next Fibonacci number
}
answer <- n
print(answer)

This is a very fast solution, but my aim is to solve the first 100 Project Euler problems using only base R code. The solution below therefore uses the big.add function I developed for Euler Problem 13.
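
The big.add function is not repeated in this post. A minimal sketch of how such a function could work, adding two numbers digit by digit as strings; this is my assumption, not necessarily the original implementation from Euler Problem 13.

# Add two arbitrarily large non-negative integers represented as strings
big.add <- function(a, b) {
    a <- rev(as.numeric(strsplit(as.character(a), "")[[1]]))
    b <- rev(as.numeric(strsplit(as.character(b), "")[[1]]))
    l <- max(length(a), length(b))
    a <- c(a, rep(0, l - length(a)))
    b <- c(b, rep(0, l - length(b)))
    digits <- a + b
    carry <- 0
    for (i in 1:l) {
        digits[i] <- digits[i] + carry
        carry <- digits[i] %/% 10
        digits[i] <- digits[i] %% 10
    }
    if (carry > 0) digits <- c(digits, carry)
    paste(rev(digits), collapse = "")
}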

t <- proc.time()
fib <- 1 # First Fibonacci number
cur <- 1 # Current number in sequence
pre <- 1 # Previous number in sequence
index <- 2
while (nchar(fib) < 1000) {
    fib <- big.add(cur, pre) # Determine next Fibonacci number
    pre <- cur
    cur <- fib
    index <- index + 1
}
answer <- index
print(answer)

This code is much slower than the GMP library but it was fun to develop.

View the latest version of this code on GitHub.

The Magic of the Fibonacci Numbers

Lexicographic Permutations: Euler Problem 24

Euler Problem 24 asks us to generate lexicographic permutations, which are arrangements of objects listed in lexicographic (dictionary) order. Tushar Roy of Coding Made Simple has shared a great introduction on how to generate lexicographic permutations.

Euler Problem 24 Definition

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are:

012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

Brute Force Solution

The digits 0 to 9 have 10! = 3628800 permutations (including those that start with 0). Generating all of them does not, however, automatically list them in lexicographic order. A brute-force way to solve the problem is to determine the next lexicographic permutation of a number string, starting from the first one, and repeat this 999,999 times.

nextPerm <- function(a) {
    # Find longest non-increasing suffix
    i <- length(a)
    while (i > 1 && a[i - 1] >= a[i])
        i <- i - 1
    # i is the head index of the suffix
    # Are we at the last permutation?
    if (i <= 1) return (NA)
    # a[i - 1] is the pivot
    # Find rightmost element that exceeds the pivot
    j <- length(a)
    while (a[j] <= a[i - 1]) 
        j <- j - 1
    # Swap pivot with j
    temp <- a[i - 1]
    a[i - 1] <- a[j]
    a[j] <- temp
    # Reverse the suffix
    a[i:length(a)] <- rev(a[i:length(a)])
    return(a)
    }

numbers <- 0:9
for (i in 1:(1E6 - 1)) numbers <- nextPerm(numbers)
answer <- numbers
print(answer)

This code takes the following steps:

  1. Find largest index i such that a_{i-1} < a_i.
    1. If no such index exists, then this is already the last permutation.
  2. Find largest index j such that j \geq i and a_j > a_{i-1}.
  3. Swap a_j and a_{i-1}.
  4. Reverse the suffix starting at a_i.

Combinatorics

A more efficient solution is to use combinatorics, thanks to MathBlog. The last nine digits can be ordered in 9! = 362880 ways. So the first 9! permutations start with a 0. By extending this thought, it follows that the millionth permutation must start with a 2.

\lfloor (1000000 - 1) / 9! \rfloor  = 2

From this rule, it follows that the 725761st permutation is 2013456789. We now need 274239 more lexicographic permutations:

(1000000 - 1) - (2 \times 9!) = 274239

We can repeat this logic to find the next digit. The last eight digits can be ordered in 8! = 40320 ways. The second digit of the answer is the digit with index 6 (counting from zero) in the remaining digits 013456789, which is 7.

\lfloor 274239 / 8! \rfloor  = 6

274239 - (6 \times 8!) = 32319

This process is repeated until all digits have been used.

numbers <- 0:9
n <- length(numbers)
answer <- vector(length = 10)
remain <- 1E6 - 1
for (i in 1:n) {
    j <- floor(remain / factorial(n - i))
    answer[i] <- numbers[j + 1]
    remain <- remain %% factorial(n - i)
    numbers <- numbers[-(j + 1)]
}
answer <- paste(answer, collapse = "")
print(answer)

View the latest version of this code on GitHub.

R blogger Tony’s Bubble Universe created a generalised function to solve this problem a few years ago.