Tic Tac Toe Simulation — Random Moves

Tic Tac Toe might seem a futile children’s game, but it can also teach us about artificial intelligence. Tic Tac Toe, or Noughts and Crosses, is a zero-sum game with perfect information. Both players know exactly what the other has done, and when nobody makes a mistake, the game always ends in a draw.

Tic Tac Toe is a simple game, but the much more complex game of chess is also a zero-sum game with perfect information.

In this two-part post, I will build an unbeatable Tic Tac Toe Simulation. This first part deals with the mechanics of the game. The second post will present an algorithm for a perfect game.

Drawing the Board

This first code snippet draws the Tic Tac Toe board. The variable xo holds the symbols for the pieces and the vector board holds the current game. Player X is denoted by -1 and player O by +1. The first part of the function draws the grid and the noughts and crosses. The second part of the code checks for three in a row and draws the corresponding winning line.

draw.board <- function(board) { # Draw the board
    xo <- c("X", " ", "O") # Symbols
    par(mar = rep(0,4))
    plot.new()
    plot.window(xlim = c(0,30), ylim = c(0,30))
    abline(h = c(10, 20), col="darkgrey", lwd = 4)
    abline(v = c(10, 20), col="darkgrey", lwd = 4)
    pieces <- xo[board + 2]
    text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), pieces, cex = 6)
    # Identify location of any three in a row
    square <- t(matrix(board, nrow = 3))
    hor <- abs(rowSums(square))
    if (any(hor == 3)) 
      hor <- (4 - which(hor == 3)) * 10 - 5 
    else 
      hor <- 0
    ver <- abs(colSums(square))
    if (any(ver == 3)) 
      ver <- which(ver == 3) * 10 - 5 
    else
      ver <- 0
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev)))) 
    # Draw winning lines 
    if (hor > 0) lines(c(0, 30), rep(hor, 2), lwd=10, col="red")
    if (ver > 0) lines(rep(ver, 2), c(0, 30), lwd=10, col="red")
    if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd=10, col="red")
    if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd=10, col="red")
}
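For example, the following call draws a finished game in which player X wins along the main diagonal (an illustrative board position, not part of the simulation below):

# Example board: X wins along the main diagonal
draw.board(c(-1, 1, 0, 0, -1, 1, 0, 0, -1))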

Random Tic Tac Toe

The second part of the code generates ten random games and creates an animated GIF file. The code adds random moves until one of the players wins (winner != 0) or the board is full (no zeroes left in the game vector). The eval.winner function checks for three in a row and declares a winner when it finds one.

There are 255,168 possible legal games in Tic Tac Toe, 46,080 of which end in a draw, which is about 18% of all legal games. Note that random play does not visit each of these games with equal probability, so the draw rate in these simulations will differ somewhat from 18%.

eval.winner <- function(board) { # Identify winner
    square <- t(matrix(board, nrow = 3))
    hor <- rowSums(square)
    ver <- colSums(square)
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev))))
    if (3 %in% c(hor, ver, diag1, diag2))
        return(1) # Three in a row for player O (+1)
    else if (-3 %in% c(hor, ver, diag1, diag2))
        return(2) # Three in a row for player X (-1)
    else
        return(0) # No winner (yet)
}

# Random game
library(animation)
saveGIF ({
 for (i in 1:10) {
 game <- rep(0, 9) # Empty board
 winner <- 0 # Define winner
 player <- -1 # First player
 draw.board(game)
 while (0 %in% game & winner == 0) { # Keep playing until win or full board
   empty <- which(game == 0) # Define empty squares
   move <- empty[sample(length(empty), 1)] # Random move
   game[move] <- player # Change board
   draw.board(game)
   winner <- eval.winner(game) # Evaluate game
   player <- player * -1 # Change player
 }
 draw.board(game)
 }
 },
 interval = 0.25, movie.name = "ttt.gif", ani.width = 600, ani.height = 600)
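The same machinery can be used to check the draw rate empirically. The sketch below replays a large number of random games without the animation and reports the proportion of draws; because random play favours shorter games, the result will not exactly match the 18% share of draws among all legal games.

# Estimate the draw rate of purely random play (sketch, no animation)
set.seed(1)
n.games <- 10000
draws <- 0
for (i in 1:n.games) {
    game <- rep(0, 9)
    winner <- 0
    player <- -1
    while (0 %in% game & winner == 0) {
        empty <- which(game == 0)
        move <- empty[sample(length(empty), 1)]
        game[move] <- player
        winner <- eval.winner(game)
        player <- player * -1
    }
    if (winner == 0) draws <- draws + 1
}
draws / n.games # Proportion of random games ending in a draw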

Tic Tac Toe Simulation

In a future post, I will outline how to program the computer to play against itself, just like in the 1983 movie War Games.

Euler Problem 15: Pathways Through a Lattice

Euler Problem 15 analyses taxicab geometry. This system replaces the usual distance function with the sum of the absolute differences of the Cartesian coordinates of two points. In other words, the distance a taxi would travel in a grid plan. The fifteenth Euler problem asks us to determine the number of possible routes a taxi can take in a city of a certain size.

Euler Problem 15 Definition

Starting in the top left corner of a 2×2 grid, and only being able to move to the right and down, there are exactly 6 routes to the bottom right corner. How many possible routes are there through a 20×20 grid?

Euler Problem 15: Lattice Paths

Solution

The lattice of points is one larger than the number of squares in each direction. Along the bottom and right edges of the matrix, only one pathway is possible: straight to the right or straight down. For every other point, we can calculate the number of possible pathways by adding the numbers immediately to the right of and below that point.

p_{i,j} = p_{i,j+1} + p_{i+1,j}

For the two by two lattice the solution space is:

6  3  1
3  2  1
1  1  0

The total number of pathways from the upper left corner to the lower right corner is thus 6. This logic can now be applied to a grid of any arbitrary size using the following code.

The code defines the lattice and initialises the boundary conditions. The bottom row and the right column are filled with 1 as there is only one solution from these points. The code then calculates the pathways by working backwards through the matrix. The final solution is the number in the first cell.

# Define lattice
nLattice <- 20
lattice <- matrix(ncol = nLattice + 1, nrow = nLattice + 1)

# Boundary conditions
lattice[nLattice + 1,-(nLattice + 1)] <- 1
lattice[-(nLattice + 1), nLattice + 1] <- 1

# Calculate Pathways
for (i in nLattice:1) {
    for (j in nLattice:1) {
        lattice[i,j] <- lattice[i+1, j] + lattice[i, j+1]
    }
}

answer <- lattice[1,1]
print(answer)
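The result can be cross-checked against the closed-form solution. Every route through an n×n grid consists of n moves right and n moves down in some order, so the number of routes equals the binomial coefficient C(2n, n):

# Cross-check against the closed-form solution
choose(2 * nLattice, nLattice)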

Taxicab Geometry

Create Air Travel Route Maps in ggplot: A Visual Travel Diary

Create Air Travel Route Maps: Emirates Route Map

I have been lucky to fly to a few countries around the world. Like any other bored traveller, I thumb through the airline magazines and look at the air travel route maps. These maps are beautifully stylised depictions of the world, with gently curved lines between the many destinations. I always wanted such a map for my own travel adventures.

Create Air Travel Route Maps using ggplot2

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.
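For illustration, the flight list could look something like this (hypothetical rows, not my actual travel history):

From,To
Melbourne,Singapore
Singapore,Amsterdam
Amsterdam,New York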

# Read flight list
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# Lookup coordinates
library(ggmap)
airports <- unique(c(flights$From, flights$To))
coords <- geocode(airports)
airports <- data.frame(airport=airports, coords)

We now have a data frame of airports with their coordinates and can create the route map. The coordinates are merged with the flight list so that each route can be drawn with the curve geom. The borders function of ggplot2 creates the map data, and the ggrepel package helps prevent overplotting of the text labels.

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="To", by.y="airport")
flights <- merge(flights, airports, by.x="From", by.y="airport")

# Plot flight routes
library(ggplot2)
library(ggrepel)
worldmap <- borders("world", colour="#efede1", fill="#efede1") # create a layer of borders
ggplot() + worldmap + 
 geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y), col = "#b29e7d", size = 1, curvature = .2) + 
 geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") + 
 geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), col = "black", size = 2, segment.color = NA) + 
 theme(panel.background = element_rect(fill="white"), 
 axis.line = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_blank(),
 axis.ticks = element_blank(),
 axis.title.x = element_blank(),
 axis.title.y = element_blank()
 )

I also tried to use the ggmap package to display the map with a satellite image background. This did not work because the curve geom struggles with the map projections used in ggmap. Another problem is that the flight from Auckland to Los Angeles is drawn the wrong way around the globe. I hope no flat-earthers see this map, because they might use it as proof that the world is flat.

Euler Problem 14: Longest Collatz Sequence

Euler Problem 14 looks at the Collatz Conjecture. These playful sequences, named after German mathematician Lothar Collatz (1910–1990), cause mathematicians a lot of headaches. This video introduces the problem much better than I can describe it.

Euler Problem 14 Definition

The following iterative sequence is defined for the set of positive integers:

n \rightarrow n/2 ( n is even)
n \rightarrow 3n + 1 ( n is odd)

Using the rule above and starting with 13, we generate the following sequence:

13 \rightarrow 40 \rightarrow 20 \rightarrow 10 \rightarrow 5 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1

This sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1. Which starting number, under one million, produces the longest chain? Note: Once the chain starts the terms are allowed to go above one million.

Solution

This problem is computationally intensive and highlights R’s relative lack of speed. Generating one million Collatz sequences and finding the longest one with the naive code below takes far more than the one minute of processing time that Project Euler allows.

collatz.chain <- function(n) {
    chain <- vector()
    i <- 1
    while (n != 1) {
        if (n%%2 == 0) 
            n <- n / 2
        else
            n <- 3 * n + 1
        chain[i] <- n
        i <- i + 1
    }
    return(chain)
}
answer <- 0
collatz.max <- 0
for (n in 1:1E6) {
    collatz.length <- length(collatz.chain(n)) 
    if (collatz.length > collatz.max) {
        answer <- n
        collatz.max <- collatz.length
        }
}
print(answer)
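A quick sanity check against the worked example in the definition: collatz.chain returns the terms after the starting number, so the chain for 13 has nine elements; including the starting number itself gives the ten terms listed above. The off-by-one is the same for every starting number, so it does not affect which number produces the longest chain.

# Sanity check against the example for 13
collatz.chain(13)         # 40 20 10 5 16 8 4 2 1
length(collatz.chain(13)) # 9 terms after the start, 10 including 13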

The second version of the code contains some optimisations. It stores the length of every sequence in a vector. When the code generates a sequence and lands on a number it has already analysed, it adds the stored length for that number to the current count and moves on. This approach requires more memory but saves a lot of computation time. A minor tweak optimises the rule for odd numbers: tripling an odd number and adding one always results in an even number, so we can combine the two steps and halve immediately. This solution is more than twice as fast as the first version.

collatz.length <- vector(length=1e6)
collatz.length[1] <- 0
for (n in 2:1e6) {
    x <- n
    count <- 0
    while (x != 1 & x >= n) {
        if (x %% 2 == 0) {
            x <- x / 2
            count <- count + 1
        }
        else {
            x <- (3 * x + 1) / 2
            count <- count + 2
        } 
    }
    count <- count + collatz.length[x]
    collatz.length[n] <- count
}
answer <- which.max(collatz.length)
print(answer)

Visualising Collatz Sequences

The Collatz sequence is an example of a simple mathematical rule that can create an unpredictable pattern. The number of steps required to reach 1 is listed in A006577 of the Online Encyclopedia of Integer Sequences.

The image below visualises the number of steps for the first 1000 positive numbers. The scatterplot shows some interesting patterns. Does this visualisation show that the Collatz Sequence does have a pattern after all?

Euler Problem 14: Number of halving and tripling steps to reach 1 in the Collatz problem.
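The plot is easy to reproduce once the collatz.length vector from the optimised code above is available; a minimal sketch using ggplot2:

# Scatterplot of the number of steps for the first 1000 starting numbers
library(ggplot2)
steps <- data.frame(n = 1:1000, steps = collatz.length[1:1000])
ggplot(steps, aes(n, steps)) +
    geom_point(colour = "dodgerblue") +
    labs(x = "Starting number", y = "Steps to reach 1")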

Collatz Chains

The Collatz sequences can also be visualised using networks. Each step between two numbers is an edge and the numbers themselves are the vertices. For example, the network for the Collatz sequence of the number 10 consists of the edges 10–5, 5–16, 16–8, 8–4, 4–2 and 2–1. When generating subsequent sequences, the networks start to overlap and a tree of sequences appears. The tree below combines the Collatz sequences for the numbers 2 to 26. Number 27 has a very long sequence, which would make the tree much harder to read.

Network of Collatz sequences n=2-26


edgelist <- data.frame(a = 2, b = 1)
for (n in 3:26) {
   chain <- as.character(c(n, collatz.chain(n)))
   chain <- data.frame(a = chain[-length(chain)], b = chain[-1])
   edgelist <- rbind(edgelist, chain)
}
library(igraph)
g <- graph.edgelist(as.matrix(edgelist))
g <- simplify(g)
par(mar=rep(0,4))
V(g)$color <- degree(g, mode = "out") + 1
plot(g, 
     layout=layout.kamada.kawai,
     vertex.color=V(g)$color, 
     vertex.size=6,
     vertex.label.cex=.7,
     vertex.label.color="black",
     edge.arrow.size=.1,
     edge.color="black"
     )

 

Euler Problem 13: Large Sum of 1000 Digits

Euler Problem 13 asks us to add one hundred numbers of fifty digits each. This seems like a simple problem, were it not that most computers are not designed to deal with numbers with this many digits. For example:

 2^{64} = 18446744073709551616

When asking R to compute this value, we get 1.844674e+19, losing most of the digits and limiting the accuracy of the result. Computers solve this problem using arbitrary-precision arithmetic. There are many software libraries that can process long integers without losing accuracy, and Euler Problem 13 requires this type of approach.

Euler Problem 13 Definition

Work out the first ten digits of the sum of the following one-hundred 50-digit numbers.

Solution

The easy way to solve this problem is to use the gmp package for working with very large integers. This package provides special number types such as Big Rational and Big Integer. The number of digits in these types is only limited by the available memory.

library(gmp)
numbers <- readLines("Euler/p013_numbers.txt")
digits <- sum(as.bigz(numbers))
answer <- substr(as.character(digits),1,10)

Using Base-R

To find the solution to this problem using only base R, I wrote a function that adds numbers using strings instead of integers. The function pads the smaller number with leading zeros so that both strings have the same length, and then adds the digits column by column, in the same way we were taught in primary school. This function can in principle be reused for several other Euler Problems involving large integers.

# Add numbers with many digits
big.add <- function(a, b) {
    # Add leading zeros to the smaller number
    if (nchar(a) < nchar(b)) 
        a <- paste0(paste(rep(0, nchar(b) - nchar(a)), collapse = ""), a) 
    if (nchar(a) > nchar(b)) 
        b <- paste0(paste(rep(0, nchar(a) - nchar(b)), collapse = ""), b)
    solution <- vector()
    remainder <- 0
    for (i in nchar(b):1) {
        p <- as.numeric(substr(a, i, i))
        q <- as.numeric(substr(b, i, i))
        r <- p + q + remainder 
        if (r >= 10 & i!=1) {
            solution <- c(solution, r %% 10)
            remainder <- (r - (r %% 10))/10
        } else {
            solution <- c(solution, r)
            remainder <- 0
        }
    }
    return(paste(rev(solution), collapse = ""))
}

With this function, the problem is easy to solve. The second part of the code runs this function over the one hundred numbers provided on the Euler Problem page and calculates the answer.

numbers <- readLines("Euler/p013_numbers.txt")
answer <- "0" # Initialise the running total as a string
for (i in numbers) {
    answer <- big.add(answer, i)
}
answer <- substr(answer, 1, 10)

Multiplying Big Numbers

You can expand this function to multiply a very large number by a smaller number using the Reduce function. The function repeatedly adds the number a to itself with big.add, feeding the outcome of each addition into the next iteration until a has been added b times. The number b needs to be a ‘low’ number because the function creates a vector of length b.

big.mult <- function(a, b) {
    Reduce(big.add, rep(a, as.numeric(b)))
}
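A few quick checks on small numbers, which are easy to verify by hand, illustrate both functions:

# Small examples that are easy to verify by hand
big.add("999", "1")               # "1000"
big.add("123456789", "987654321") # "1111111110"
big.mult("123456789", "3")        # "370370367"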

Euler Problem 12: Highly Divisible Triangular Number

Euler Problem 12: Divisors of triangular numbers.

The divisors of 10 illustrated with Cuisenaire rods: 1, 2, 5, and 10 (Wikipedia).

Euler Problem 12 takes us to the realm of triangular numbers and proper divisors.

The image on the left shows a hands-on method to visualise the number of divisors of an integer. Cuisenaire rods are learning aids that provide a hands-on way to explore mathematics.

Euler Problem 12 Definition

The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, \ldots Let us list the factors of the first seven triangle numbers:

1: 1

3: 1, 3

6: 1, 2, 3, 6

10: 1, 2, 5, 10

15: 1, 3, 5, 15

21: 1, 3, 7 ,21

28: 1, 2, 4, 7, 14, 28

We can see that 28 is the first triangle number to have over five divisors. What is the value of the first triangle number to have over five hundred divisors?

Solution

Vishal Kataria explains a simple method to determine the number of divisors using prime factorisation in his video below. The prime factorisation of n is given by:

n = p_1^{\alpha_1} \times p_2^{\alpha_2} \times \ldots \times p_k^{\alpha_k}

The number of divisors is then:

d = (\alpha_1 + 1) (\alpha_2 + 1) \ldots (\alpha_k + 1)

The code reuses the prime factorisation function developed for Euler Problem 3. This function results in a vector of all prime factors, e.g. the prime factors of 28 are 2, 2 and 7.

The code determines the values for alpha with the run-length function rle. This function counts the number of times each element in a sequence is repeated and returns the values together with their run lengths. The prime factors of 28 are 2 and 7, with run lengths 2 and 1, so the number of divisors can be determined as follows.

28 = 2^2 \times 7^1

d = (2+1)(1+1) = 6
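In R, this calculation for 28 takes two lines, assuming the prime factors are available as a vector:

# Divisor count of 28 from its prime factors 2, 2 and 7
alpha <- rle(c(2, 2, 7))  # run lengths 2 and 1 for the values 2 and 7
prod(alpha$lengths + 1)   # (2 + 1) * (1 + 1) = 6 divisors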

The code to solve Euler Problem 12 is shown below. The loop continues until it finds a triangular number with at least 500 divisors. The first two lines in the loop increment the index and create the next triangular number. The next two lines find the prime factors and determine how often each factor is repeated (the run lengths). The last line calculates the number of divisors using the formula above.

i <- 0
divisors <- 0
while (divisors < 500) {
    i <- i + 1
    triangle <- (i * (i+1)) / 2
    pf <- prime.factors(triangle)
    alpha <- rle(pf)
    divisors <- prod(alpha$lengths+1)
}
answer <- triangle
print(answer)
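The loop assumes that the prime.factors function from Euler Problem 3 is available. In case it is not at hand, a minimal trial-division version (a sketch, not necessarily identical to the original function) looks like this:

# Minimal trial-division factorisation (sketch): prime factors in ascending order
prime.factors <- function(n) {
    factors <- vector()
    d <- 2
    while (d * d <= n) {
        while (n %% d == 0) {
            factors <- c(factors, d)
            n <- n / d
        }
        d <- d + 1
    }
    if (n > 1) factors <- c(factors, n) # Any remainder greater than 1 is prime
    return(factors)
}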

Percentile Calculations in Water Quality Regulations

Percentile calculations can be more tricky than meets the eye. A percentile indicates the value below which a given percentage of observations falls. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceptively simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function returns the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and shades the area between the median and the third quartile.

set.seed(1969)
test.data <- rnorm(n = 10000, mean = 100, sd = 15)
library(ggplot2)
ggplot(as.data.frame(test.data), aes(test.data)) + 
 geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") + 
 geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), colour = "red", size = 1) + 
 geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), 
 colour = "red", fill="red", alpha = 0.5, xlim = quantile(test.data, c(0.5, 0.75))) + 
 theme(text = element_text(size = 16))

The quantile default function and the 95th percentile give the following results:

> quantile(test.data)
       0%       25%       50%       75%      100% 
 39.91964  89.68041 100.16437 110.01910 153.50195 

> quantile(test.data, probs=0.95)
     95% 
124.7775 

Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the normal distribution simulated above the outcome for all methods is exactly the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(test.data, 0.95, type = m))

     95%      95%      95%      95%      95%      95%      95%      95%      95% 
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 

Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name commonly used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name \hat{Q}_6(p), the sixth of their nine definitions, which corresponds to type 6 in R. Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses the name Weibull Method, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6).

weibull.quantile <- function(x, p) {
    # Order samples from small to large
    x <- x[order(x, decreasing = FALSE)]
    # Determine ranking of percentile according to Weibull (1939)
    r <- p * (length(x) + 1)
    # Linear interpolation
    rfrac <- (r - floor(r))
    return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1])
}
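A quick comparison with the simulated data from earlier confirms that the function matches the built-in implementation:

# The hand-rolled Weibull method agrees with R's type 6 quantile
weibull.quantile(test.data, 0.95)
quantile(test.data, 0.95, type = 6)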

Turbidity Data Example

Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The weekly turbidity measurements range between 0.05 NTU and 0.8 NTU, well below the aesthetic limit.


Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53).

When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result.

Zone         Type 1   Type 2   Type 3   Type 4   Type 5   Type 6   Type 7   Type 8   Type 9
Bealiba      0.300    0.300    0.200    0.240    0.290    0.300    0.245    0.300    0.300
Dunolly      0.40000  0.40000  0.30000  0.34000  0.39000  0.43500  0.34500  0.40500  0.40125
Laanecoorie  0.50000  0.50000  0.40000  0.44000  0.49000  0.53500  0.44500  0.50500  0.50125
Tarnagulla   0.4      0.4      0.4      0.4      0.4      0.4      0.4      0.4      0.4

The graph and the table were created with the following code snippet:

ggplot(turbidity, aes(Result)) + 
 geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + 
 facet_wrap(~Zone) + 
 theme(text=element_text(size=16))

tapply(turbidity$Result, turbidity$Zone, 
 function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))

Euler Problem 11: Largest Product in a Grid

Euler Problem 11 Definition

In the 20×20 grid below, four numbers along a diagonal line have been marked in red.

Euler problem 11

The product of these numbers is 26 × 63 × 78 × 14 = 1,788,696. What is the greatest product of four adjacent numbers in the same direction (up, down, left, right, or diagonally) in the 20 by 20 grid?

Solution

The solution applies straightforward vector arithmetic. The products of all vertical runs of four are found by multiplying, element by element, rows 1 to 4, rows 2 to 5, and so on. The code uses the same logic for the horizontals and the two diagonal directions.

#Read and convert data
square <- readLines("Euler/p011_matrix.txt")
square <- as.numeric(unlist(lapply(square, function(x){strsplit(x, " ")})))
square <- matrix(square, ncol=20)

# Define products
prod.vert <- square[1:17, ] * square[2:18, ] * square[3:19, ] * square[4:20, ]
prod.hori <- square[,1:17] * square[,2:18] * square[,3:19] * square[,4:20]
prod.dia1 <- square[1:17, 1:17] * square[2:18, 2:18] * square[3:19, 3:19] * square[4:20, 4:20]
prod.dia2 <- square[4:20, 1:17] * square[3:19, 2:18] * square[2:18, 3:19] * square[1:17, 4:20]

answer <- max(prod.vert, prod.hori, prod.dia1, prod.dia2)

 

 

 

Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as data science manager for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data don’t achieve them because they don’t start with the basics.

The most important piece of data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

  • Utility: The analysis needs to be able to improve reality, otherwise we end up with ‘analysis paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth; we seek an improvement of reality.
  • Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
  • Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
  • Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle leveraging the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals that write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.

Euler Problem 10: Summation of Primes

Euler Problem 10 asks for the summation of primes. Computationally this is a simple problem because we can re-use the prime sieve developed for problem 3.

When generating a large number of primes, the erratic pattern in which they occur is much more interesting than their sum. Mathematicians consider primes the basic building blocks of number theory. No matter how hard we look, however, they do not seem to follow any logical sequence. This Euler Problem is simply a matter of adding up the primes generated by the sieve.

Euler Problem 10 Definition

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.

Solution

The sieve of Eratosthenes function used in Euler Problem 3 can be reused once again to generate the prime numbers between two and two million. An interesting issue occurs when I run the code: when I sum all the primes without the as.numeric conversion, R warns about an integer overflow and recommends the conversion.
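The code below relies on the esieve function from Euler Problem 3. In case you do not have it handy, a basic sieve of Eratosthenes (a sketch, not necessarily identical to the original function) is easy to write:

# Basic sieve of Eratosthenes (sketch): all primes up to n, assuming n >= 4
esieve <- function(n) {
    is.prime <- rep(TRUE, n)
    is.prime[1] <- FALSE
    for (i in 2:floor(sqrt(n))) {
        if (is.prime[i])
            is.prime[seq(i * i, n, by = i)] <- FALSE
    }
    which(is.prime)
}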

primes <- esieve(2e6)
answer <- (sum(as.numeric(primes)))
print(answer)

This results in a vector containing the first 148,933 prime numbers. The largest gap between consecutive primes is 132, and it seems that sexy primes (pairs that differ by six) are more common than twin or cousin primes (note the spikes at intervals of 6 in the bar chart).

Euler Problem 10: Prime gaps for all primes up to two million
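The gap statistics mentioned above can be computed directly from the vector of primes:

# Gaps between consecutive primes below two million
gaps <- diff(primes)
max(gaps)            # Largest prime gap
barplot(table(gaps)) # Note the spikes at multiples of six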

The summing of primes reveals an interesting problem in mathematics. Goldbach’s conjecture is one of the oldest and best-known unsolved problems in number theory and states that:

Every even integer greater than 2 can be expressed as the sum of two primes.

Note that this conjecture is only about even numbers. Goldbach also theorised that every odd composite number can be written as the sum of a prime and twice a square. This conjecture is the subject of Euler Problem 46, which I will work on soon.