Euler Problem 17: Number Letter Counts

Euler Problem 17: written numbersEuler Problem 17 asks to count the letters in numbers written as words. This is a skill we all learnt in primary school mainly useful when writing cheques—to those that still use them.

Each language has its own rules for writing numbers. My native language Dutch has very different logic to English. Both Dutch and English use compound words after the number twelve.

Linguists have theorised this is evidence that early Germanic numbers were duodecimal. This factoid is supported by the importance of a “dozen” as a counting word and the twelve hours in the clock. There is even a Dozenal Society that promotes the use of a number system based on 12.

The English language changes the rules when reaching the number 21. While we say eight-teen in English, we do no say “one-twenty”. Dutch stays consistent and the last number is always spoken first. For example, 37 in English is “thirty-seven”, while in Dutch it is written as “zevenendertig” (seven and thirty).

Euler Problem 17 Definition

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total. If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.

Solution

The first piece of code provides a function that generates the words for numbers 1 to 999,999. This is more than the problem asks for, but it might be a useful function for another application. The last line concatenates all words together and removes the spaces.

numword.en <- function(x) { if (x > 999999) return("Error: Oustide my vocabulary")
    # Vocabulary 
    single <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
    teens <- c( "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
    tens <- c("ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
    # Translation
    numword.10 <- function (y) {
        a <- y %% 100
        if (a != 0) {
            and <- ifelse(y > 100, "and", "")
            if (a < 20)
                return (c(and, c(single, teens)[a]))
            else
                return (c(and, tens[floor(a / 10)], single[a %% 10]))
        }
    }
    numword.100 <- function (y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a != 0)
            return (c(single[a], "hundred"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a != 0)
            return (c(numword.100(a), numword.10(a), "thousand"))
    }
    numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse=" ")
    return (trimws(numword))
}

answer <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.en), collapse="")))
print(answer)

Writing Numbers in Dutch

I went beyond Euler Problem 17 by translating the code to spell numbers in Dutch. Interesting bit of trivia is that it takes 307 fewer characters to spell the numbers 1 to 1000 in Dutch than it does in English.

It would be good if other people can submit functions for other languages in the comment section. Perhaps we can create an R package with a multi-lingual function for spelling numbers.

numword.nl <- function(x) {
    if (x > 999999) return("Error: Getal te hoog.")
    single <- c("een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen")
    teens <- c( "tien", "elf", "twaalf", "dertien", "veertien", "fifteen", "zestien", "zeventien", "achtien", "negentien")
    tens <- c("tien", "twintig", "dertig", "veertig", "vijftig", "zestig", "zeventig", "tachtig", "negengtig")
    numword.10 <- function(y) {
        a <- y %% 100
        if (a != 0) {
            if (a < 20)
                return (c(single, teens)[a])
            else
                return (c(single[a %% 10], "en", tens[floor(a / 10)]))
        }
    }
    numword.100 <- function(y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a == 1)
            return ("honderd")
        if (a > 1) 
            return (c(single[a], "honderd"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a == 1)
            return ("duizend ")
        if (a > 0)
            return (c(numword.100(a), numword.10(a), "duizend "))
    }
    numword<- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse="")
    return (trimws(numword))
}

antwoord <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.nl), collapse="")))
print(antwoord)

print(answer - antwoord)

Euler Problem 16: Power Digit Sum

Euler Problem 16: Power Digit SumEuler Problem 16 is reminiscent of the famous fable of wheat and chess. Lahur Sessa invented the game of chess for King Iadava. The king was very pleased with the game and asked Lahur to name his reward.

Lahur asked the king to place one grain of rice on the first square of a chessboard, two on the next square, four on the third square and so on until the board is filled. The king was happy with his humble request until his mathematicians worked out that it would take millions of tonnes of grain. Assuming there are 25 grains of wheat in a gramme, the last field will contain more than 461,168,602,000 tonnes of grain.

Euler Problem 16 Definition

2^{15} = 32768 and the sum of its digits is 3 + 2 + 7 + 6 + 8 = 26 . What is the sum of the digits of the number 2^{1000} ?

Solution

The most straightforward solution uses the GMP package for Multiple Precision Arithmetic to calculate big integers. The as.bigz function results in a special class of arbitrarily large integer numbers

# Raise 2 to the power 1000
library(gmp)
digits <- as.bigz(2^1000) # Define number
# Sum all digits
answer <- sum(as.numeric(unlist(strsplit(as.character(digits), ""))))
print(answer)

We can also solve this problem in base-r with the bigg.add function which I developed for Euler Problem 13. This function uses basic string operations to add to arbitrarily large numbers. Raising a number to the power of two can also be written as a series of additions:

2^4 = 2 \times 2 \times 2 \times 2 = ((2+2)+(2+2)) + ((2+2)+(2+2))

The solution to this problem is to add 2 + 2 then add the outcome of that equation to itself, and so on. Repeat this one thousand times to raise the number two to the power of one thousand.

# Raise 2 to the power 1000
pow <- 2
for (i in 2:1000)
    pow <- big.add(pow, pow)
# Sum all digits
answer <- sum(as.numeric(unlist(strsplit(pow, ""))))
print(answer)

Tic Tac Toe Simulation — Random Moves

Tic Tac Toe Simulation - Part 1Tic Tac Toe might be a futile children’s game but it can also teach us about artificial intelligence. Tic Tac Toe, or Naughts and Crosses, is a zero-sum game with perfect information. Both players know exactly what the other did and when nobody makes a mistake, the game will always end in a draw.

Tic Tac Toe is a simple game but also the much more complex game of chess is a zero-sum game with perfect information.

In this two-part post, I will build an unbeatable Tic Tac Toe Simulation. This first part deals with the mechanics of the game. The second post will present an algorithm for a perfect game.

Drawing the Board

This first code snippet draws the Tic Tac Toe simulation board. The variable xo holds the identity of the pieces and the vector board holds the current game. Player X is denoted with -1 and player O with +1. The first part of the function draws the board and the naughts and crosses. The second part of the code check for three in a row and draws the corresponding line.

draw.board <- function(board) { # Draw the board
    xo <- c("X", " ", "O") # Symbols
    par(mar = rep(0,4))
    plot.new()
    plot.window(xlim = c(0,30), ylim = c(0,30))
    abline(h = c(10, 20), col="darkgrey", lwd = 4)
    abline(v = c(10, 20), col="darkgrey", lwd = 4)
    pieces <- xo[board + 2]
    text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), pieces, cex = 6)
    # Identify location of any three in a row
    square <- t(matrix(board, nrow = 3))
    hor <- abs(rowSums(square))
    if (any(hor == 3)) 
      hor <- (4 - which(hor == 3)) * 10 - 5 
    else 
      hor <- 0
    ver <- abs(colSums(square))
    if (any(ver == 3)) 
      ver <- which(ver == 3) * 10 - 5 
    else
      ver <- 0
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev)))) 
    # Draw winning lines 
    if (hor > 0) lines(c(0, 30), rep(hor, 2), lwd=10, col="red")
    if (ver > 0) lines(rep(ver, 2), c(0, 30), lwd=10, col="red")
    if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd=10, col="red")
    if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd=10, col="red")
}

Random Tic Tac Toe

The second part of the code generates ten random games and creates and animated GIF-file. The code adds random moves until one of the players wins (winner <> 0) or the board is full (no zeroes in the game vector). The eval.winner function checks for three in a row and declares a winner when found.

There are 255,168 possible legal games in Tic Tac Toe, 46,080 of which end in a draw. This implies that these randomised games result in a draw 18% of the time.

eval.winner <- function(board) { # Identify winner
    square <- t(matrix(board, nrow = 3))
    hor <- rowSums(square)
    ver <- colSums(square)
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev))))
    if (3 %in% c(hor, ver, diag1, diag2)) return (1)
    else
        if (-3 %in% c(hor, ver, diag1, diag2)) return (2)
    else
        return(0)
}

# Random game
library(animation)
saveGIF ({
 for (i in 1:10) {
 game <- rep(0, 9) # Empty board
 winner <- 0 # Define winner
 player <- -1 # First player
 draw.board(game)
 while (0 %in% game & winner == 0) { # Keep playing until win or full board
   empty <- which(game == 0) # Define empty squares
   move <- empty[sample(length(empty), 1)] # Random move
   game[move] <- player # Change board
   draw.board(game)
   winner <- eval.winner(game) # Evaulate game
   player <- player * -1 # Change player
 }
 draw.board(game)
 }
 },
 interval = 0.25, movie.name = "ttt.gif", ani.width = 600, ani.height = 600)

Tic Tac Toe Simulation

In a future post, I will outline how to program the computer to play against itself, just like in the 1983 movie War Games.

Euler Problem 15: Pathways Through a Lattice

Euler Problem 15 analyses taxicab geometry. This system replaces the usual distance function with the sum of the absolute differences of their Cartesian coordinates. In other words, the distance a taxi would travel in a grid plan. The fifteenth Euler problem asks to determine the number of possible routes a taxi can take in a city of a certain size.

Euler Problem 15 Definition

Starting in the top left corner of a 2×2 grid, and only being able to move to the right and down, there are exactly 6 routes to the bottom right corner. How many possible routes are there through a 20×20 grid?

Euler Problem 15: Lattice Paths

Solution

The defined lattice is one larger than the number of squares. Along the edges of the matrix, only one pathway is possible: straight to the right or down. We can calculate the number of possible pathways for the remaining number by adding the number to the right and below the point.

p_{i,j}=p_{i,j{+1}}+p_{{i+1},j}

For the two by two lattice the solution space is:

6  3  1
3  2  1
1  1  0

The total number of pathways from the upper left corner to the lower right corner is thus 6. This logic can now be applied to a grid of any arbitrary size using the following code.

The code defines the lattice and initiates the boundary conditions. The bottom row and the right column are filled with 1 as there is only one solution from these points. The code then calculates the pathways by working backwards through the matrix. The final solution is the number is the first cell.

# Define lattice
nLattice <- 20
lattice = matrix(ncol=nLattice + 1, nrow=nLattice + 1)

# Boundary conditions
lattice[nLattice + 1,-(nLattice + 1)] <- 1
lattice[-(nLattice + 1), nLattice + 1] <- 1

# Calculate Pathways
for (i in nLattice:1) {
    for (j in nLattice:1) {
        lattice[i,j] <- lattice[i+1, j] + lattice[i, j+1]
    }
}

answer <- lattice[1,1]
print(answer)

Taxicab Geometry

Create Air Travel Route Maps in ggplot: A Visual Travel Diary

Create Air Travel Route Maps: Emirates Route MapI have been lucky to fly to a few countries around the world. Like any other bored traveller, I thumb through the airline magazines and look at the air travel route maps. These maps are beautifully stylised depictions of the world with gently curved lines between the many destinations. I always wanted such a map for my own travel adventures.

Create Air Travel Route Maps using ggplot2

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.

# Read flight list
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# Lookup coordinates
library(ggmap)
airports <- unique(c(flights$From, flights$To))
coords <- geocode(airports)
airports <- data.frame(airport=airports, coords)

We now we have a data frame of airports with their coordinates and can create air travel route maps. The data frames are merged so that we can create air travel route maps using the curve geom. The borders function of ggplot2 creates the map data. The ggrepel package helps to prevent overplotting of text.

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="To", by.y="airport")
flights <- merge(flights, airports, by.x="From", by.y="airport")

# Plot flight routes
library(ggplot2)
library(ggrepel)
worldmap <- borders("world", colour="#efede1", fill="#efede1") # create a layer of borders
ggplot() + worldmap + 
 geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y), col = "#b29e7d", size = 1, curvature = .2) + 
 geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") + 
 geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), col = "black", size = 2, segment.color = NA) + 
 theme(panel.background = element_rect(fill="white"), 
 axis.line = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_blank(),
 axis.ticks = element_blank(),
 axis.title.x = element_blank(),
 axis.title.y = element_blank()
 )

I also tried to use ggmap package to display the maps to get a satellite image background. This did not work because the curve geom struggles with the map projection methods used in ggmap. Another problem is that the flight from Auckland to Los Angeles is drawn the wrong way. I hope no flat-earthers will see this map because they might use it as prove that the world is flat.

Alternative Visualisation

Another way of visualising this type of data is using a network diagram provided by the igraph package. This visualisation shows the logic between the locations and not their spatial locations.

# Network visualisation
library(igraph)
edgelist <- as.matrix(flights[c("From", "To")])
g <- graph_from_edgelist(edgelist, directed = TRUE)
g <- simplify(g)
par(mar=rep(0,4))
plot.igraph(g, 
 edge.arrow.size=0,
 edge.color="black",
 edge.curved=TRUE,
 edge.width=2,
 vertex.size=3,
 vertex.color=NA, 
 vertex.frame.color=NA, 
 vertex.label=V(g)$name,
 vertex.label.cex=3,
 layout=layout.fruchterman.reingold
)

Euler Problem 14: Longest Collatz Sequence

Euler Problem 14 looks at the Collatz Conjecture. These playful sequences, named after German mathematician Lothar Collatz (1910–1990), cause mathematicians a lot of headaches. This video introduces the problem much better than I can describe it.

Euler Problem 14 Definition

The following iterative sequence is defined for the set of positive integers:

n \rightarrow n/2 ( n is even)
n \rightarrow 3n + 1 ( n is odd)

Using the rule above and starting with 13, we generate the following sequence:

13 \rightarrow 40 \rightarrow 20 \rightarrow 10 \rightarrow 5 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1

This sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1. Which starting number, under one million, produces the longest chain? Note: Once the chain starts the terms are allowed to go above one million.

Solution

This problem is highly computationally intensive and it highlights R’s lack of speed. Generating one million Collatz sequences and finding the longest one requires a lot more than a minute of processing time allowed for in Project Euler.

collatz.chain <- function(n) {
    chain <- vector()
    i <- 1
    while (n! = 1) {
        if (n%%2 == 0) 
            n <- n / 2
        else
            n <- 3 * n + 1
        chain[i] <- n
        i <- i + 1
    }
    return(chain)
}
answer <- 0
collatz.max <- 0
for (n in 1:1E6) {
    collatz.length <- length(collatz.chain(n)) 
    if (collatz.length > collatz.max) {
        answer <- n
        collatz.max <- collatz.length
        }
}
print(answer)

The second version of the code contains some optimisations. The code stores the length of all sequences in an array. When the code generates a sequence and lands on a number already analysed, then it adds that previous number to the current one and moves on. This approach requires more memory but saves a lot of computation time. A minor tweak to the code optimises the rule for uneven numbers. Tripling an uneven number and adding one will always result in an even number so we can skip one step. This solution is more than twice as fast as the first version.

collatz.length <- vector(length=1e6)
collatz.length[1] <- 0
for (n in 2:1e6) {
    x <- n
    count <- 0 while (x != 1 & x >= n) {
        if (x %% 2 == 0) {
            x <- x / 2
            count <- count + 1
        }
        else {
            x <- (3 * x + 1) / 2
            count <- count + 2
        } 
    }
    count <- count + collatz.length[x]
    collatz.length[n] <- count
}
answer <- which.max(collatz.length)
print(answer)

Visualising Collatz Sequences

The Collatz sequence is an example of a simple mathematical rule that can create an unpredictable pattern. The number of steps required to reach 1 is listed in A006577 of the Online Encyclopedia of Integer Sequences.

The image below visualises the number of steps for the first 1000 positive numbers. The scatterplot shows some interesting patterns. Does this visualisation show that the Collatz Sequence does have a pattern after all?

Euler Problem 14: Number of halving and tripling steps to reach 1 in the Collatz problem.

Collatz Chains

The Collatz sequences can also be visualised using networks. Each step between two numbers is an edge and the numbers are the vertices. For example, the network for the Collatz sequence for number 10 is 5–16, 16–8, 8–4, 4–2, 2–1. When generating subsequent sequences the network will start to overlap and a tree of sequences appears. The tree below combines the Collatz sequences for the numbers 2 to 26. Number 27 has a very long sequence, making the tree much harder to read.

Network of Collatz sequences n=2-26

Network of Collatz sequences n=2-26

edgelist <- data.frame(a = 2, b = 1)
for (n in 3:26) {
   chain <- as.character(c(n, collatz.chain(n)))
   chain <- data.frame(a = chain[-length(chain)], b = chain[-1])
   edgelist <- rbind(edgelist, chain)
}
library(igraph)
g <- graph.edgelist(as.matrix(edgelist))
g <- simplify(g)
par(mar=rep(0,4))
V(g)$color <- degree(g, mode = "out") + 1
plot(g, 
     layout=layout.kamada.kawai,
     vertex.color=V(g)$color, 
     vertex.size=6,
     vertex.label.cex=.7,
     vertex.label.color="black",
     edge.arrow.size=.1,
     edge.color="black"
     )

 

Euler Problem 13: Large Sum of 1000 Digits

Euler Problem 13 asks to add one hundred numbers with fifty digits. This seems like a simple problem where it not that most computers are not designed to deal with numbers with a lot of integers. For example:

 2^{64} = 18446744073709551616

When asking R to compute this value we get 1.844674e+19, losing most of the digits and limiting the accuracy of the results. Computers solve this problem using Arbitrary-precision Arithmetic. There are many software libraries that can process long integers without loosing accuracy. Euler Problem 13 requires this type of approach.

Euler Problem 13 Definition

Work out the first ten digits of the sum of the following one-hundred 50-digit numbers.

Solution

The easy way to solve this problem is to use the gmp package for working with very large integers. This package uses a special number types such as Big Rational and Big Integer. The number of digits in these number types is only limited by the size of the memory.

library(gmp)
numbers <- readLines("Euler/p013_numbers.txt")
digits <- sum(as.bigz(numbers))
answer <- substr(as.character(digits),1,10)

Using Base-R

To find the solution to this problem using only base R, I wrote a function to add numbers using strings instead of integers. The function adds leading zeros to the smallest number to make them both the same length. The function then proceeds to add numbers in the same way we were taught in primary school. This function can in principle be used for several other Euler Problems using large integers.

# Add numbers with many digits
big.add <- function(a, b) {
    # Add leading zeros to smallest numer
    if (nchar(a) < nchar(b)) 
        a <- paste0(paste(rep(0, nchar(b) - nchar(a)), collapse = ""), a) 
    if (nchar(a) > nchar(b)) 
        b <- paste0(paste(rep(0, nchar(a) - nchar(b)), collapse = ""), b)
    solution <- vector()
    remainder <- 0
    for (i in nchar(b):1) {
        p <- as.numeric(substr(a, i, i))
        q <- as.numeric(substr(b, i, i))
        r <- p + q + remainder 
        if (r >= 10 & i!=1) {
            solution <- c(solution, r %% 10)
            remainder <- (r - (r %% 10))/10
        } else {
            solution <- c(solution, r)
            remainder <- 0
        }
    }
    return(paste(rev(solution), collapse = ""))
}

With this function, the problem is easy to solve. The second part of the code runs this function over the one hundred numbers provided on the Euler Problem page and calculates the answer.

numbers <- readLines("Euler/p013_numbers.txt")
for (i in numbers) {
    answer <- big.add(answer, i)
}
answer <- substr(answer, 1, 10)

Multiplying Big Numbers

You can expand this function to multiply a very large number with a smaller number using the Reduce function. This function adds the number a to itself, using the big.add function. The outcome of the addition is used in the next iteration until it has been repeated b times. The number b in this function needs to be a ‘low’ number because it uses a vector of the length b.

big.mult <- function(a, b) {
    Reduce(big.add, rep(a, as.numeric(b)))
}

Euler Problem 12: Highly Divisible Triangular Number

Euler Problem 12: Divisors of triangular numbers.

The divisors of 10 illustrated with Cuisenaire rods: 1, 2, 5, and 10 (Wikipedia).

Euler Problem 12 takes us to the realm of triangular numbers and proper divisors.

The image on the left shows a hands-on method to visualise the number of divisors of an integer. Cuisenaire rods are learning aids that provide a hands-on way to explore mathematics.

Euler Problem 12 Definition

The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28 . The first ten terms would be: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, \ldots Let us list the factors of the first seven triangle numbers:

1: 1

3: 1, 3

6: 1, 2, 3, 6

10: 1, 2, 5, 10

15: 1, 3, 5, 15

21: 1, 3, 7 ,21

28: 1, 2, 4, 7, 14, 28

We can see that 28 is the first triangle number to have over five divisors. What is the value of the first triangle number to have over five hundred divisors?

Solution

Vishal Kataria explains a simple method to determine the number of divisors using prime factorization as explained by in his video below. The prime factorization of n is given by:

n = p^{\alpha_1}_1 \times p^{\alpha_2}_2 \times p^{\alpha_k}_k

The number of proper divisors is:

d = (\alpha_1 + 1) (\alpha_2 + 1) \ldots (\alpha_k + 1)

The code reuses the prime factorisation function developed for Euler Problem 3. This function results in a vector of all prime factors, e.g. the prime factors of 28 are 2, 2 and 7.

The code to solve this problem determines the values for alpha using the run length function. This function counts the number of times each element in a sequence is repeated. The outcome of this function is a vector of the values and the number of times each is repeated. The prime factors of 28 are 2 and 7 and their run lengths are 2 and 1. The number of divisors can now be determined.

28 = 2^2 \times 7^1

d = (2+1)(1+1) = 6

The code to solve Euler Problem 12 is shown below. The loop continues until it finds a triangular number with 500 divisors. The first two lines increment the index and create the next triangular number. The third line in the loop determines the number of times each factor is repeated (the run lengths). The last line calculates the number of divisors using the above-mentioned formula.

i <- 0
divisors <- 0
while (divisors < 500) {
    i <- i + 1
    triangle <- (i * (i+1)) / 2
    pf <- prime.factors(triangle)
    alpha <- rle(pf)
    divisors <- prod(alpha$lengths+1)
}
answer <- triangle
print(answer)

Percentile Calculations in Water Quality Regulations

Percentile Calculations in Water Quality RegulationsPercentile calculations can be more tricky than at first meets the eye. A percentile indicates the value below which a percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceivingly simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile.

set.seed(1969)
test.data <- rnorm(n = 10000, mean = 100, sd = 15)
library(ggplot2)
ggplot(as.data.frame(test.data), aes(test.data)) + 
 geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") + 
 geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), colour = "red", size = 1) + 
 geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), 
 colour = "red", fill="red", alpha = 0.5, xlim = quantile(test.data, c(0.5, 0.75))) + 
 theme(text = element_text(size = 16))

The quantile default function and the 95th percentile give the following results:

> quantile(test.data)
       0%       25%       50%       75%      100% 
 39.91964  89.68041 100.16437 110.01910 153.50195 

> quantile(test.data, probs=0.95)
     95% 
124.7775 

Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the normal distribution simulated above the outcome for all methods is exactly the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(test.data, 0.95, type = m))

     95%      95%      95%      95%      95%      95%      95%      95%      95% 
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 

Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name \hat{Q}_8(p). Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6).

weibull.quantile <- function(x, p) {
    # Order Samples from large to small
    x <- x[order(x, decreasing = FALSE)]
    # Determine ranking of percentile according to Weibull (1939)
    r <- p * (length(x) + 1)
    # Linear interpolation
    rfrac <- (r - floor(r))
    return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1])
}

Turbidity Data Example

Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The range of weekly turbidity measurements is between 0.,05 NTU and 0.8 NTU, well below the aesthetic limits.

http://r.prevos.net/wp-content/uploads/sites/11/2017/02/turbidity.png

Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53).

When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result.

Zone R1 R2 R3 R4 R5 R6 R7 R8 R9
Bealiba 0.300 0.300 0.200 0.240 0.290 0.300 0.245 0.300 0.300
Dunolly 0.40000 0.40000 0.30000 0.34000 0.39000 0.43500 0.34500 0.40500 0.40125
Laanecoorie 0.50000 0.50000 0.40000 0.44000 0.49000 0.53500 0.44500 0.50500 0.50125
Tarnagulla 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4

The graph and the table were created with the following code snippet:

ggplot(turbidity, aes(Result)) + 
 geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + 
 facet_wrap(~Zone) + 
 theme(text=element_text(size=16))

tapply(turbidity$Result, turbidity$Zone, 
 function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))

Euler Problem 11: Largest Product in a Grid

Euler Problem 11 Definition

In the 20×20 grid below, four numbers along a diagonal line have been marked in red.

Euler problem 11

The product of these numbers is 26 × 63 × 78 × 14 = 1,788,696. What is the greatest product of four adjacent numbers in the same direction (up, down, left, right, or diagonally) in the 20 by 20 grid?

Solution

The solution applies straightforward vector arithmetic. The product of all verticals is an array of the product of rows 1 to 4, rows 2 to 5 and so on. The code uses a similar logic for the horizontals and the diagonals.

#Read and convert data
square <- readLines("Euler/p011_matrix.txt")
square <- as.numeric(unlist(lapply(square, function(x){strsplit(x, " ")})))
square <- matrix(square, ncol=20)

# Define products
prod.vert <- square[1:17, ] * square[2:18, ] * square[3:19, ] * square[4:20, ]
prod.hori <- square[,1:17] * square[,2:18] * square[,3:19] * square[,4:20]
prod.dia1 <- square[1:17, 1:17] * square[2:18, 2:18] * square[3:19, 3:19] * square[4:20, 4:20]
prod.dia2 <- square[4:20, 1:17] * square[3:19, 2:18] * square[2:18, 3:19] * square[1:17, 4:20]

answer <- max(prod.vert, prod.hori, prod.dia1, prod.dia2)