Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as data science manager for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data fail to achieve them because they don’t start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

  • Utility: The analysis needs to be able to improve reality, otherwise we end up with ‘analysis paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth; we seek to improve reality.
  • Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
  • Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
  • Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle to leverage the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals who write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.

Euler Problem 10: Summation of Primes

Euler Problem 10 asks for the summation of primes. Computationally this is a simple problem because we can re-use the prime sieve developed for problem 3.

When generating a large number of primes, the erratic pattern in which they occur is much more interesting than their sum. Mathematicians consider primes the basic building blocks of number theory. No matter how hard we look, however, they do not seem to follow any logical sequence.

Euler Problem 10 Definition

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.


The sieve of Eratosthenes function used in Euler Problem 3 can be reused once again to generate the prime numbers between two and two million. An interesting problem occurs when running the code: when summing all the primes without the as.numeric conversion, R throws an integer overflow warning and recommends the conversion.
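The original esieve function lives in the Problem 3 post. For readers who don’t have it handy, a basic implementation along these lines (an illustrative reconstruction, not necessarily identical to the original) returns all primes up to n:

```r
# Sieve of Eratosthenes: returns all primes up to n (illustrative reconstruction)
esieve <- function(n) {
    sieve <- rep(TRUE, n)  # Assume every number is prime to start with
    sieve[1] <- FALSE      # 1 is not prime
    if (n >= 4) {
        for (i in 2:floor(sqrt(n))) {
            # Mark all multiples of i, starting at i^2, as composite
            if (sieve[i]) sieve[seq(i * i, n, by = i)] <- FALSE
        }
    }
    which(sieve)           # Indices still flagged TRUE are prime
}

esieve(10)  # 2 3 5 7
```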

primes <- esieve(2e6)
answer <- sum(as.numeric(primes))

This results in a vector containing the first 148,933 prime numbers. The largest prime gap is 132, and sexy primes (pairs of primes that differ by six) are more common than twin or cousin primes (note the spikes at intervals of 6 in the bar chart).
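The gap analysis behind the bar chart can be sketched as follows (re-running a basic sieve so that the snippet is self-contained):

```r
# Generate the primes below two million with a basic sieve
n <- 2e6
sieve <- rep(TRUE, n)
sieve[1] <- FALSE
for (i in 2:floor(sqrt(n)))
    if (sieve[i]) sieve[seq(i * i, n, by = i)] <- FALSE
primes <- which(sieve)

gaps <- diff(primes)   # Differences between consecutive primes
max(gaps)              # Largest gap below two million: 132
head(sort(table(gaps), decreasing = TRUE), 3)  # Gap 6 (sexy primes) tops the list
```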

Euler Problem 10: Prime gaps for all primes up to two million

The summing of primes reveals an interesting problem in mathematics. Goldbach’s conjecture is one of the oldest and best-known unsolved problems in number theory and states that:

Every even integer greater than 2 can be expressed as the sum of two primes.

Note that this conjecture is only about even numbers. Goldbach also theorised that every odd composite number can be written as the sum of a prime and twice a square. This conjecture is the subject of Euler Problem 46, which I will work on soon.
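Goldbach’s conjecture is easy to verify empirically for small numbers. The hypothetical helper below (not part of the Euler solution) finds the first prime pair summing to a given even number:

```r
# Simple trial-division primality test
is.prime <- function(k) {
    if (k < 2) return(FALSE)
    if (k < 4) return(TRUE)
    all(k %% 2:floor(sqrt(k)) != 0)
}

# Find the first pair of primes that sums to the even number n
goldbach <- function(n) {
    for (p in 2:(n / 2))
        if (is.prime(p) && is.prime(n - p)) return(c(p, n - p))
    NULL  # No pair found would be a counterexample to the conjecture
}

goldbach(28)  # 5 23
```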

SCADA spikes in Water Treatment Data

SCADA spikes are events in the data stream of water treatment plants or similar installations. These spikes can indicate problems with the process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure of the cloudiness of a fluid caused by large numbers of individual particles that are otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language.


To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at one-minute spacing, and the rnorm function creates 1440 turbidity readings with a mean of 0.1 NTU and a standard deviation of 0.01 NTU.

# Simulate data
turbidity <- data.frame(DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"),
                                              by = "min", length.out = 24 * 60),
                        Turbidity = rnorm(n = 24 * 60, mean = 0.1, sd = 0.01))

The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
    # Pick a start time that leaves room for the longest possible spike
    time <- sample(turbidity$DateTime[1:(24 * 60 - 30)], 1)
    duration <- sample(10:30, 1)
    value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
    start <- which(turbidity$DateTime == time)
    turbidity$Turbidity[start:(start + duration - 1)] <- rnorm(duration, value, value / 10)
}

The image below visualises the simulated data using the mighty ggplot. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

ggplot(turbidity, aes(x = DateTime, y = Turbidity)) + geom_line(size = 0.2) + 
   geom_hline(yintercept = 0.5, col = "red") + ylim(0,max(turbidity$Turbidity)) + 
   ggtitle("Simulated SCADA data")

Simulated SCADA data with spikes

SCADA Spikes Detection

The following code searches for all spikes over 0.50 NTU using the run-length encoding function rle. This function condenses a vector into the sequence of its values and the lengths of their runs. For example, the run-length encoding of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

  • lengths: int [1:5] 2 3 4 2 1
  • values : num [1:5] 1 2 3 5 6

The value 1 has a length of 2, the value 2 has a length of 3, and so on. The spike detection code creates the run lengths for turbidity levels greater than 0.5 NTU, which results in a logical vector. The cumsum function calculates the end point of each run, from which we can calculate the duration of each spike.
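A quick illustration of rle and cumsum on the example vector:

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6)
runlength <- rle(x)
runlength$lengths          # Run lengths: 2 3 4 2 1
runlength$values           # Run values:  1 2 3 5 6
cumsum(runlength$lengths)  # End position of each run: 2 5 9 11 12
```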

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. In this simulation, the spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other spikes are either lower than 0.50 NTU or shorter than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
    runlength <- rle(Value > Height)
    spikes <- data.frame(Spike = runlength$values,
                         times = cumsum(runlength$lengths))
    spikes$Times <- DateTime[spikes$times]
    spikes$Event <- c(0, difftime(spikes$Times[-1], spikes$Times[-nrow(spikes)],
                                  units = "mins"))
    subset(spikes, Spike == TRUE & Event > Duration)
}

spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)

This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

Euler Problem 9: Special Pythagorean Triple

Euler Problem 9 Definition

Scatter plot of the legs (a, b) of the first Pythagorean triples with a and b less than 6000. Negative values are included to illustrate the parabolic patterns. (Image: Dearjean13, own work, CC BY-SA 4.0)

A Pythagorean triple is a set of three natural numbers, a < b < c , for which, a^2 + b^2 = c^2 . For example:

3^2 + 4^2 = 9 + 16 = 25 = 5^2 .

There exists exactly one Pythagorean triplet for which a + b + c = 1000 .

Find the product of a, b and c.

Brute Force Solution

This solution uses brute force and checks all combinations of a, b and c. To limit the solution space, I used the fact that a < b < c, which implies that a < s/3 and b < s/2, where s is the sum of the three sides.

# Brute-force search for the Pythagorean triple with a + b + c = 1000
s <- 1000
found <- FALSE
for (a in 1:floor(s / 3)) {
    for (b in a:(s / 2)) {
        c <- s - a - b
        if (a^2 + b^2 == c^2) {
            found <- TRUE
            break
        }
    }
    if (found) break
}
answer <- a * b * c


Trumpworld Analysis : Ownership Relations in his Business Network

You do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial.

One of the most discussed aspects of his reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with national and international politics.

Buzzfeed has mapped many of the relationships between businesses and people in what they have dubbed Trumpworld. They have made the data available to enable citizen data science into the wheeling and dealing of Donald J. Trump. The raw data set consists of three subsets of connections between:

  • Organisations
  • People
  • People and organisations

Trumpworld Analysis

This post analyses the connections between organisations using the mighty igraph package for the R language. The code snippet below converts the data to a graph that can be analysed using social network analysis techniques. I have downloaded the raw data as CSV files and subsetted it to contain only ownership relationships.

library(igraph)

# Read data
trumpworld.org <- read.csv("TrumpWorld DataOrg.csv")
trumpworld.org <- subset(trumpworld.org, Connection == "Ownership")[, 1:2]

# Create graph of ownerships
org.ownership <- graph.edgelist(as.matrix(trumpworld.org))

# Analysis
vcount(org.ownership)  # Number of firms
ecount(org.ownership)  # Number of ownership relations

# Plot graph
par(mar = rep(0, 4))
plot(org.ownership,
     layout = layout.fruchterman.reingold,
     vertex.label = NA, vertex.size = 2,
     edge.arrow.size = 0.2)
Network Analysis

This network contains 309 ownership relationships between 322 firms.

When we plot the data, we see that most relationships are between two firms. The plot is organised with the Fruchterman-Reingold algorithm to improve its clarity.

We can also see a large cluster in the centre. The names have been removed for clarity.

The Trumpworld analysis continues with this conglomerate. The second code section extracts this connected subnetwork so we can analyse it in more detail.

# Find the most connected firm
which.max(degree(org.ownership))

# Create subnetworks
org.ownership.d <- decompose(org.ownership)

# Find largest subnetwork
largest <- which.max(sapply(org.ownership.d, diameter))

# Plot largest subnetwork
plot(org.ownership.d[[largest]],
     layout = layout.fruchterman.reingold,
     vertex.label.cex = 0.5, vertex.size = 2,
     edge.arrow.size = 0.2)

Digging Deeper

The node with the highest degree identifies the business with the most holdings. This analysis shows that DJT Holdings LLC owns 33 other organisations. These organisations own other organisations. We can now use the cluster function to investigate this subnetwork.

Trumpworld holdings

This Trumpworld analysis shows that the ownership network is clearly a star network, with DJT Holdings LLC centrally controlling all organisations. Perhaps this graph visualises the management style of the soon-to-be president. Trump centrally controls his empire, which is typical of a family business.

Does this chart visualise Trump’s leadership style? Is the star network an expression of his lack of trust and thus desire to oversee everything directly?

Finding antipodes using the globe and ggmap packages

The antipode of each point on the Earth’s surface: the areas where the blue and yellow maps overlap are the land antipodes.

When I was a kid, I was fascinated by the conundrum of what happens when you drill a hole straight through the centre of the earth. I always believed that I would turn up in Australia. But is this really the case?

The antipodes of any place on earth is the place that is diametrically opposite to it. A pair of antipodes are connected by a straight line running through the centre of the Earth. These points are as far away from each other as is possible on this planet. Two people are antipodes when they live on opposite sides of the globe. Their feet (πούς/pous in Ancient Greek) are directly opposite each other.

How can we use coding in R to solve this conundrum?

Using R to find antipodes

We can roughly recreate the antipodean map on Wikipedia with the globe package. This package, written by Adrian Baddeley, plots 2D and 3D views of the earth. The package contains a data file with major coastlines that can be used to create a flipped map of the world.

To turn a spatial location into its antipode, you subtract 180 degrees from the longitude and reverse the sign of the latitude, as shown below.
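This coordinate transformation can be captured in a small helper function (a hypothetical illustration; neither the globe nor the ggmap package provides this):

```r
# Antipodal coordinates of a point given as (longitude, latitude) in degrees
antipode.coords <- function(lon, lat) {
    lon2 <- lon - 180
    if (lon2 < -180) lon2 <- lon2 + 360  # Wrap longitude back into [-180, 180]
    c(lon = lon2, lat = -lat)
}

antipode.coords(145, -37)  # Roughly Melbourne: lon -35, lat 37 (North Atlantic)
```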

library(globe)

# Create antipodean map
flip <- earth
flip$coords[, 1] <- flip$coords[, 1] - 180
flip$coords[, 2] <- -flip$coords[, 2]
par(mar = rep(0, 4))  # Remove plotting margins
globeearth(eye = c(90, 0), col = "blue")
globepoints(loc = flip$coords, eye = c(90, 0), col = "red", pch = ".")

We can also use the ggmap package to visualise antipodes. This package, developed by David Kahle and antipodean R-guru Hadley Wickham, has a neat geocoding function to obtain a spatial location. The antipode function below takes the description of a location and a zoom level to plot a dot on the antipodal location. The gridExtra package is used to create a faceted map, which is not otherwise possible in ggmap.


library(ggmap)
library(gridExtra)

antipode <- function(location, zm = 6) {
    # Map the location
    lonlat <- geocode(location)
    loc1 <- get_map(lonlat, zoom = zm)
    map1 <- ggmap(loc1) +
        geom_point(data = lonlat, aes(lon, lat), colour = "red", size = 10)
    # Define the antipode
    lonlat$lon <- lonlat$lon - 180
    if (lonlat$lon < -180) lonlat$lon <- 360 + lonlat$lon
    lonlat$lat <- -lonlat$lat
    loc2 <- get_map(lonlat, zoom = zm)
    map2 <- ggmap(loc2) +
        geom_point(data = lonlat, aes(lon, lat), colour = "red", size = 10)
    grid.arrange(map1, map2, nrow = 1)
}

antipode("Rector Nelissenstraat 47 Hoensbroek", 4)

This code solves the problem I was thinking about as a child. Running the code shows that the antipode of the home I grew up in is not in Australia, but quite a way south of New Zealand. Another childhood fantasy shattered …

Antipodes using ggmap and gridExtra.

Euler Problem 8: Largest Product in a Series

Euler Problem 8 is a combination of mathematics and text analysis. The problem is defined as follows:

Euler Problem 8 Definition

The four adjacent digits in the 1,000-digit number below that have the greatest product are 9 \times 9 \times 8 \times 9 = 5832.


Find the thirteen adjacent digits in the 1,000-digit number that have the greatest product. What is the value of this product?


The first step is to define the digits as a character string. The answer is found by cycling through all 13-character n-grams to find the highest product using the prod function.

# Define digits
digits <- "7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450"

ngram <- 13  # Define length
answer <- 0
# Cycle through all positions in the digit string
for (i in 1:(nchar(digits) - ngram + 1)) {
    # Pick 13 consecutive digits
    adjacent <- substr(digits, i, i + ngram - 1)
    # Calculate the product of the digits
    mult <- prod(as.numeric(unlist(strsplit(adjacent, ""))))
    # Keep the largest product found so far
    if (mult > answer)
        answer <- mult
}

Finding n-grams is a basic function of text analysis called tokenization. The Google Ngram Viewer provides the ability to search for word strings (n-grams) in the massive Google books collection. This type of data can be used to define writing styles and analyse the evolution of language.
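Character-level tokenisation of the kind used above can be written as a one-line function (an illustrative sketch, not a function from a text-mining package):

```r
# Split a string into all overlapping character n-grams
ngrams <- function(s, n) {
    sapply(1:(nchar(s) - n + 1), function(i) substr(s, i, i + n - 1))
}

ngrams("73167", 3)  # "731" "316" "167"
```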

Euler Problem 7: 10,001st Prime

Euler Problem 7 Definition

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13. What is the 10,001st prime number?


This problem can only be solved using brute force because prime gaps (sequence A001223 in the OEIS) do not follow a predictable pattern. The sieve of Eratosthenes used in Euler Problem 3 can be reused to generate candidate primes. The is.prime function below determines whether a number is a prime number by checking that it is not divisible by any prime up to the square root of that number.

is.prime <- function(n) {
    primes <- esieve(ceiling(sqrt(n)))
    all(n %% primes != 0)
}

i <- 2  # First prime
n <- 1  # Start counter
while (n < 10001) {     # Find the 10,001st prime
    i <- i + 1          # Next candidate
    if (is.prime(i)) {  # Test the candidate
        n <- n + 1      # Increment counter
        i <- i + 1      # Next prime is at least two away
    }
}
answer <- i - 1

The largest prime gap for the first 10,001 primes is 72. Sexy primes with a gap of 6 are the most common and there are 1270 twin primes.

Prime gap frequency distribution for the first 10001 primes.

Euler Problem 6: Sum Square Difference

Euler Problem 6 Definition

The sum of the squares of the first ten natural numbers is:

1^2 + 2^2 + \ldots + 10^2 = 385

The square of the sum of the first ten natural numbers is:

(1 + 2 + \ldots + 10)^2 = 55^2 = 3025

The difference between the sum of the squares of the first ten natural numbers and the square of the sum is 3025 - 385 = 2640. Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.


This is a straightforward problem for vector processing capabilities in R.

answer <- sum(1:100)^2-sum((1:100)^2)

This problem can also be solved arithmetically. When Carl Friedrich Gauss (1777-1855) was a child, his teacher challenged the students to add all numbers from 1 to 100. All of his classmates struggled to add the 100 numbers one by one, but Carl completed the task in a few seconds.

The same principle applies to computing. The first solution is like Carl’s classmates, slavishly adding all numbers. The second solution uses the properties of arithmetic progressions.

The sum of natural numbers can be expressed as:

\frac{n(n + 1)}{2}

The sum of the squares of the first n natural numbers is:

\frac{n(n + 1)(2n + 1)}{6}

With these two formulas, a fast solution can be written without using loops.

n <- 100
answer <- ((n*(n+1))/2)^2 - (n*(n+1)*(2*n+1))/6

Euler Problem 5: Smallest Multiple

Euler Problem 5 relates to the divisibility of numbers.

Euler Problem 5

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder. What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?


The solution will also be divisible by the numbers 1 to 10, so we can start at 2520 and increment by 2520. The loop checks whether each multiple is divisible by all numbers from 1 to 20.

# Start with the smallest number divisible by 1:10
i <- 2520
# Check consecutive multiples for divisibility by 1:20
while (sum(i %% (1:20)) != 0) {
    i <- i + 2520  # Increase by the smallest number divisible by 1:10
}
answer <- i
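As an aside, the same answer also follows from number theory: the smallest number divisible by 1 to 20 is their least common multiple. A sketch using Euclid’s algorithm (these helper functions are not part of base R):

```r
# Greatest common divisor via Euclid's algorithm
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

# Least common multiple of two numbers
lcm <- function(a, b) a / gcd(a, b) * b

# Fold the least common multiple over 1 to 20
Reduce(lcm, 1:20)  # 232792560
```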