# Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as manager of data science for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data fail to achieve them because they don’t start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

• Utility: The analysis needs to be able to improve reality; otherwise we end up with ‘analysis paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth but an improvement of reality.
• Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
• Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
• Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

## Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle leveraging the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals who write code in R. Feel free to comment below to discuss any thoughts you might have on this topic.

# Euler Problem 10: Summation of Primes

Euler Problem 10 asks for the summation of primes. Computationally this is a simple problem because we can re-use the prime sieve developed for problem 3.

When generating a large number of primes, the erratic pattern in which they occur is much more interesting than their sum. Mathematicians consider primes the basic building blocks of number theory. No matter how hard we look, however, they do not seem to obey any logical sequence. This Euler Problem is simply an exercise in adding them up.

## Euler Problem 10 Definition

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.

## Solution

The sieve of Eratosthenes function used in Euler Problem 3 can be reused once again to generate the prime numbers between two and two million. An interesting issue arises when running the code: summing the primes without the as.numeric conversion causes R to throw an integer overflow error and recommend the conversion.
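The esieve function is defined in the earlier post on Euler Problem 3. To keep this post self-contained, a minimal sketch of such a sieve might look like the following (an illustrative stand-in, not the original code):

```r
# Minimal sieve of Eratosthenes: returns all primes up to n (assumes n >= 4)
esieve <- function(n) {
    sieve <- rep(TRUE, n)   # Assume every number is prime
    sieve[1] <- FALSE       # 1 is not prime
    for (i in 2:floor(sqrt(n))) {
        if (sieve[i]) sieve[seq(i * i, n, by = i)] <- FALSE # Mark multiples
    }
    which(sieve)            # Indices still marked TRUE are prime
}
esieve(30) # 2 3 5 7 11 13 17 19 23 29
```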

primes <- esieve(2e6)
answer <- sum(as.numeric(primes))


This results in a vector containing the 148,933 prime numbers below two million. The largest prime gap is 132, and sexy primes (pairs that differ by 6) seem to be more common than other prime pairs (note the spikes at intervals of 6 in the bar chart).
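The gap statistics mentioned above can be reproduced directly from the primes vector; a short sketch:

```r
# Explore the gaps between consecutive primes below two million
gaps <- diff(primes)
max(gaps) # Largest prime gap
barplot(table(gaps), xlab = "Prime gap", ylab = "Frequency")
```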

The summing of primes reveals an interesting problem in mathematics. Goldbach’s conjecture is one of the oldest and best-known unsolved problems in number theory and states that:

Every even integer greater than 2 can be expressed as the sum of two primes.

Note that this conjecture is only about even numbers. Goldbach also theorised that every odd composite number can be written as the sum of a prime and twice a square. This is the subject of Euler Problem 46, which I will work on soon.
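Goldbach’s conjecture is also easy to check computationally for small numbers. A sketch, assuming the primes vector generated in the solution above:

```r
# Test whether every even number from 4 to 10,000 is a sum of two primes
goldbach <- function(n, primes) {
    candidates <- primes[primes <= n / 2]
    any((n - candidates) %in% primes) # Is n - p prime for some prime p?
}
all(sapply(seq(4, 10000, by = 2), goldbach, primes = primes))
```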

# SCADA spikes in Water Treatment Data

SCADA (Supervisory Control and Data Acquisition) spikes are sudden events in the data stream of water treatment plants or similar installations. These spikes can indicate problems with the process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure of the cloudiness of a fluid caused by large numbers of individual particles that are otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule in the R language.
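The first clause of the decision rule translates directly into R. A minimal sketch, using a hypothetical month of one-minute readings:

```r
# Hypothetical month of one-minute turbidity readings (values are made up)
set.seed(42)
Turbidity <- rnorm(30 * 24 * 60, mean = 0.1, sd = 0.05)
# Is individual filter turbidity <= 0.2 NTU for at least 95% of the month?
mean(Turbidity <= 0.2) >= 0.95
```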

## Simulation

To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing. In addition, the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU. The image below visualises the simulated data. The next step is to assess this data in accordance with the decision rule.

# Simulate data
set.seed(1234)
turbidity <- data.frame(
    DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"), by = "min", length.out = 24 * 60),
    Turbidity = rnorm(n = 24 * 60, mean = 0.1, sd = 0.01)
)


The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
    time <- sample(turbidity$DateTime, 1)
    duration <- sample(10:30, 1)
    value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
    start <- which(turbidity$DateTime == time)
    turbidity$Turbidity[start:(start + duration - 1)] <- rnorm(duration, value, value / 10)
}


The image below visualises the simulated data using the mighty ggplot2. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

library(ggplot2)
ggplot(turbidity, aes(x = DateTime, y = Turbidity)) +
    geom_line(size = 0.2) +
    geom_hline(yintercept = 0.5, col = "red") +
    ylim(0, max(turbidity$Turbidity))


The following code searches for all spikes over 0.50 NTU using the rle (run length encoding) function. This function transforms a vector into a vector of values and their run lengths. For example, the run length encoding of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

• lengths: int [1:5] 2 3 4 2 1
• values : num [1:5] 1 2 3 5 6

The value 1 has a run length of 2, the value 2 has a run length of 3, and so on. The spike detection code creates the run length encoding for turbidity levels greater than 0.5 NTU, which results in a logical vector. The cumsum function calculates the end point of each run, which allows us to calculate the duration of each spike.
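This behaviour of rle is easy to verify in the console:

```r
runlength <- rle(c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6))
runlength$lengths # 2 3 4 2 1
runlength$values  # 1 2 3 5 6
cumsum(runlength$lengths) # End position of each run: 2 5 9 11 12
```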

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other spikes were either lower than 0.50 NTU or, like the first high spike, lasted less than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
    runlength <- rle(Value > Height)
    spikes <- data.frame(Spike = runlength$values,
                         times = cumsum(runlength$lengths))
    spikes$Times <- DateTime[spikes$times]
    spikes$Event <- c(0, spikes$Times[-1] - spikes$Times[-nrow(spikes)])
    spikes <- subset(spikes, Spike == TRUE & Event > Duration)
    return(spikes)
}
spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)


This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

# Euler Problem 9: Special Pythagorean Triple

## Euler Problem 9 Definition

Scatter plot of the legs (a, b) of the first Pythagorean triples with a and b less than 6000. Negative values are included to illustrate the parabolic patterns. By Dearjean13, Own work, CC BY-SA 4.0, Link.

A Pythagorean triple is a set of three natural numbers, $a < b < c$, for which $a^2 + b^2 = c^2$. For example: $3^2 + 4^2 = 9 + 16 = 25 = 5^2$. There exists exactly one Pythagorean triplet for which $a + b + c = 1000$. Find the product of a, b and c.

## Brute Force Solution

This solution uses brute force and checks all combinations of a, b and c. To limit the solution space, I used the fact that a < b < c, which implies that a < s/3 and a < b < s/2, where s is the sum of the three sides.

a <- 0
b <- 0
c <- 0
s <- 1000
found <- FALSE
for (a in 1:floor(s / 3)) {
    for (b in a:(s / 2)) {
        c <- s - a - b
        if (a^2 + b^2 == c^2) {
            found <- TRUE
            break
        }
    }
    if (found) break
}
answer <- a * b * c


# Trumpworld Analysis: Ownership Relations in his Business Network

You do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial. One of the most discussed aspects of his reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with national and international politics. Buzzfeed has mapped many of the relationships between businesses and people in what they have dubbed Trumpworld.
They provided the data to enable citizen data science into the wheelings and dealings of Donald J. Trump. The raw data set consists of three subsets of connections between:

• Organisations
• People
• People and organisations

## Trumpworld Analysis

This post analyses the connections between organisations using the mighty igraph package in the R language. The code snippet below converts the data to a graph that can be analysed using social network analysis techniques. I downloaded the raw data as CSV files and subsetted it to contain only ownership relationships.

# Read data
trumpworld.org <- read.csv("TrumpWorld DataOrg.csv")
trumpworld.org.ownership <- subset(trumpworld.org, Connection == "Ownership")[, 1:2]

# Create graph of ownerships
library(igraph)
org.ownership <- graph.edgelist(as.matrix(trumpworld.org.ownership))

# Analysis
nrow(trumpworld.org.ownership)
length(unique(c(trumpworld.org.ownership[, 1], trumpworld.org.ownership[, 2])))

# Plot graph
par(mar = rep(0, 4))
plot(org.ownership,
     layout = layout.fruchterman.reingold,
     vertex.label = NA,
     vertex.size = 2,
     edge.arrow.size = 0.1)


## Network Analysis

This network contains 309 ownership relationships between 322 firms. When we plot the data, we see that most relationships are between two firms. The plot is organised with the Fruchterman-Reingold algorithm to improve its clarity. We can also see a large cluster in the centre; the names have been removed for clarity. The Trumpworld analysis continues with this conglomerate. The second code section excises this connected subnetwork so we can analyse it in more detail.
# Find most connected firm
which.max(degree(org.ownership))
# Create subnetworks
org.ownership.d <- decompose(org.ownership)
# Find largest subnetwork
largest <- which.max(sapply(org.ownership.d, diameter))
# Plot largest subnetwork
plot(org.ownership.d[[largest]],
     layout = layout.fruchterman.reingold,
     vertex.label.cex = 0.5,
     vertex.size = 5,
     edge.arrow.size = 0.1)


## Digging Deeper

The node with the highest degree identifies the business with the most holdings. This analysis shows that DJT Holdings LLC owns 33 other organisations. These organisations own other organisations in turn. We can now use the decompose function to investigate this subnetwork.

This Trumpworld analysis shows that the ownership network is clearly a star network. DJT Holdings LLC centrally controls all organisations. Perhaps this graph visualises the management style of the soon-to-be president Trump. He centrally controls his empire, which is typical for a family business. Does this chart visualise Trump’s leadership style? Is the star network an expression of his lack of trust and thus a desire to oversee everything directly?

# Finding antipodes using the globe and ggmap packages

The antipodes of each point on the Earth’s surface: the points where the blue and yellow overlap are the land antipodes.

When I was a kid, I was fascinated by the conundrum of what happens when you drill a hole straight through the centre of the earth. I always believed that I would turn up in Australia. But is this really the case?

The antipode of any place on earth is the place that is diametrically opposite to it. A pair of antipodes is connected by a straight line running through the centre of the Earth. These points are as far away from each other as is possible on this planet. Two people are antipodes when they live on opposite sides of the globe. Their feet (πούς/pous in Ancient Greek) are directly opposite each other. How can we use coding in R to solve this conundrum?
## Using R to find antipodes

We can roughly recreate the antipodean map on Wikipedia with the globe package. This package, written by Adrian Baddeley, plots 2D and 3D views of the earth. The package contains a data file with major coastlines that can be used to create a flipped map of the world. To turn a spatial location into its antipode, you subtract 180 degrees from the longitude and reverse the sign of the latitude, as shown below.

library(globe)
# Create antipodean map
flip <- earth
flip$coords[, 1] <- flip$coords[, 1] - 180
flip$coords[, 2] <- -flip$coords[, 2]
par(mar = rep(0, 4)) # Remove plotting margins
globeearth(eye = c(90, 0), col = "blue")
globepoints(loc = flip$coords, eye = c(90, 0), col = "red", pch = ".")


We can also use the ggmap package to visualise antipodes. This package, developed by David Kahle and antipodean R-guru Hadley Wickham, has a neat geocoding function to obtain a spatial location. The antipode function takes the description of a location and a zoom level to plot a dot on the antipodal location. The gridExtra package is used to create a faceted map, which is not otherwise possible in ggmap.

library(ggmap)
library(gridExtra)

antipode <- function(location, zm = 6) {
    # Map location
    lonlat <- geocode(location)
    loc1 <- get_map(lonlat, zoom = zm)
    map1 <- ggmap(loc1) +
        geom_point(data = lonlat, aes(lon, lat, col = "red", size = 10)) +
        theme(legend.position = "none")
    # Define antipode
    lonlat$lon <- lonlat$lon - 180
    if (lonlat$lon < -180) lonlat$lon <- 360 + lonlat$lon
    lonlat$lat <- -lonlat$lat
    loc2 <- get_map(lonlat, zoom = zm)
    map2 <- ggmap(loc2) +
        geom_point(data = lonlat, aes(lon, lat, col = "red", size = 10)) +
        theme(legend.position = "none")
    grid.arrange(map1, map2, nrow = 1)
}

antipode("Rector Nelissenstraat 47 Hoensbroek", 4)


This code answers the question I pondered as a child. Running the code shows that the antipode of the home I grew up in is not in Australia but quite a way south of New Zealand. Another childhood fantasy shattered …

Antipodes using ggmap and gridExtra.

# Euler Problem 8: Largest Product in a Series

Euler Problem 8 is a combination of mathematics and text analysis. The problem is defined as follows:

## Euler Problem 8 Definition

The four adjacent digits in the 1,000-digit number below that have the greatest product are $9 \times 9 \times 8 \times 9 = 5832$.

73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450

Find the thirteen adjacent digits in the 1,000-digit number that have the greatest product. What is the value of this product?

## Solution

The first step is to define the digits as a character string. The answer is found by cycling through all 13-character n-grams to find the highest product using the prod function.

# Define digits
digits <- "7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450"

ngram <- 13 # Define length
# Cycle through digits
answer <- 0
for (i in 1:(nchar(digits) - ngram + 1)) {
    # Pick 13 consecutive digits
    adjacent <- substr(digits, i, i + ngram - 1)
    # Determine the product of the digits
    mult <- prod(as.numeric(unlist(strsplit(adjacent, ""))))
    if (mult > answer) answer <- mult
}


Finding n-grams is a basic function of text analysis called tokenization. The Google Ngram Viewer provides the ability to search for word strings (n-grams) in the massive Google books collection. This type of data can be used to define writing styles and analyse the evolution of language.
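The same idea can be applied to words instead of digits; a small sketch in base R:

```r
# Build word bigrams (2-grams) from a sentence
words <- unlist(strsplit("to be or not to be", " "))
bigrams <- sapply(seq_len(length(words) - 1),
                  function(i) paste(words[i], words[i + 1]))
bigrams # "to be" "be or" "or not" "not to" "to be"
```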

# Euler Problem 7: 10,001st Prime

## Euler Problem 7 Definition

By listing the first six prime numbers: 2, 3, 5, 7, 11 and 13, we can see that the 6th prime is 13. What is the 10,001st prime number?

## Solution

The is.prime function determines whether a number is a prime number by checking that it is not divisible by any prime number up to the square root of the number.

The sieve of Eratosthenes (esieve) used in Euler Problem 3 can be reused to generate these prime divisors.

This problem can only be solved using brute force because prime gaps (sequence A001223 in the OEIS) do not follow a predictable pattern.

is.prime <- function(n) {
    primes <- esieve(ceiling(sqrt(n)))
    prod(n %% primes != 0) == 1
}

i <- 2 # First prime
n <- 1 # Start counter
while (n < 10001) { # Find the 10,001st prime number
    i <- i + 1 # Next number
    if (is.prime(i)) { # Test next number
        n <- n + 1 # Increment counter
        i <- i + 1 # Next prime is at least two away
    }
}
answer <- i - 1 # Undo the last increment to obtain the 10,001st prime



The largest prime gap for the first 10,001 primes is 72. Sexy primes with a gap of 6 are the most common and there are 1270 twin primes.

Prime gap frequency distribution for the first 10001 primes.

# Euler Problem 6: Sum Square Difference

## Euler Problem 6 Definition

The sum of the squares of the first ten natural numbers is:

$1^2 + 2^2 + \ldots + 10^2 = 385$

The square of the sum of the first ten natural numbers is:

$(1 + 2 + \ldots + 10)^2 = 55^2 = 3025$

The difference between the sum of the squares of the first ten natural numbers and the square of the sum is $3025 - 385 = 2640$. Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

## Solution

This is a straightforward problem for vector processing capabilities in R.

answer <- sum(1:100)^2-sum((1:100)^2)


This problem can also be solved arithmetically. When Carl Friedrich Gauss (1777–1855) was a child, his teacher challenged the class to add all the numbers from 1 to 100. While his classmates struggled to add the 100 numbers one by one, Carl completed the task in a few seconds.

The same principle applies to computing. The first solution mimics Carl’s classmates, who slavishly added all the numbers. The second solution is based on arithmetic progressions.

The sum of natural numbers can be expressed as:

$\frac{n(n + 1)}{2}$

The sum of the squares of the first n natural numbers is:

$\frac{n(n+1)(2n+1)}{6}$

With these two formulas, a fast solution can be written without using loops.

n <- 100
answer <- (n * (n + 1) / 2)^2 - n * (n + 1) * (2 * n + 1) / 6


# Euler Problem 5: Smallest Multiple

Euler Problem 5 relates to the divisibility of numbers.

## Euler Problem 5

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder. What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

## Solution

The solution will also be divisible by the numbers 1 to 10, so we can start at 2520 and increment by 2520. The loop checks whether each candidate is divisible by all the numbers 1 to 20.

# Start as high as possible
i <- 2520
# Check consecutive numbers for divisibility by 1:20
while (sum(i %% (1:20)) != 0) {
    i <- i + 2520 # Increase by the smallest number divisible by 1:10
}
answer <- i