# Data Science for Water Utilities Using R

Data science comes naturally to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications, I emphasise the importance of collaboration between utilities and argue that we should share code because we all share the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so it is time to start another project. My new book will be about Data Science for Water Utilities Using R. The book is currently no more than a collection of existing articles, code snippets and production work from my job. The cover is already finished because it motivates me to keep writing.

## Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

## Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at the data conference I spoke at in Auckland was the problem posed by a heavy reliance on spreadsheets. This chapter will explain why code is superior to spreadsheets and how to use R to achieve this advantage.

## Reservoirs

The first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from the Prettyboy Reservoir in the state of Maryland. The code below downloads and visualises this data.

```
# RESERVOIRS
library(tidyverse)
library(RColorBrewer)
library(gridExtra)

# Download the bathymetric survey or read the cached copy
if (!file.exists("Hydroinformatics/prettyboy.csv")) {
    url <- "http://www.mgs.md.gov/ReservoirDataPoints/PrettyBoy1998.dat"
    # Data starts after the header row; adjust skip to the file layout
    prettyboy <- read.csv(url, skip = 1, header = FALSE)
    names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
    write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")
} else {
    prettyboy <- read_csv("Hydroinformatics/prettyboy.csv")
}

# Remove extremes and duplicates
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)),
         which(prettyboy$Easting == max(prettyboy$Easting)),
         which(duplicated(prettyboy)))
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3],
                        brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) +
    geom_point(size = .1) + coord_equal() +
    scale_colour_gradientn(colours = bathymetry_colours)
```

Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.
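While that work is in progress, the sketch below shows one way to approximate the depth-volume curve from the survey points. It is only a placeholder: it uses a convex hull (base R's `chull` plus the shoelace formula) instead of the non-convex hull I am aiming for, so areas and volumes will be overestimated, and the slice size is an arbitrary choice.

```
# Sketch only: approximate depth-volume curve from the cleaned prettyboy data.
# A convex hull stands in for the non-convex hull, so areas are overestimated.
shoelace <- function(x, y)
    abs(sum(x * c(y[-1], y[1]) - y * c(x[-1], x[1]))) / 2

depth_volume <- function(survey, slice = 5) {
    depths <- seq(0, max(survey$Depth), by = slice)
    areas <- sapply(depths, function(d) {
        wet <- survey[survey$Depth >= d, ]
        if (nrow(wet) < 3) return(0)
        hull <- wet[chull(wet$Easting, wet$Northing), ]
        shoelace(hull$Easting, hull$Northing)
    })
    # Integrate plan area over depth (trapezoidal rule) for cumulative volume
    volume <- cumsum(c(0, (head(areas, -1) + tail(areas, -1)) / 2 * slice))
    data.frame(Depth = depths, Volume = volume)
}

plot(depth_volume(prettyboy), type = "l",
     xlab = "Depth", ylab = "Approximate volume")
```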

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

## Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.
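For example, base R's `quantile` function implements nine different percentile methods, and the choice matters when only a handful of samples is available. The snippet below compares the R default (type 7) with the Weibull method (type 6), which is often referenced for water quality compliance reporting; the trihalomethane results are invented for the illustration.

```
# Sketch only: percentile methods diverge with small samples
thm <- c(0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.45, 0.50, 0.61, 0.66) # invented results
quantile(thm, 0.95, type = 7) # R default
quantile(thm, 0.95, type = 6) # Weibull method, often used in compliance reporting
```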

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.
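As a minimal illustration of such an index, the snippet below combines three hypothetical compliance sub-indices into a single weighted score; the components, data and weights are all invented for the example.

```
# Sketch only: weighted water system index from three invented sub-indices
set.seed(314)
components <- data.frame(turbidity = runif(30, 0.8, 1),
                         chlorine  = runif(30, 0.7, 1),
                         pressure  = runif(30, 0.9, 1))
weights <- c(0.5, 0.3, 0.2) # arbitrary weighting
system_index <- round(100 * as.matrix(components) %*% weights)
plot(system_index, type = "l", xlab = "Day", ylab = "Water system index")
```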

## Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.
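As a minimal sketch of the igraph idea, the snippet below defines a hypothetical three-pipe network and checks the balance around the tank; the node names and flows are invented for the example.

```
# Sketch only: a hypothetical water network as a directed graph
library(igraph)
edges <- data.frame(from = c("Source", "Tank", "Tank"),
                    to   = c("Tank", "ZoneA", "ZoneB"),
                    flow = c(1000, 600, 350)) # invented daily flows (kL)
network <- graph_from_data_frame(edges, directed = TRUE)
plot(network, edge.label = edges$flow)
# Balance around the tank: a persistent gap suggests losses or metering error
sum(edges$flow[edges$to == "Tank"]) - sum(edges$flow[edges$from == "Tank"])
```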

Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.
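As a simple sketch of spike detection, the snippet below simulates a flow series, injects two anomalies and flags readings that deviate strongly from a rolling median; the window and threshold are arbitrary choices for illustration.

```
# Sketch only: flag spikes as large deviations from a rolling median
library(zoo)
set.seed(42)
flow <- 100 + rnorm(300, 0, 2) # simulated SCADA flow readings
flow[c(50, 180)] <- c(130, 60) # injected anomalies
smooth <- rollmedian(flow, k = 11, fill = NA)
spikes <- which(abs(flow - smooth) > 3 * mad(flow - smooth, na.rm = TRUE))
spikes
```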

## Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.

```
# CUSTOMERS
library(tidyverse)
library(gridExtra)
library(psych)

# customers: cleaned survey data frame; the first ten columns hold the
# Personal Involvement Index items

# Exploratory analysis
p1 <- customers[, 1:10] %>%
    gather() %>%
    ggplot(aes(x = key, y = value)) +
        geom_boxplot() +
        xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")
grid.arrange(p1, ncol = 2)

# Factor analysis
fap <- fa.parallel(customers[, 1:10])
customers[, 1:10] %>%
    fa(nfactors = fap$nfact, rotate = "promax") %>%
    fa.diagram(main = "Factor Analysis")
```

## Customer Complaints

Customer complaints are a gift to the business. Unfortunately, most businesses view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers.

## Customer Contacts

One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres.
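The Erlang-C formula itself is compact. The sketch below computes the probability that a caller has to wait, given the offered traffic in Erlangs and the number of agents; the call volumes in the example are invented.

```
# Sketch only: Erlang-C probability that a caller has to wait
erlang_c <- function(agents, traffic) {
    stopifnot(agents > traffic) # the queue is unstable otherwise
    top <- (traffic^agents / factorial(agents)) * (agents / (agents - traffic))
    bottom <- sum(traffic^(0:(agents - 1)) / factorial(0:(agents - 1))) + top
    top / bottom
}

# Example: 100 calls per hour with a 3-minute handling time is 5 Erlangs
erlang_c(agents = 7, traffic = 100 * 3 / 60)
```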

## Economics

Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability.
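As a first taste of the approach, the sketch below simulates a project cost from three uncertain components and reports the P50 and P80 estimates; the cost items, distributions and parameters are invented for the illustration.

```
# Sketch only: Monte Carlo cost estimate from three invented cost components
set.seed(1969)
n <- 10000
pipeline     <- rnorm(n, mean = 500000, sd = 50000)
pump_station <- rlnorm(n, meanlog = log(200000), sdlog = 0.2)
restoration  <- runif(n, min = 50000, max = 150000)
total <- pipeline + pump_station + restoration
quantile(total, c(0.5, 0.8)) # P50 and P80 estimates
hist(total / 1000, breaks = 50, main = "Simulated project cost", xlab = "Cost ($1,000)")
```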

## Data Science for Water Utilities Mind Map

This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topics.

## Data Science for Water Utilities: The next steps

I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown package in RStudio. It will take me about a year to realise this vision, as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions:

1. What is missing in this list? Any practical problems I should include?
2. Would you like to donate some data and code to include in the book?

# How Virtual Tags have transformed SCADA data analysis

Yesterday, I delivered the International Keynote at the Asset Data & Insights Conference in Auckland, New Zealand (the place where R was originally developed). My talk was about how to create value from SCADA data, using a method I developed with my colleagues called Virtual Tags. My talk started with my views on data science strategy, which I also presented to the R User Group in Melbourne. In this post, I would like to explain what Virtual Tags are and how they can be used to improve the value of SCADA data.

## SCADA Systems at Water Treatment Plants

Water treatment plants are mostly fully automated systems that use analysers and a SCADA system to collect and communicate process data. For those of you not familiar with water treatment plants, the video below gives a cute summary of the process.

Water treatment plants need sensors to measure a broad range of parameters. These instruments record data 24 hours per day to control operations. When the process operates effectively, all values fall within a very narrow band. These values are typically stored by the SCADA system for about a year, after which they are destroyed to save storage space.

Water treatment plants measure turbidity (clarity of the water) to assess the effectiveness of filtration. The code snippet below simulates the measurements from a turbidity instrument at a water treatment plant over a period of five hours. Most water quality data has a log-normal distribution with a narrow standard deviation.

```
# Simulate measured data
set.seed(1234)
n <- 300
wtp <- data.frame(DateTime = seq.POSIXt(ISOdate(1910, 1, 1), length.out = n, by = 60),
                  WTP = rlnorm(n, log(.1), .01))

library(ggplot2)
p <- ggplot(wtp, aes(x = DateTime, y = WTP)) + geom_line(colour = "grey") +
    ylim(0.09, 0.11) + ylab("Turbidity") + ggtitle("Turbidity simulation")
p
```

The data generated by the SCADA system is used to make operational decisions. The data is created and structured to make decisions in the present, not to solve problems in the future. SCADA Historian systems archive this information for future analysis. Historian systems only store a new value when it deviates from the previously stored value by more than a set percentage. This method saves storage space without sacrificing much accuracy.

For example, when an instrument reads 0.20 and the limit is set at 5%, new values are only recorded when they are below 0.19 or above 0.21. Further values are only stored when they deviate by more than 5% from the last stored value, and so on. The code snippet below simulates this behaviour with a 3% threshold, based on the simulated turbidity readings generated earlier. The Historian only stores the data points marked in black.

```
# Historise data
threshold <- 0.03
h <- 1 # First historised point
# Starting conditions
wtp$historise <- FALSE
wtp$historise[c(1, n)] <- TRUE
# Test whether the change from the last stored value exceeds the threshold
for (i in 2:nrow(wtp)) {
    delta <- wtp$WTP[i] / wtp$WTP[h]
    if (delta > (1 + threshold) | delta < (1 - threshold)) {
        wtp$historise[i] <- TRUE
        h <- i
    }
}
historian <- subset(wtp, historise == TRUE)
historian$Source <- "Historian"
p <- p + geom_point(data = historian, aes(x = DateTime, y = WTP)) +
    ggtitle("Historised data")
p
```

## Virtual Tags

This standard method of generating and storing SCADA data works fine for operating systems, but not so well when using the data for post-hoc analysis. The data in Historian is an unequally-spaced time series, which makes it harder to analyse. The Virtual Tag approach expands these unequal time series to equally-spaced ones, using constant interpolation.

The `vt` function undertakes the constant interpolation using the `approx` function. The `vt` function is applied to all DateTime values, using the historised data points. The red line shows how the value stays constant until it changes by more than the threshold. This example demonstrates that we have a steady process with some minor spikes, which is the expected outcome of this simulation.

```
# Create Virtual Tags
vt <- function(t) approx(historian$DateTime, historian$WTP, xout = t, method = "constant")
turbidity <- lapply(as.data.frame(wtp$DateTime), vt)
wtp$VirtualTag <- turbidity[[1]]$y
p + geom_line(data = wtp, aes(x = DateTime, y = VirtualTag), colour = "red") +
    ggtitle("Virtual Tags")
```

The next step in Virtual Tags is to combine the tags from different data points. For example, we are only interested in the turbidity readings when the filter was running. We can do this by combining this data with a valve status or flow in the filter.
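As a simple illustration, the snippet below adds a hypothetical flow tag to the simulated data and restricts the turbidity virtual tag to the period when the filter was running; the flow pattern is made up for the example.

```
# Sketch only: combine the turbidity virtual tag with an assumed filter flow tag
wtp$Flow <- c(rep(0, 60), rep(40, 180), rep(0, 60)) # invented flow pattern (L/s)
running <- subset(wtp, Flow > 0)
summary(running$VirtualTag) # turbidity statistics while the filter was running
```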

This approach might seem cumbersome, but it simplifies analysing data from Historian. Virtual Tags enable several analytical processes that would otherwise be hard to undertake. This system also adds context to the SCADA information by linking tags to each other and to the processes they describe. If you are interested in more detail, please download the technical manual for Virtual Tags and how they are implemented in SQL.

## The Presentation

And finally, these are the slides of my keynote presentation.

# Tic Tac Toe Part 3: The Minimax Algorithm

In two previous posts, I presented code to teach R to play the trivial game of Tic Tac Toe. I thought this was an unbeatable algorithm. Alas, a comment from Alberto shattered my pride as he was able to beat my code.

The reason for the demise of my code was that I didn't implement a full minimax algorithm, but instead looked only two moves ahead. I thought the common strategy rules (start in the centre and occupy the corners) would make the program unbeatable. When I simulated the game by instructing the computer to play against itself, Alberto's strategy never arose because the code forces the centre field. Alberto's code shows that you need to look at least three moves ahead for a perfect game. He has been so kind as to share his code and has given me permission to publish it.

Alberto recreated two functions; for completeness, I have included the complete working code that merges his improvements with my earlier work. The first two functions are identical to the previous post. These functions draw the game board and process the human player's move by waiting for a mouse click.

```
# Draw the game board
draw.board <- function(game) {
    xo <- c("X", " ", "O") # Symbols
    par(mar = rep(1, 4))
    plot.new()
    plot.window(xlim = c(0, 30), ylim = c(0, 30))
    abline(h = c(10, 20), col = "darkgrey", lwd = 4)
    abline(v = c(10, 20), col = "darkgrey", lwd = 4)
    text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15, 3), rep(5, 3)), xo[game + 2], cex = 4)
    # Identify location of any three in a row
    square <- t(matrix(game, nrow = 3))
    hor <- abs(rowSums(square))
    if (any(hor == 3))
        hor <- (4 - which(hor == 3)) * 10 - 5
    else
        hor <- 0
    ver <- abs(colSums(square))
    if (any(ver == 3))
        ver <- which(ver == 3) * 10 - 5
    else
        ver <- 0
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev))))
    # Draw winning lines
    if (all(hor > 0))
        for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col = "red")
    if (all(ver > 0))
        for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col = "red")
    if (abs(diag1) == 3)
        lines(c(2, 28), c(28, 2), lwd = 10, col = "red")
    if (abs(diag2) == 3)
        lines(c(2, 28), c(2, 28), lwd = 10, col = "red")
}

# Human player enters a move
move.human <- function(game) {
    text(4, 0, "Click on screen to move", col = "grey", cex = .7)
    empty <- which(game == 0)
    move <- 0
    while (!move %in% empty) {
        coords <- locator(n = 1) # Read the mouse click
        coords$x <- floor(abs(coords$x) / 10) + 1
        coords$y <- floor(abs(coords$y) / 10) + 1
        move <- coords$x + 3 * (3 - coords$y)
    }
    return(move)
}
```

Alberto rewrote the functions that analyse the board and determine the move of the computer. The ganador (Spanish for winner) function assesses the board condition by assigning -10 or +10 for a winning game and 0 for any other situation.

```
ganador <- function(juego, player) {
    game <- matrix(juego, nrow = 3, byrow = TRUE)
    hor <- rowSums(game)
    ver <- colSums(game)
    diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev))))
    if (-3 %in% c(hor, ver, diag))
        return(-10)
    if (3 %in% c(hor, ver, diag))
        return(10)
    else
        return(0)
}
```

The next function is the actual minimax algorithm. If the computer starts, the first move ($9! = 362880$ options to assess) takes a little while. To speed up the game, the first move can instead be forced to a random corner.

The minimax function returns a list with the move and its valuation through the ganador function. The function works recursively until it has filled the board and retains the best-scoring move using the minimax method. To avoid the computer always playing the same move in the same situation, a small amount of random noise is added to the valuations.

```
minimax <- function(juego, player) {
    free <- which(juego == 0)
    if (length(free) == 1) {
        juego[free] <- player
        return(list(move = free, U = ganador(juego, player)))
    }
    # Check for an immediate win in one move
    poss.results <- rep(0, 9)
    for (i in free) {
        game <- juego
        game[i] <- player
        poss.results[i] <- ganador(game, player)
    }
    mm <- ifelse(player == -1, "which.min", "which.max")
    if (any(poss.results == (player * 10))) {
        move <- do.call(mm, list(poss.results))
        return(list(move = move, U = poss.results[move]))
    }
    # Otherwise recurse over all free squares
    for (i in free) {
        game <- juego
        game[i] <- player
        poss.results[i] <- minimax(game, -player)$U
    }
    # Add noise to break ties and block occupied squares
    random <- runif(9, 0, 0.1)
    poss.results[-free] <- 100 * -player
    poss.results <- poss.results + (player * random)
    move <- do.call(mm, list(poss.results))
    return(list(move = move, U = poss.results[move]))
}
```

This final function stitches everything together and lets you play the game. Simply paste all functions in your R console and run them to play a game. The tic.tac.toe function can take two parameters, “human” and/or “computer”. The order of the parameters determines who starts the game.

```
# Main game engine
tic.tac.toe <- function(player1 = "human", player2 = "computer") {
    game <- rep(0, 9) # Empty board
    winner <- FALSE # Define winner
    player <- 1 # First player
    players <- c(player1, player2)
    draw.board(game)
    while (0 %in% game & !winner) { # Keep playing until win or full board
        if (players[(player + 3) %% 3] == "human") # Human player
            move <- move.human(game)
        else { # Computer player
            move <- minimax(game, player)
            move <- move$move
        }
        game[move] <- player # Change board
        draw.board(game)
        winner <- abs(ganador(game, player)) == 10 # Winner, winner, chicken dinner?
        player <- -player # Change player
    }
}

tic.tac.toe()
```

This is my last word on Tic Tac Toe but now that the minimax conundrum is solved I could start working on other similar games such as Connect Four, Draughts or even the royal game of Chess.

# Data Science from a Strategic Business Perspective

Last night I spoke at the Melbourne R User Group (MelbuRn) about data science from a strategic business perspective. It was great to see so many people attending.

My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal.

Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Acquisition). This system controls our assets and provides operators and analysts with the data they need. All this data is used to control our systems and stored for future reference.

There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information.

In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy.

# Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as data science manager for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data don’t achieve them because they don’t start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

- Utility: The analysis needs to be able to improve reality, otherwise we end up with ‘analysis-paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth, we seek improvement of reality.
- Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
- Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
- Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

## Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle to leverage the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals that write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.