Analysing Digital Water Meter Data using the Tidyverse

In last week’s article, I discussed how to simulate water consumption data to help develop analytics and reporting. This post describes how to create a diurnal curve from standard digital metering data.

Data Source

The simulated data consists  of three fields:

All analysis is undertaken in the local Australian Eastern Standard Time (AEST). The input to all functions is thus in AEST. The digital water meters send an hourly pulse at a random time within the hour. Each transmitter (RTU) uses a random offset to avoid network congestion. The digital meter counts each time the impeller makes a full turn, and for this analysis, we assume that this equates to a five-litre volume. The ratio between volume and count depends on the meter brand and type. The image below shows a typical data set for an RTU, including some missing data points.

Simulated water consumption.

Simulated water consumption (red: measured points, blue: interpolated points.

To analyse the data we need two auxiliary functions: one to slice the data we need and one to interpolate data for the times we need it. The Tidyverse heavily influences the code in this article. I like the Tidyverse way of doing things because it leads to elegant code that is easy to understand.

meter_reads <- read.csv("Hydroinformatics/DigitalMetering/meter_reads.csv")
rtu <- unique(meter_reads$DevEUI)
meter_reads$TimeStampUTC <- as.POSIXct(meter_reads$TimeStampUTC, tz = "UTC")

Slicing Digital Water Metering Data

Data analysis is undertaken on slices of the complete data set. This function slices the available data by a vector of RTU ids and a timestamp range in AEST. This function adds a new timestamp variable in AEST. If no date range is provided, all available data for the selected RTUs is provided. The output of this function is a data frame (a Tibble in Tydiverse language).

slice_reads <- function(rtus, dates = range(meter_reads$TimeStampUTC)) {
 filter(meter_reads, DevEUI %in% rtus) %>%
    mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, tz = "Australia/Melbourne"))) %>%
    filter(TimeStampAEST >= as.POSIXct(dates[1]) & 
             TimeStampAEST <= as.POSIXct(dates[2])) %>%
    arrange(DevEUI, TimeStampAEST)

Interpolation of Meter Reads

This function interpolates the cumulative counts for a series of RTUs over a vector of timestamps in AEST. The function creates a list to store the results for each RTU, interpolates the data using the approx function and then flattens the list back to a data frame. The interpolation function contains a different type of pipe because of the approx for interpolation function does not take a data argument. The %$% pipe from the Magrittr package solves that problem.

The output is a data frame with DevEUI, the timestamp in AEST and the interpolated cumulative count. The image above shows the counts for two meters over two days an the graph superimposes an interpolated point over the raw data. Although the actual data consists of integer counts, interpolated values are numeric values. The decimals are retained to distinguish them from real reads.

interpolate_count <- function(rtus, timestamps) {
  timestamps <- as.POSIXct(timestamps, tz = "Australia/Melbourne")
  results <- vector("list", length(rtus))
  for (r in seq_along(rtus)) {
    interp <- slice_reads(rtus[r]) %$%
      approx(TimeStampAEST, Count, timestamps)
    results[[r]] <- data_frame(DevEUI = rep(rtus[r], length(timestamps)), TimeStampAEST = timestamps, Count = interp$y) 
  return(, results)) 

interpolate_count(rtu[2:3], seq.POSIXt(as.POSIXct("2020-02-01"), as.POSIXct("2020-02-2"), by = "day")) 

slice_reads(rtu[2], c("2020-02-06", "2020-02-08")) %>%
  ggplot(aes(x = TimeStampAEST, y = Count))  + 
  geom_line(col = "grey", size = 1) + 
    geom_point(col = "red") + 
    geom_point(data = interpolate_count(rtu[2], as.POSIXct("2020-02-06") + (0:2)*24*3600), colour = "blue") + 
    ggtitle(paste("DevEUI", rtu[2]))

With these two auxiliary functions, we can start analysing the data.

Daily Consumption

Daily consumption for each connection is a critical metric in managing water resources and billing customers. The daily consumption of any water connection is defined by the difference between the cumulative counts at midnight. The interpolation function makes it easy to determine daily consumption. This function interpolates the midnight reads for each of the RTUs over the period, starting the previous day. The output of the function is a data frame that can be piped into the plotting function to visualise the data. When you group the data by date, you can also determine the total consumption over a group of services.

daily_consumption <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(min(dates)) - 24 * 3600, as.POSIXct(max(dates)), by = "day") 
  interpolate_count(rtus, timestamps) %>%
    group_by(DevEUI) %>%
    mutate(Consumption = c(0, diff(Count)) * 5,
           Date = format(TimeStampAEST, "%F")) %>%
    filter(TimeStampAEST != timestamps[1]) %>%
    select(DevEUI, Date, Consumption)

daily_consumption(rtu[32:33], c("2020-02-01", "2020-02-7")) %>%
  ggplot(aes(x = Date, y = Consumption)) + geom_col() + 
  facet_wrap(~DevEUI) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Analysing digital water meter data: Daily consumption.

Analysing digital water meter data: Daily consumption.

Diurnal Curves

The diurnal curve is one of the most important pieces of information used in the design of water supply systems. This curve shows the usage of one or more services for each hour in the day. This curve is a reflection of human behaviour, as we use most water in the morning and the evenings.

This function slices data for a vector of RTUs over a period and then plots the average diurnal curve. The data is obtained by interpolating the cumulative counts for each whole hour in the period. The function then calculates the flow in litres per hour and visualises the minimum, mean and maximum value.

plot_diurnal_connections <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour") 
  interpolate_count(rtus, timestamps) %>% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>% 
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %>%
    ggplot(aes(x = Hour, ymin = min, ymax = max)) + 
      geom_ribbon(fill = "lightblue", alpha = 0.5) + 
      geom_line(aes(x = Hour, y = mean), col = "orange", size = 1) +
      ggtitle("Connections Diurnal flow") + ylab("Flow rate [L/h]")

plot_diurnal_connections(rtu[12:20], c("2020-02-01", "2020-03-01"))
Analysing digital water meter data: Diurnal curve.

Analysing digital water meter data: Diurnal curve.

Boxplots are also an informative way to visualise this curve. This method provides more statistical information on one page, and the ggplot function performs the statistical analysis.

plot_diurnal_box <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour") 
  interpolate_count(rtus, timestamps) %>% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>% 
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    ggplot(aes(x = factor(Hour), y = Rate)) + 
      geom_boxplot() + 
      ggtitle("Diurnal flow") + ylab("Flow rate [L/h]") + xlab("Time")

plot_diurnal_box(rtu[12:20], c("2020-02-01", "2020-03-01"))
Analysing digital water meter data: Diurnal curve.

Analysing digital water meter data: Diurnal curve.

Further Analysing Digital Water Metering Data

These are only glimpses into what is possible with this type of data. Further algorithms need to be developed to extract additional value from this data. I am working on developing leak detection algorithms and clustering diurnal curves, daily consumption graphs and so on. Any data science enthusiast who is interested in helping me to develop an Open Source R library to analyse digital metering data.

The code for this article is available on GitHub.


Simulating Water Consumption to Develop Analysis and Reporting

I am currently working on developing analytics for a digital water metering project. Over the next five years, we are enabling 70,000 customer water meters with digital readers and transmitters. The data is not yet available but we don’t want to wait to build reporting systems until after the data is live. The R language comes to the rescue as it has magnificent capabilities to simulate data. Simulating data is a useful technique to progress a project when data is being collected. Simulated data also helps because the outcomes of the analysis are known, which helps to validate the outcomes.

The raw data that we will eventually receive from the digital customer meters has the following basic structure:

  • DevEUI: Unique device identifier.
  • Timestamp: Date and time in (UTC) of the transmission.
  • Cumulative count: The number of revolutions the water meter makes. Each revolution is a pulse which equates to five litres of water.

Every device will send an hourly data burst which contains the cumulative meter read in pulse counts. The transmitters are set at a random offset from the whole our, to minimise the risk of congestion at the receivers. The time stamp for each read is set in the Coordinated Universal Time (UTC). Using this time zone prevents issues with daylight savings. All analysis will be undertaken in the Australian Eastern (Daylight) Time zone.

This article explains how we simulated test data to assist with developing reporting and analysis. The analysis of digital metering data follows in a future post. The code and the data can be found on GitHub. I have recently converted to using the Tidyverse for all my R coding. It has made my working life much easier and I will use it for all future posts.

Simulating water consumption

For simplicity, this simulation assumes a standard domestic diurnal curve (average daily usage pattern) for indoor water use. Diurnal curves are an important piece of information in water management. The curve shows water consumption over the course of a day, averaged over a fixed period. The example below is sourced from a journal article. This generic diurnal curve consists of 24 data points based on measured indoor water consumption, shown in the graph below.

Simulating water consumption: diurnal curve example

Source: Gurung et al. (2014) Smart meters for enhanced water supply network modelling and infrastructure planning. Resources, Conservation and Recycling (90), 34-50.

This diurnal curve only includes indoor water consumption and is assumed to be independent of seasonal variation. This is not a realistic assumption, but the purpose of this simulation is not to accurately model water consumption but to provide a data set to validate the reporting and analyses.

Simulating water consumption in R

The first code snippet sets the parameters used in this simulation. The unique device identifiers (DevEUI) are simulated as six-digit random numbers. The timestamps vector consists of hourly date-time variables in UTC. For each individual transmitter, this timestamp is offset by a random time. Each transmitter is also associated with the number of people living in each house. This number is based on a Poisson distribution.

# Libraries
# Boundary conditions
n <- 100 # Number of simulated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC") # Start of simulation

set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # 6-digit id
offset <- sample(0:3599, n, replace = TRUE) # Unique Random offset for each RTU

# Number of occupants per connection
occupants <- rpois(n, 1.5) + 1 %>%
  ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) + 
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")
Simulated number of occupants per connection.

Simulated number of occupants per connection.

The diurnal curve is based on actual data which includes leaks as the night time use shows a consistent flow of about one litre per hour. For that reason, the figures are rounded and reduced by one litre per hour, to show a zero flow when people are usually asleep. The curve is also shifted by eleven hours because the raw data is stored in UTC.

diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1 

data.frame(TimeUTC = 0:23, Flow = diurnal) %>% 
  ggplot(aes(x = TimeUTC, y = Flow)) + 
  geom_area(fill = "dodgerblue2", alpha = 0.5) +
  scale_x_continuous(breaks = 0:23) + ylab("Flow [L/h/p]") + 
  ggtitle("Idealised diurnal curve for households")
ggsave("Hydroinformatics/DigitalMetering/diurnal_curve.png", dpi = 300)

tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])

This simulation only aims to simulate a realistic data set and not to present an accurate depiction of reality. This simulation could be enhanced by using different diurnal curves for various customer segments and to include outdoor watering, temperature dependencies and so on.

Simulating Water Consumption

A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (θ = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defined by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal deviates from the model curve by ±10%. Any predetermined leakage is added to each meter read over the whole period of 100 days. The hourly volumes are summed cumulatively to simulate meter reads. The flow is divided by five as each meter revolution indicate five litres.

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.

# Leak simulation
leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE) data.frame(DevEUI = rtu, Leak = leaks) %>%
  subset(Leak > 0)

# Digital metering data simulation
meter_reads <- matrix(ncol = 3, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC", "Count")

for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
  meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
  meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) * 
                                     runif(24 * d, 0.9, 1.1))/5)

meter_reads <- meter_reads %>% 
  as_data_frame() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz ="UTC"))

Missing Data Points

The data transmission process is not 100% reliable and the base station will not receive some reads. This simulation identifies reads to be removed from the data through the temporary variable remove. This simulation includes two types of failures:

  • Faulty RTUs (2% of RTUs with missing 95% of data)
  • Randomly missing data points (1% of data)
# Initialise temp variable
meter_reads <- mutate(meter_reads, remove = 0)
# Define faulty RTUs (2% of fleet)
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <- rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = .95)

# Data loss
missing <- sample(1:(nrow(meter_reads) - 5), 0.005 * nrow(meter_reads))
for (m in missing){
  meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1

# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%

filter(meter_reads, DevEUI %in% rtu[2]) %>%
  mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, 
                                           tz = "Australia/Melbourne"))) %>%
  filter(TimeStampAEST >= as.POSIXct("2020-02-06") & 
         TimeStampAEST <= as.POSIXct("2020-02-08")) %>%
  arrange(DevEUI, TimeStampAEST) %>% 
  ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI)))  + 
    geom_line() + geom_point() 

The graph shows an example of the cumulative reads and some missing data points.

Simulated water consumption


Analysing Digital Metering Data

Data simulation is a good way to develop your analysis algorithms before you have real data. I have also used this technique when I was waiting for survey results during my dissertation. When the data finally arrived, I simply had to plug it into the code and finetune the code. R has great capabilities to simulate reality to help you understand the data.

In next week’s article, I will outline how I used R and the Tidyverse package to develop libraries to analyse digital metering data.

Analysing soil moisture data in NetCDF format with the ncdf4 library

The netCDF format is popular in sciences that analyse sequential spatial data. It is a self-describing, machine-independent data format for creating, accessing and sharing array-oriented information. The netCDF format provides spatial time-series such as meteorological or environmental data. This article shows how to visualise and analyse this data format by reviewing soil moisture data published by the Australian Bureau of Statistics.

Soil Moisture data

The Australian Bureau of Meteorology publishes hydrological data in both a simple map grid and in the NetCDF format. The map grid consists of a flat text file that requires a bit of data jujitsu before it can be used. The NetCDF format is much easier to deploy as it provides a three-dimensional matrix of spatial data over time.

We are looking at the possible relationship between sewer main blockages and deep soil moisture levels. You will need to manually download this dataset from the Bureau of Meteorology website. I have not been able to scrape the website automatically. For this analysis, I use the actual deep soil moisture level, aggregated monthly in NetCDF 4 format.

Soil moisture data from the Australian Bureau of meteorology in netCDF format

Reading, Extracting and Transforming the netCDF format

The ncdf4 library, developed by David W. Pierce, provides the necessary functionality to manage this data. The first step is to load the data, extract the relevant information and transform the data for visualisation and analysis. When the data is read, it essentially forms a complex list that contains the metadata and the measurements.

The ncvar_get function extracts the data from the list. The lon, lat and dates variables are the dimensions of the moisture data. The time data is stored as the number of days since 1 January 1900. The spatial coordinates are stored in decimal degrees with 0.05-decimal degree intervals. The moisture data is a three-dimensional matrix with longitue, latitude and time as dimensions. Storing this data in this way will make it very easy to use.

# Load data
bom <- nc_open("Hydroinformatics/SoilMoisture/")
print(bom) # Inspect the data

# Extract data
lon <- ncvar_get(bom, "longitude")
lat <- ncvar_get(bom, "latitude")
dates <- as.Date("1900-01-01") + ncvar_get(bom, "time")
moisture <- ncvar_get(bom, "sd_pct")
dimnames(moisture) <- list(lon, lat, dates)

Visualising the data

The first step is to check the overall data. This first code snippet extracts a matrix from the cube for 31 July 2017 and plots it. This code pipe extracts the date for the end of July 2017 and creates a data frame which is passed to ggplot for visualisation. Although I use the Tidyverse, I still need reshape2 because the gather function does not like matrices.


d <- "2017-07-31"
m <- moisture[, , which(dates == d)] %>%
       melt(varnames = c("lon", "lat")) %>%

ggplot(m, aes(x = lon, y = lat, fill = value)) + borders("world") + 
    geom_tile() + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
    subtitle = format(as.Date(d), "%d %B %Y")) + 
    xlim(range(lon)) + ylim(range(lat)) + coord_fixed()

Deep soil moisture: Source Bureau of Meteorology, Australia

With the ggmap package we can create a nice map of a local area.

loc <- round(geocode("Bendigo") / 0.05) * 0.05 
map_tile <- get_map(loc, zoom = 12, color = "bw") %>% 

map_tile + 
    geom_tile(data = m, aes(x = lon, y = lat, fill = value), alpha = 0.8) + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
        subtitle = format(as.Date(d), "%d %B %Y"))

Analysing the data

For my analysis, I am interested in the time series of moisture data for a specific point on the map. The previous code slices the data horizontally over time. To create a time series we can pierce through the data for a specific coordinate. The purpose of this time series is to investigate the relationship between sewer main blockages and deep soil data, which can be a topic for a future post.

mt <- data.frame(date = dates, 
                 dp = moisture[as.character(loc$lon), as.character(loc$lat), ])
ggplot(mt, aes(x = date, y = dp)) + geom_line() + 
    labs(x = "Month",
         y = "Moisture",
         title = "Total moisture in deep soil layer (100-500 cm)",
         subtitle = paste(as.character(loc), collapse = ", "))

The latest version of this code is available on my GitHub repository.

Deep soil moisture time series.

Visualising Water Consumption using a Geographic Bubble Chart

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create a map of a sample of their water consumption data. In this post, I share this little ditty to explain how to plot a bubble chart over a map using the

In this post, I share this little ditty to explain how to plot a bubble chart over a map using the ggmap package.

Load and Explore the Data

The sample data contains a list of just over 100 readings from water meters in the city of Việt Trì in Vietnam, plus their geospatial location. This data uses the World Geodetic System of 1984 (WGS84), which is compatible with Google Maps and similar systems.

# Load the data
water <- read.csv("PhuTho/MeterReads.csv")
water$Consumption <- water$read_new - water$read_old

# Summarise the data

The consumption at each connection is between 0 and 529 cubic metres, with a mean consumption of 23.45 cubic metres.

Visualise the data with a geographic bubble chart

With the ggmap extension of the ggplot package, we can visualise any spatial data set on a map. The only condition is that the spatial coordinates are in the WGS84 datum. The ggmap package adds a geographical layer to ggplot by adding a Google Maps or Open Street Map canvas.

The first step is to download the map canvas. To do this, you need to know the centre coordinates and the zoom factor. To determine the perfect zoon factor requires some trial and error. The ggmap package provides for various map types, which are described in detail in the documentation.

# Load map library

# Find the middle of the points
centre <- c(mean(range(water$lon)), mean(range(water$lat)))

# Download the satellite image
viettri <- get_map(centre, zoom = 17, maptype = "hybrid")
g <- ggmap(viettri)

The ggmap package follows the same conventions as ggplot. We first call the map layer and then add any required geom. The point geom creates a nice bubble chart when used in combination with the scale_size_area option. This option scales the points to a maximum size so that they are easily visible. The transparency (alpha) minimises problems with overplotting. This last code snippet plots the map with water consumption.

# Add the points
g + geom_point(data = reads, aes(x = lon, y = lat, size = Consumption), 
        shape = 21, colour = "dodgerblue4", fill = "dodgerblue", alpha = .5) +
    scale_size_area(max_size = 20) + 
    # Size of the biggest point 
    ggtitle("Việt Trì sự tiêu thụ nước")

You can find the code and data for this article on my GitHub repository. With thanks to Ms Quy and Mr Tuyen of Phu Tho water for their permission to use this data.

This map visualises water consumption in the targeted area of Việt Trì. The larger the bubble, the larger the consumption. It is no surprise that two commercial customers used the most water. Ggplot automatically adds the legend for the consumption variable.

Geographic Bubble Chart: Visualising Water Consumption in Vietnam

Data Science for Water Utilities Using R

Data science comes natural to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications I emphasise the importance of collaboration between utilities and that we should share code because we are all sharing the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so time to start another project. My new book will be about Data Science for Water Utilities Using R. This book is currently not more than a collection of existing articles, code snippets and production work from my job. The cover is finished because it motivates me to keep writing.

This article describes my proposed chapter structure with some example code snippets. The most recent version of this code can be found on my GitHub repository. Feel free to leave a comment at the bottom of this article if you like to see additional problems discussed, or if you want to participate by sharing code.

Data Science for Water Utilities


Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at a data conference in Auckland I spoke at was the problems posed by the high reliance on spreadsheets. This chapter will explain why code is superior and how to use R to achieve this advantage.


This first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from Pretyboy Reservoir in the state of Mayne. The code below downloads and visualises this data.


# Read data
if (!file.exists("Hydroinformatics/prettyboy.csv")) {
    url <- ""
    prettyboy <- read.csv(url, skip = 2, header = FALSE)
    names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
    write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")
} else prettyboy <- read_csv("Hydroinformatics/prettyboy.csv")

# Remove extremes, duplicates and Anomaly
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)), 
         which(prettyboy$Easting == max(prettyboy$Easting)),
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3], 
                        brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) + 
    geom_point(size = .1) + coord_equal() + 
    scale_colour_gradientn(colors = bathymetry_colours) 
Data Science for Water Utilities: Prettyboy reseroir

Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.

Water system performance index.

Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.


Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.

Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.


# Read data
customers <- read_csv("Hydroinformatics/customers.csv")

# Exploratory Analyis
p1 <- customers[,1:10] %>% 
    gather %>% 
    ggplot(aes(x = key, y = value)) + 
        geom_boxplot() + 
        xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")

# Factor analysis
fap <- fa.parallel(customers[,1:10]) grid.arrange(p1, ncol= 2) customers[,1:10] %>%
    fa(nfactors = fap$nfact, rotate = "promax") %>% 
    fa.diagram(main = "Factor Analysis")

Personal Involvement Index - Data Science for Water Utilities

Customer Complaints

Customer complaints are a gift to the business. Unfortunately, most business view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers.

Customer Contacts

One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres.


Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability.

Data Science for Water Utilities Mind Map

This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topic.

Data Science for Water Utilities: The next steps

I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown system in R-Studio. It will take me about a year to realise this vision as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions:

  1. What is missing in this list? Any practical problems I should include?
  2. Would you like to donate some data and code to include in the book?

Feel free to leave a comment below.

How Virtual Tags have transformed SCADA data analysis

Yesterday, I delivered the International Keynote at the Asset Data & Insights Conference in Auckland, New Zealand (the place where R was originally developed). My talk was about how to create value from SCADA data, using a method I developed with my colleagues called Virtual Tags. My talk started with my views on data science strategy, which I also presented to the R User Group in Melbourne. In this post, I like to explain what Virtual Tags are, and how they can be used to improve the value of SCADA data.

Asset Data & Insights Conference

SCADA Systems at Water Treatment Plants

Water treatment plants are mostly fully automated systems, using analysers and the SCADA system to communicate this data. For those of you not familiar with water treatment plants, this video below gives a cute summary of the process.

Water treatment plants need sensors to measure a broad range of parameters. These instruments record data 24 hours per day to control operations. When the process operates effectively, all values fall within a very narrow band. All these values are stored by the SCADA system for typically a year, after which they are destroyed to save storage space.

Water treatment plants measure turbidity (clarity of the water) to assess the effectiveness of filtration. The code snippet below simulates the measurements from a turbidity instrument at a water treatment plant over five hours. The code simulates measurements from a turbidity instrument at a water treatment plant over a period of five hours. Most water quality data has a log-normal distribution with a narrow standard deviation.

# Simulate measured data
n <- 300
wtp <- data.frame(DateTime = seq.POSIXt(ISOdate(1910, 1, 1), length.out = n, by = 60),
                  WTP = rlnorm(n, log(.1), .01))
p <- ggplot(wtp, aes(x = DateTime, y = WTP)) + geom_line(colour = "grey") + 
    ylim(0.09, 0.11) + ylab("Turbidity") + ggtitle("Turbidity simulation")

SCADA data simulation

SCADA Historian

The data generated by the SCADA system is used to take operational decisions. The data is created and structured to make decisions in the present, not to solve problems in the future. SCADA Historian systems archive this information for future analysis. Historian systems only store new values when the new reading is more or less than a certain percentage than the previous one. This method saves storage space without sacrificing much accuracy.

For example, when an instrument reads 0.20 and the limit is set at 5%, new values are only recorded when they are below 0.19 or above 0.21. Any further values are stored when they deviate 5% from the new value, and so on. The code snippet below simulates this behaviour, based on the simulated turbidity readings generated earlier. This Historian only stores the data points marked in black.

# Historise data
threshold <- 0.03
h <- 1 # First historised point
# Starting conditions
wtp$historise <- FALSE
wtp$historise[c(1, n)] <- TRUE
# Testing for delta <> threshold
for (i in 2:nrow(wtp)) {
    delta <- wtp$WTP[i] / wtp$WTP[h] if (delta > (1 + threshold) | delta < (1 - threshold)) {
        wtp$historise[i] <- TRUE
        h <- i
historian <- subset(wtp, historise == TRUE)
historian$Source <- "Historian"
p <- p + geom_point(data = historian, aes(x = DateTime, y = WTP)) + ggtitle("Historised data")

SCADA historian simulation

Virtual Tags

This standard method to generate and store SCADA data works fine to operate systems but does not work so well when using the data for posthoc analysis. The data in Historian is an unequally-spaced time series which makes it harder to analyse the data. The Virtual Tag approach expands these unequal time series to an equally-spaced one, using constant interpolation.

The vt function undertakes the constant interpolation using the approx function. The functionvt is applied to all the DateTime values, using the historised data points. The red line shows how the value is constant until it jumps by more than 5%. This example demonstrates that we have a steady process with some minor spikes, which is the expected outcome of this simulation.

# Create Virtual Ttags
vt <- function(t) approx(historian$DateTime, historian$WTP, xout = t, method = "constant")
turbidity <- lapply($DateTime), vt)
wtp$VirtualTag <- turbidity[[1]]$y
p + geom_line(data = wtp, aes(x = DateTime, y = VirtualTag), colour = "red") + ggtitle("Virtual Tags")

Example of Virtual Tags for SCADAYou can find the latest version of this code on GitHub.

The next step in Virtual Tags is to combine the tags from different data points. For example, we are only interested in the turbidity readings when the filter was running. We can do this by combining this data with a valve status or flow in the filter.

This approach might seem cumbersome but it simplifies analysing data from Historian. Virtual Tags enables several analytical processes that would otherwise be hard to. This system also adds context to the SCADA information by linking tags to each other and the processes they describe. If you are interested in more detail, then please download the technical manual for Virtual Tags and how they are implemented in SQL.

The Presentation

And finally, these are the slides of my keynote presentation.

Data Science from a Strategic Business Perspective

Data science from a strategic business perspective - The MelbuRn R user group.Last night I spoke at the Melbourne R User Group (MelbuRn) about data science from a strategic business perspective. It was great to see so many people attending.

My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal.

Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Aquisition). This system controls our assets and provides operators and analysts with the needed data. All this data is used to control our systems and stored for future reference.

There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information.

In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy,

Data Science Strategy Presentation

Percentile Calculations in Water Quality Regulations

Percentile Calculations in Water Quality RegulationsPercentile calculations can be more tricky than at first meets the eye. A percentile indicates the value below which a percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceivingly simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile.

set.seed(1969) <- rnorm(n = 10000, mean = 100, sd = 15)
ggplot(, aes( + 
 geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") + 
 geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), colour = "red", size = 1) + 
 geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), 
 colour = "red", fill="red", alpha = 0.5, xlim = quantile(, c(0.5, 0.75))) + 
 theme(text = element_text(size = 16))

The quantile default function and the 95th percentile give the following results:

> quantile(
       0%       25%       50%       75%      100% 
 39.91964  89.68041 100.16437 110.01910 153.50195 

> quantile(, probs=0.95)

Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the normal distribution simulated above the outcome for all methods is exactly the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(, 0.95, type = m))

     95%      95%      95%      95%      95%      95%      95%      95%      95% 
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 

Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name \hat{Q}_8(p). Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6).

weibull.quantile <- function(x, p) {
    # Order Samples from large to small
    x <- x[order(x, decreasing = FALSE)]
    # Determine ranking of percentile according to Weibull (1939)
    r <- p * (length(x) + 1)
    # Linear interpolation
    rfrac <- (r - floor(r))
    return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1])

Turbidity Data Example

Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The range of weekly turbidity measurements is between 0.,05 NTU and 0.8 NTU, well below the aesthetic limits.

Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53).

When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result.


The graph and the table were created with the following code snippet:

ggplot(turbidity, aes(Result)) + 
 geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + 
 facet_wrap(~Zone) + 

tapply(turbidity$Result, turbidity$Zone, 
 function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))

You can view the code on GitHub.

Lifting the Big Data Veil: Data Science Strategy for Water Utilities

Data Science Strategy for water utilities: The Data Science Venn Diagram (Conway, 2010).

The Data Science Venn Diagram (Conway, 2010).

In my job as manager data science for a medium-sized water utility in Australia, I have developed a strategy to increased the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data don’t achieve those because they don’t start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know and to improve data quality.

Good data science needs to comply with these four basic principles:

  • Utility: The analysis needs to be able to improve reality, otherwise we end with ‘analysis-paralysis‘. Although we speak of data science, it is really data engineering because we are not seeking the truth, we seek improvement of reality.
  • Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
  • Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
  • Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle leveraging the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals that write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.

SCADA spikes in Water Treatment Data

Turbid waterSCADA spikes are events in the data stream of water treatment plants or similar installations. These SCADA spikes can indicate problems with the process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure for the cloudiness of a fluid because of large numbers of individual particles otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language.


To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing. In addition, the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU. The image below visualises the simulated data. The next step is to assess this data in accordance with the decision rule.

# Simulate data
turbidity <- data.frame(DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"), 
                        by = "min", length.out=24*60),
                        Turbidity = rnorm(n = 24*60, mean = 0.1, sd = 0.01)

The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
   time <- sample(turbidity$DateTime, 1)
   duration <- sample(10:30, 1)
   value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
   start <- which(turbidity$DateTime == time)
   turbidity$Turbidity[start:(start+duration - 1)] <- rnorm(duration, value, value/10)

The image below visualises the simulated data using the mighty ggplot. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

ggplot(turbidity, aes(x = DateTime, y = Turbidity)) + geom_line(size = 0.2) + 
   geom_hline(yintercept = 0.5, col = "red") + ylim(0,max(turbidity$Turbidity)) + 
   ggtitle("Simulated SCADA data")

Simulated SCADA data with spikes

SCADA Spikes Detection

The following code searches for all spikes over 0.50 NTU using the run length function. This function transforms a vector into a vector of values and lengths. For example, the run length of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

  • lengths: int [1:5] 2 3 4 2 1
  • values : num [1:5] 1 2 3 5 6

The value 1 has a length of 1, the value 2 has a length of 3 and so on. The spike detection code creates the run length for turbidity levels greater than 0.5, which results in a boolean vector. The cumsum function calculates the starting point of each spike which allows us to calculate their duration.

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other three spikes are either lower than 0.50 NTU. The first high spike lasted less than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
 runlength <- rle(Value > Height)
 spikes <- data.frame(Spike = runlength$values,
 times <- cumsum(runlength$lengths))
 spikes$Times <- DateTime[spikes$times]
 spikes$Event <- c(0,spikes$Times[-1] - spikes$Times[-nrow(spikes)])
 spikes <- subset(spikes, Spike == TRUE & Event > Duration)
spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)

This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

View this code on GitHub.