# Analysing soil moisture data in NetCDF format with the ncdf4 library

The netCDF format is popular in sciences that analyse sequential spatial data. It is a self-describing, machine-independent data format for creating, accessing and sharing array-oriented information. The netCDF format provides spatial time-series such as meteorological or environmental data. This article shows how to visualise and analyse this data format by reviewing soil moisture data published by the Australian Bureau of Statistics.

## Soil Moisture data

The Australian Bureau of Meteorology publishes hydrological data in both a simple map grid and in the NetCDF format. The map grid consists of a flat text file that requires a bit of data jujitsu before it can be used. The NetCDF format is much easier to deploy as it provides a three-dimensional matrix of spatial data over time.

We are looking at the possible relationship between sewer main blockages and deep soil moisture levels. You will need to manually download this dataset from the Bureau of Meteorology website. I have not been able to scrape the website automatically. For this analysis, I use the actual deep soil moisture level, aggregated monthly in NetCDF 4 format.

## Reading, Extracting and Transforming the netCDF format

The ncdf4 library, developed by David W. Pierce, provides the necessary functionality to manage this data. The first step is to load the data, extract the relevant information and transform the data for visualisation and analysis. When the data is read, it essentially forms a complex list that contains the metadata and the measurements.

The ncvar_get function extracts the data from the list. The lon, lat and dates variables are the dimensions of the moisture data. The time data is stored as the number of days since 1 January 1900. The spatial coordinates are stored in decimal degrees with 0.05-decimal degree intervals. The moisture data is a three-dimensional matrix with longitue, latitude and time as dimensions. Storing this data in this way will make it very easy to use.

library(ncdf4)
bom <- nc_open("Hydroinformatics/SoilMoisture/sd_pct_Actual_month.nc")
print(bom) # Inspect the data

# Extract data
lon <- ncvar_get(bom, "longitude")
lat <- ncvar_get(bom, "latitude")
dates <- as.Date("1900-01-01") + ncvar_get(bom, "time")
moisture <- ncvar_get(bom, "sd_pct")
dimnames(moisture) <- list(lon, lat, dates)


## Visualising the data

The first step is to check the overall data. This first code snippet extracts a matrix from the cube for 31 July 2017 and plots it. This code pipe extracts the date for the end of July 2017 and creates a data frame which is passed to ggplot for visualisation. Although I use the Tidyverse, I still need reshape2 because the gather function does not like matrices.

library(tidyverse)
library(RColorBrewer)
library(reshape2)

d <- "2017-07-31"
m <- moisture[, , which(dates == d)] %>%
melt(varnames = c("lon", "lat")) %>%
subset(!is.na(value))

ggplot(m, aes(x = lon, y = lat, fill = value)) + borders("world") +
geom_tile() +
labs(title = "Total moisture in deep soil layer (100-500 cm)",
subtitle = format(as.Date(d), "%d %B %Y")) +
xlim(range(lon)) + ylim(range(lat)) + coord_fixed()


With the ggmap package we can create a nice map of a local area.

library(ggmap)
loc <- round(geocode("Bendigo") / 0.05) * 0.05
map_tile <- get_map(loc, zoom = 12, color = "bw") %>%
ggmap()

map_tile +
geom_tile(data = m, aes(x = lon, y = lat, fill = value), alpha = 0.8) +
labs(title = "Total moisture in deep soil layer (100-500 cm)",
subtitle = format(as.Date(d), "%d %B %Y"))


## Analysing the data

For my analysis, I am interested in the time series of moisture data for a specific point on the map. The previous code slices the data horizontally over time. To create a time series we can pierce through the data for a specific coordinate. The purpose of this time series is to investigate the relationship between sewer main blockages and deep soil data, which can be a topic for a future post.

mt <- data.frame(date = dates,
dp = moisture[as.character(loc$lon), as.character(loc$lat), ])
ggplot(mt, aes(x = date, y = dp)) + geom_line() +
labs(x = "Month",
y = "Moisture",
title = "Total moisture in deep soil layer (100-500 cm)",
subtitle = paste(as.character(loc), collapse = ", "))


The latest version of this code is available on my GitHub repository.

# Visualising Water Consumption using a Geographic Bubble Chart

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create a map of a sample of their water consumption data. In this post, I share this little ditty to explain how to plot a bubble chart over a map using the

In this post, I share this little ditty to explain how to plot a bubble chart over a map using the ggmap package.

## Load and Explore the Data

The sample data contains a list of just over 100 readings from water meters in the city of Việt Trì in Vietnam, plus their geospatial location. This data uses the World Geodetic System of 1984 (WGS84), which is compatible with Google Maps and similar systems.

# Load the data
water$Consumption <- water$read_new - water$read_old # Summarise the data head(water) summary(water$Consumption)


The consumption at each connection is between 0 and 529 cubic metres, with a mean consumption of 23.45 cubic metres.

## Visualise the data with a geographic bubble chart

With the ggmap extension of the ggplot package, we can visualise any spatial data set on a map. The only condition is that the spatial coordinates are in the WGS84 datum. The ggmap package adds a geographical layer to ggplot by adding a Google Maps or Open Street Map canvas.

The first step is to download the map canvas. To do this, you need to know the centre coordinates and the zoom factor. To determine the perfect zoon factor requires some trial and error. The ggmap package provides for various map types, which are described in detail in the documentation.

# Load map library
library(ggmap)

# Find the middle of the points
centre <- c(mean(range(water$lon)), mean(range(water$lat)))

viettri <- get_map(centre, zoom = 17, maptype = "hybrid")
g <- ggmap(viettri)


The ggmap package follows the same conventions as ggplot. We first call the map layer and then add any required geom. The point geom creates a nice bubble chart when used in combination with the scale_size_area option. This option scales the points to a maximum size so that they are easily visible. The transparency (alpha) minimises problems with overplotting. This last code snippet plots the map with water consumption.

# Add the points
g + geom_point(data = reads, aes(x = lon, y = lat, size = Consumption),
shape = 21, colour = "dodgerblue4", fill = "dodgerblue", alpha = .5) +
scale_size_area(max_size = 20) +
# Size of the biggest point
ggtitle("Việt Trì sự tiêu thụ nước")


You can find the code and data for this article on my GitHub repository. With thanks to Ms Quy and Mr Tuyen of Phu Tho water for their permission to use this data.

This map visualises water consumption in the targeted area of Việt Trì. The larger the bubble, the larger the consumption. It is no surprise that two commercial customers used the most water. Ggplot automatically adds the legend for the consumption variable.

# Data Science for Water Utilities Using R

Data science comes natural to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications I emphasise the importance of collaboration between utilities and that we should share code because we are all sharing the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so time to start another project. My new book will be about Data Science for Water Utilities Using R. This book is currently not more than a collection of existing articles, code snippets and production work from my job. The cover is finished because it motivates me to keep writing.

## Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

## Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at a data conference in Auckland I spoke at was the problems posed by the high reliance on spreadsheets. This chapter will explain why code is superior and how to use R to achieve this advantage.

## Reservoirs

This first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from Pretyboy Reservoir in the state of Mayne. The code below downloads and visualises this data.

# RESERVOIRS
library(tidyverse)
library(RColorBrewer)
library(gridExtra)

if (!file.exists("Hydroinformatics/prettyboy.csv")) {
url <- "http://www.mgs.md.gov/ReservoirDataPoints/PrettyBoy1998.dat"
names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")

# Remove extremes, duplicates and Anomaly
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)),
which(prettyboy$Easting == max(prettyboy$Easting)),
which(duplicated(prettyboy)))
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3],
brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) +
geom_point(size = .1) + coord_equal() +


Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

## Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.

## Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.

Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.

## Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.

# CUSTOMERS
library(psych)

# Exploratory Analyis
p1 <- customers[,1:10] %>%
gather %>%
ggplot(aes(x = key, y = value)) +
geom_boxplot() +
xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")

# Factor analysis
fap <- fa.parallel(customers[,1:10]) grid.arrange(p1, ncol= 2) customers[,1:10] %>%
fa(nfactors = fap$nfact, rotate = "promax") %>% fa.diagram(main = "Factor Analysis")  ## Customer Complaints Customer complaints are a gift to the business. Unfortunately, most business view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers. ## Customer Contacts One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres. ## Economics Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability. ## Data Science for Water Utilities Mind Map This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topic. ## Data Science for Water Utilities: The next steps I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown system in R-Studio. It will take me about a year to realise this vision as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions: 1. What is missing in this list? Any practical problems I should include? 2. Would you like to donate some data and code to include in the book? Feel free to leave a comment below. # Data Science from a Strategic Business Perspective Last night I spoke at the Melbourne R User Group (MelbuRn) about data science from a strategic business perspective. It was great to see so many people attending. My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal. Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Aquisition). This system controls our assets and provides operators and analysts with the needed data. All this data is used to control our systems and stored for future reference. There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information. In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy, ## Data Science Strategy Presentation # Percentile Calculations in Water Quality Regulations Percentile calculations can be more tricky than at first meets the eye. A percentile indicates the value below which a percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceivingly simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use? The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile. set.seed(1969) test.data <- rnorm(n = 10000, mean = 100, sd = 15) library(ggplot2) ggplot(as.data.frame(test.data), aes(test.data)) + geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") + geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), colour = "red", size = 1) + geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15), colour = "red", fill="red", alpha = 0.5, xlim = quantile(test.data, c(0.5, 0.75))) + theme(text = element_text(size = 16))  The quantile default function and the 95th percentile give the following results: > quantile(test.data) 0% 25% 50% 75% 100% 39.91964 89.68041 100.16437 110.01910 153.50195 > quantile(test.data, probs=0.95) 95% 124.7775  ## Methods of percentile calculations The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods. The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the normal distribution simulated above the outcome for all methods is exactly the same, as illustrated by the following code. > sapply(1:9, function(m) quantile(test.data, 0.95, type = m)) 95% 95% 95% 95% 95% 95% 95% 95% 95% 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775  ## Percentile calculations in water quality The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that: “The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.” The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R? The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name $\hat{Q}_8(p)$. Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data? In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6). weibull.quantile <- function(x, p) { # Order Samples from large to small x <- x[order(x, decreasing = FALSE)] # Determine ranking of percentile according to Weibull (1939) r <- p * (length(x) + 1) # Linear interpolation rfrac <- (r - floor(r)) return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1]) }  ## Turbidity Data Example Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The range of weekly turbidity measurements is between 0.,05 NTU and 0.8 NTU, well below the aesthetic limits. Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53). When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result. ZoneR1R2R3R4R5R6R7R8R9 Bealiba0.3000.3000.2000.2400.2900.3000.2450.3000.300 Dunolly0.400000.400000.300000.340000.390000.435000.345000.405000.40125 Laanecoorie0.500000.500000.400000.440000.490000.535000.445000.505000.50125 Tarnagulla0.40.40.40.40.40.40.40.40.4 The graph and the table were created with the following code snippet: ggplot(turbidity, aes(Result)) + geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + facet_wrap(~Zone) + theme(text=element_text(size=16)) tapply(turbidity$Result, turbidity$Zone, function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))  You can view the code on GitHub. # Lifting the Big Data Veil: Data Science Strategy for Water Utilities The Data Science Venn Diagram (Conway, 2010). In my job as manager data science for a medium-sized water utility in Australia, I have developed a strategy to increased the amount of value we extract from data. Many businesses that seek the promised benefits of Big Data don’t achieve those because they don’t start with the basics. The most important data science strategy advice is to spend a lot of time getting to know and to improve data quality. Good data science needs to comply with these four basic principles: • Utility: The analysis needs to be able to improve reality, otherwise we end with ‘analysis-paralysis‘. Although we speak of data science, it is really data engineering because we are not seeking the truth, we seek improvement of reality. • Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions. • Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions. • Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics. I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic. ## Data Science Strategy for Water Utilities Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle leveraging the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies. You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall. Furthermore, I am interested in creating an alliance with other water utility professionals that write code in R. Feel free to comment below to discuss any thoughts you might have on this issue. # SCADA spikes in Water Treatment Data SCADA spikes are events in the data stream of water treatment plants or similar installations. These SCADA spikes can indicate problems with the process and could result in an increased risk to public health. The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration: “Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.” Turbidity is a measure for the cloudiness of a fluid because of large numbers of individual particles otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language. ## Simulation To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing. In addition, the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU. The image below visualises the simulated data. The next step is to assess this data in accordance with the decision rule. # Simulate data set.seed(1234) turbidity <- data.frame(DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"), by = "min", length.out=24*60), Turbidity = rnorm(n = 24*60, mean = 0.1, sd = 0.01) )  The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame. # Simulate spikes for (i in 1:5) { time <- sample(turbidity$DateTime, 1)
duration <- sample(10:30, 1)
value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
start <- which(turbidity$DateTime == time) turbidity$Turbidity[start:(start+duration - 1)] <- rnorm(duration, value, value/10)
}


The image below visualises the simulated data using the mighty ggplot. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

library(ggplot2)
ggplot(turbidity, aes(x = DateTime, y = Turbidity)) + geom_line(size = 0.2) +
geom_hline(yintercept = 0.5, col = "red") + ylim(0,max(turbidity$Turbidity)) + ggtitle("Simulated SCADA data")  ## SCADA Spikes Detection The following code searches for all spikes over 0.50 NTU using the run length function. This function transforms a vector into a vector of values and lengths. For example, the run length of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is: • lengths: int [1:5] 2 3 4 2 1 • values : num [1:5] 1 2 3 5 6 The value 1 has a length of 1, the value 2 has a length of 3 and so on. The spike detection code creates the run length for turbidity levels greater than 0.5, which results in a boolean vector. The cumsum function calculates the starting point of each spike which allows us to calculate their duration. The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other three spikes are either lower than 0.50 NTU. The first high spike lasted less than 15 minutes. # Spike Detection spike.detect <- function(DateTime, Value, Height, Duration) { runlength <- rle(Value > Height) spikes <- data.frame(Spike = runlength$values,
times <- cumsum(runlength$lengths)) spikes$Times <- DateTime[spikes$times] spikes$Event <- c(0,spikes$Times[-1] - spikes$Times[-nrow(spikes)])
spikes <- subset(spikes, Spike == TRUE & Event > Duration)
return(spikes)
}
spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)


This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

View this code on GitHub.