# Data Science from a Strategic Business Perspective

Last night I spoke at the Melbourne R User Group (MelbuRn) about data science from a strategic business perspective. It was great to see so many people attending.

My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal.

Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Acquisition). This system controls our assets and provides operators and analysts with the data they need. All this data is used to control our systems and stored for future reference.

There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information.

In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy.

# Percentile Calculations in Water Quality Regulations

Percentile calculations can be trickier than meets the eye. A percentile indicates the value below which a given percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceptively simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile.

set.seed(1969)
test.data <- rnorm(n = 10000, mean = 100, sd = 15)
library(ggplot2)
ggplot(as.data.frame(test.data), aes(test.data)) +
    geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") +
    geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15),
              colour = "red", size = 1) +
    geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15),
              colour = "red", fill = "red", alpha = 0.5,
              xlim = quantile(test.data, c(0.5, 0.75))) +
    theme(text = element_text(size = 16))


The default quantile function and the 95th percentile give the following results:

> quantile(test.data)
      0%       25%       50%       75%      100%
39.91964  89.68041 100.16437 110.01910 153.50195

> quantile(test.data, probs=0.95)
95%
124.7775


## Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as those found in water quality data. For the normal distribution simulated above, the outcome of all methods is exactly the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(test.data, 0.95, type = m))

95%      95%      95%      95%      95%      95%      95%      95%      95%
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775
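To illustrate that the methods do diverge for skewed data, the sketch below is my own addition: it uses simulated lognormal data (not actual water quality results) with a smaller sample size, where the interpolation choices become visible.

```r
# Simulate positively-skewed data, similar in shape to turbidity results
set.seed(1969)
skewed <- rlnorm(n = 100, meanlog = -1, sdlog = 0.5)
# The nine interpolation methods now give different 95th percentiles
sapply(1:9, function(m) quantile(skewed, 0.95, type = m, names = FALSE))
```

With only 100 skewed observations, the nine results are no longer identical, which is why the choice of method matters for regulatory reporting.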


## Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name $\hat{Q}_8(p)$. Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6).

weibull.quantile <- function(x, p) {
    # Order samples from small to large
    x <- x[order(x, decreasing = FALSE)]
    # Determine the rank of the percentile according to Weibull (1939)
    r <- p * (length(x) + 1)
    # Linear interpolation between the two nearest observations
    rfrac <- (r - floor(r))
    return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1])
}
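A quick check, my own sketch, confirms the equivalence with type 6. The function definition is repeated here so the snippet runs standalone.

```r
# weibull.quantile as defined above, repeated so this check is self-contained
weibull.quantile <- function(x, p) {
    x <- x[order(x, decreasing = FALSE)]
    r <- p * (length(x) + 1)
    rfrac <- (r - floor(r))
    (1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1]
}
set.seed(1969)
test.data <- rnorm(n = 10000, mean = 100, sd = 15)
# Both calls should return the same 95th percentile
weibull.quantile(test.data, 0.95)
quantile(test.data, 0.95, type = 6, names = FALSE)
```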


## Turbidity Data Example

Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The range of weekly turbidity measurements is between 0.05 NTU and 0.8 NTU, well below the aesthetic limit.

Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53).

When we calculate the percentiles for all nine methods available in the base-R function we see that the so-called Weibull method generally provides the most conservative result.

| Zone | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 |
|------|----|----|----|----|----|----|----|----|----|
| Bealiba | 0.300 | 0.300 | 0.200 | 0.240 | 0.290 | 0.300 | 0.245 | 0.300 | 0.300 |
| Dunolly | 0.40000 | 0.40000 | 0.30000 | 0.34000 | 0.39000 | 0.43500 | 0.34500 | 0.40500 | 0.40125 |
| Laanecoorie | 0.50000 | 0.50000 | 0.40000 | 0.44000 | 0.49000 | 0.53500 | 0.44500 | 0.50500 | 0.50125 |
| Tarnagulla | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |

The graph and the table were created with the following code snippet:

ggplot(turbidity, aes(Result)) +
    geom_histogram(binwidth = 0.05, fill = "dodgerblue", aes(y = ..density..)) +
    facet_wrap(~Zone) +
    theme(text = element_text(size = 16))

tapply(turbidity$Result, turbidity$Zone,
       function(x) sapply(1:9, function(m) quantile(x, 0.95, type = m)))


# Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as manager data science for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data fail to achieve them because they don’t start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

• Utility: The analysis needs to be able to improve reality, otherwise we end up with ‘analysis paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth; we seek to improve reality.
• Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
• Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
• Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

## Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle to leverage the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these with examples from water utilities. It closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals who write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.

# SCADA spikes in Water Treatment Data

SCADA spikes are events in the data stream of water treatment plants or similar installations. These SCADA spikes can indicate problems with the process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure of the cloudiness of a fluid caused by large numbers of individual particles that are otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language.
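The first clause of the decision rule, the 95% threshold, can be checked with a one-liner. The sketch below is my own minimal illustration using simulated readings (the variable name `reading` is illustrative); the consecutive-minutes clause requires the run-length approach developed later in this article.

```r
# Illustrative one-minute turbidity readings for a single day
set.seed(1234)
reading <- rnorm(n = 24 * 60, mean = 0.1, sd = 0.01)
# First clause: turbidity <= 0.2 NTU for at least 95% of readings
mean(reading <= 0.2) >= 0.95
```

The `mean` of a logical vector gives the proportion of TRUE values, which makes percentage-based compliance checks very concise in R.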

## Simulation

To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing. In addition, the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU.

# Simulate data
set.seed(1234)
turbidity <- data.frame(
    DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"),
                          by = "min", length.out = 24 * 60),
    Turbidity = rnorm(n = 24 * 60, mean = 0.1, sd = 0.01)
)


The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
    time <- sample(turbidity$DateTime, 1)
    duration <- sample(10:30, 1)
    value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
    start <- which(turbidity$DateTime == time)
    turbidity$Turbidity[start:(start + duration - 1)] <- rnorm(duration, value, value / 10)
}


The image below visualises the simulated data using the mighty ggplot2. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

library(ggplot2)
ggplot(turbidity, aes(x = DateTime, y = Turbidity)) +
    geom_line(size = 0.2) +
    geom_hline(yintercept = 0.5, col = "red") +
    ylim(0, max(turbidity$Turbidity))


The following code searches for all spikes over 0.50 NTU using the run length function. This function transforms a vector into a vector of values and lengths. For example, the run length of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

• lengths: int [1:5] 2 3 4 2 1
• values : num [1:5] 1 2 3 5 6

The value 1 has a length of 2, the value 2 has a length of 3, and so on. The spike detection code creates the run length for turbidity levels greater than 0.5, which results in a boolean vector. The cumsum function calculates the end point of each run, which allows us to calculate the duration of each spike.
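The rle and cumsum steps can be tried on the example vector above; this small sketch is my own illustration:

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6)
runs <- rle(x)
runs$lengths          # 2 3 4 2 1
runs$values           # 1 2 3 5 6
# cumsum of the lengths gives the index at which each run ends
cumsum(runs$lengths)  # 2 5 9 11 12
```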

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other spikes were either lower than 0.50 NTU or, in the case of the first high spike, lasted less than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
    runlength <- rle(Value > Height)
    spikes <- data.frame(Spike = runlength$values,
                         times = cumsum(runlength$lengths))
    spikes$Times <- DateTime[spikes$times]
    spikes$Event <- c(0, spikes$Times[-1] - spikes$Times[-nrow(spikes)])
    spikes <- subset(spikes, Spike == TRUE & Event > Duration)
    return(spikes)
}

spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)


This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.