# Analysing Digital Water Meter Data using the Tidyverse

In last week’s article, I discussed how to simulate water consumption data to help develop analytics and reporting. This post describes how to create a diurnal curve from standard digital metering data.

## Data Source

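The simulated data consists of three fields: the unique device identifier (DevEUI), the time of transmission in UTC (TimeStampUTC) and the cumulative pulse count (Count).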

All analysis is undertaken in the local Australian Eastern Standard Time (AEST). The input to all functions is thus in AEST. The digital water meters send an hourly pulse at a random time within the hour. Each transmitter (RTU) uses a random offset to avoid network congestion. The digital meter counts each time the impeller makes a full turn, and for this analysis, we assume that this equates to a five-litre volume. The ratio between volume and count depends on the meter brand and type. The image below shows a typical data set for an RTU, including some missing data points.

Simulated water consumption (red: measured points, blue: interpolated points).
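The conversion from pulse counts to volume is a simple multiplication. The lines below are a minimal sketch, assuming the five litres per count used in this article; `litres_per_count` and the example reads are made up for illustration.

```r
# Assumed conversion factor for this article; real meters differ by brand and type
litres_per_count <- 5

# Hypothetical cumulative counts from three consecutive hourly reads
counts <- c(10250, 10262, 10271)

# Volume that passed through the meter between consecutive reads, in litres
volume_litres <- diff(counts) * litres_per_count
volume_litres
```

Because the meter reports a cumulative count, the volume always follows from the difference between two reads and never from a single read.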

To analyse the data we need two auxiliary functions: one to slice the data we need and one to interpolate data for the times we need it. The Tidyverse heavily influences the code in this article. I like the Tidyverse way of doing things because it leads to elegant code that is easy to understand.

```r
library(tidyverse)
library(lubridate)
library(magrittr)
```

## Slicing Digital Water Metering Data

Data analysis is undertaken on slices of the complete data set. This function slices the available data by a vector of RTU ids and a timestamp range in AEST, and adds a new timestamp variable in AEST. If no date range is provided, the function returns all available data for the selected RTUs. The output of this function is a data frame (a Tibble in Tidyverse language).

```r
slice_reads <- function(rtus, dates = range(meter_reads$TimeStampUTC)) {
  # Select the requested RTUs, add local time and restrict to the date range
  filter(meter_reads, DevEUI %in% rtus) %>%
    mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, tz = "Australia/Melbourne"))) %>%
    filter(TimeStampAEST >= as.POSIXct(dates[1]) &
           TimeStampAEST <= as.POSIXct(dates[2])) %>%
    arrange(DevEUI, TimeStampAEST)
}
```
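The time zone conversion inside the slicing function can be tried on a single timestamp. This is a small base R sketch with a made-up read. Note that Melbourne observes daylight saving time in February, so the offset from UTC is eleven hours.

```r
# A single read, stamped in UTC as transmitted by the RTU
utc <- as.POSIXct("2020-02-01 00:00:00", tz = "UTC")

# Wall-clock time in Melbourne for the same instant
melbourne <- format(utc, format = "%Y-%m-%d %H:%M", tz = "Australia/Melbourne")
melbourne

# Hour of the day in local time, as needed later for diurnal curves
local_hour <- as.integer(format(utc, format = "%H", tz = "Australia/Melbourne"))
local_hour
```

Storing the raw data in UTC and converting only at analysis time keeps the stored timestamps unambiguous when daylight saving starts or ends.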

This function interpolates the cumulative counts for a series of RTUs over a vector of timestamps in AEST. The function creates a list to store the results for each RTU, interpolates the data using the approx function and then flattens the list back to a data frame. The interpolation step uses a different type of pipe because the approx function does not take a data argument. The %$% pipe from the Magrittr package solves that problem.

The output is a data frame with DevEUI, the timestamp in AEST and the interpolated cumulative count. The image above shows the counts for two meters over two days, and the graph superimposes interpolated points over the raw data. Although the actual data consists of integer counts, interpolated values are numeric. The decimals are retained to distinguish them from real reads.

```r
interpolate_count <- function(rtus, timestamps) {
  timestamps <- as.POSIXct(timestamps, tz = "Australia/Melbourne")
  results <- vector("list", length(rtus))
  for (r in seq_along(rtus)) {
    # %$% exposes the columns of the slice to approx()
    interp <- slice_reads(rtus[r]) %$%
      approx(TimeStampAEST, Count, timestamps)
    results[[r]] <- tibble(DevEUI = rep(rtus[r], length(timestamps)),
                           TimeStampAEST = timestamps,
                           Count = interp$y)
  }
  bind_rows(results)
}

interpolate_count(rtu[2:3], seq.POSIXt(as.POSIXct("2020-02-01"), as.POSIXct("2020-02-02"), by = "day"))

# Raw reads (red) with interpolated midnight counts (blue) for one RTU
# (the slice range is assumed from the interpolation timestamps below)
slice_reads(rtu[2], c("2020-02-06", "2020-02-08")) %>%
  ggplot(aes(x = TimeStampAEST, y = Count)) +
  geom_line(col = "grey", size = 1) +
  geom_point(col = "red") +
  geom_point(data = interpolate_count(rtu[2], as.POSIXct("2020-02-06") + (0:2)*24*3600), colour = "blue") +
  ggtitle(paste("DevEUI", rtu[2]))
```
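The interpolation step is easier to follow on a handful of reads. The sketch below uses made-up reads for one RTU and base R only, with `with()` standing in for the %$% pipe.

```r
# Hypothetical reads for one RTU, transmitted 20 minutes past each hour
reads <- data.frame(
  TimeStampAEST = as.POSIXct(c("2020-02-01 10:20", "2020-02-01 11:20",
                               "2020-02-01 12:20"), tz = "Australia/Melbourne"),
  Count = c(100, 106, 118)
)

# Whole hours at which we want an estimated cumulative count
whole_hours <- as.POSIXct(c("2020-02-01 11:00", "2020-02-01 12:00"),
                          tz = "Australia/Melbourne")

# approx() has no data argument, so with() exposes the columns of the data frame
interp <- with(reads, approx(as.numeric(TimeStampAEST), Count,
                             xout = as.numeric(whole_hours)))
interp$y
```

Each whole hour lies two-thirds of the way between two reads, so the estimates are 104 and 114 counts. By default, approx returns NA for timestamps outside the range of the available reads.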

With these two auxiliary functions, we can start analysing the data.

## Daily Consumption

Daily consumption for each connection is a critical metric in managing water resources and billing customers. The daily consumption of any water connection is defined by the difference between the cumulative counts at midnight. The interpolation function makes it easy to determine daily consumption. This function interpolates the midnight reads for each of the RTUs over the period, starting the previous day. The output of the function is a data frame that can be piped into the plotting function to visualise the data. When you group the data by date, you can also determine the total consumption over a group of services.

```r
daily_consumption <- function(rtus, dates) {
  # Midnight timestamps, starting the day before the requested period
  timestamps <- seq.POSIXt(as.POSIXct(min(dates)) - 24 * 3600, as.POSIXct(max(dates)), by = "day")
  interpolate_count(rtus, timestamps) %>%
    group_by(DevEUI) %>%
    mutate(Consumption = c(0, diff(Count)) * 5, # litres
           Date = format(TimeStampAEST, "%F")) %>%
    filter(TimeStampAEST != timestamps[1]) %>%
    select(DevEUI, Date, Consumption)
}

daily_consumption(rtu[32:33], c("2020-02-01", "2020-02-07")) %>%
  ggplot(aes(x = Date, y = Consumption)) + geom_col() +
  facet_wrap(~DevEUI) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Analysing digital water meter data: Daily consumption.
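The arithmetic behind daily consumption, and the grouping by date to total a set of services, can be checked on a toy example. This is a base R sketch with made-up midnight counts for two hypothetical RTUs, not the simulated data set.

```r
litres_per_count <- 5  # assumption used throughout this article

# Hypothetical interpolated midnight counts for two RTUs over four midnights
midnight <- data.frame(
  DevEUI = rep(c("A", "B"), each = 4),
  Date = rep(as.Date("2020-01-31") + 0:3, times = 2),
  Count = c(1000, 1120, 1230, 1400,  # RTU A
            500, 560, 640, 700)      # RTU B
)

# Consumption per RTU: difference between consecutive midnight counts
midnight$Consumption <- ave(midnight$Count, midnight$DevEUI,
                            FUN = function(x) c(NA, diff(x))) * litres_per_count

# Drop the leading midnight, which only serves as the starting point
daily <- midnight[!is.na(midnight$Consumption), ]

# Total consumption over the group of services for each date
totals <- aggregate(Consumption ~ Date, data = daily, FUN = sum)
totals
```

The differences must be taken within each RTU, which is why the sketch groups by DevEUI before differencing, just as the function above does with group_by.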

## Diurnal Curves

The diurnal curve is one of the most important pieces of information used in the design of water supply systems. This curve shows the usage of one or more services for each hour in the day. This curve is a reflection of human behaviour, as we use most water in the morning and the evenings.

This function slices data for a vector of RTUs over a period and then plots the average diurnal curve. The data is obtained by interpolating the cumulative counts for each whole hour in the period. The function then calculates the flow in litres per hour and visualises the minimum, mean and maximum value.

```r
plot_diurnal_connections <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
  interpolate_count(rtus, timestamps) %>%
    mutate(Rate = c(0, diff(Count * 5)), # litres per hour
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>%
    # Remove negative rates, such as the jump where one RTU's series ends and the next begins
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %>%
    ggplot(aes(x = Hour, ymin = min, ymax = max)) +
    geom_ribbon(fill = "lightblue", alpha = 0.5) +
    geom_line(aes(x = Hour, y = mean), col = "orange", size = 1) +
    ggtitle("Connections Diurnal flow") + ylab("Flow rate [L/h]")
}

plot_diurnal_connections(rtu[12:20], c("2020-02-01", "2020-03-01"))
```

Analysing digital water meter data: Diurnal curve.
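The summary step behind the diurnal curve can be verified on a series with a known pattern. The sketch below uses base R and a made-up meter whose second day uses exactly twice as much water as the first. It uses UTC only to keep the example reproducible.

```r
# Hypothetical hourly consumption in counts: day one follows 1, 2, ..., 24,
# day two uses exactly twice as much
hourly_counts <- c(1:24, 2 * (1:24))

# Cumulative counts at each whole hour, starting from zero at midnight
timestamps <- seq(as.POSIXct("2020-02-01 00:00", tz = "UTC"),
                  by = "hour", length.out = 49)
count <- cumsum(c(0, hourly_counts))

# Flow in litres per hour, attributed to the hour of the closing read
rate <- diff(count) * 5
hour <- as.integer(format(timestamps[-1], "%H"))

# Minimum, mean and maximum flow for each hour of the day
diurnal_stats <- data.frame(
  Hour = sort(unique(hour)),
  min = as.vector(tapply(rate, hour, min)),
  mean = as.vector(tapply(rate, hour, mean)),
  max = as.vector(tapply(rate, hour, max))
)
head(diurnal_stats, 3)
```

For every hour, the maximum is twice the minimum and the mean sits halfway between them, which is what the doubled second day should produce.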

Boxplots are also an informative way to visualise this curve. This method provides more statistical information on one page, and the ggplot function performs the statistical analysis.

```r
plot_diurnal_box <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
  interpolate_count(rtus, timestamps) %>%
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>%
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    ggplot(aes(x = factor(Hour), y = Rate)) +
    geom_boxplot() +
    ggtitle("Diurnal flow") + ylab("Flow rate [L/h]") + xlab("Time")
}

plot_diurnal_box(rtu[12:20], c("2020-02-01", "2020-03-01"))
```

Analysing digital water meter data: Diurnal curve (boxplots).

## Further Analysing Digital Water Metering Data

These are only glimpses into what is possible with this type of data. Further algorithms need to be developed to extract additional value from this data. I am working on developing leak detection algorithms and clustering diurnal curves, daily consumption graphs and so on. Any data science enthusiast who is interested in helping me to develop an Open Source R library to analyse digital metering data is welcome to get in touch.

# Simulating Water Consumption to Develop Analysis and Reporting

I am currently working on developing analytics for a digital water metering project. Over the next five years, we are enabling 70,000 customer water meters with digital readers and transmitters. The data is not yet available, but we don’t want to wait to build reporting systems until after the data is live. The R language comes to the rescue as it has magnificent capabilities to simulate data. Simulating data is a useful technique to progress a project while the data is still being collected. Simulated data also helps because the expected outcomes of the analysis are known, which makes it possible to validate the results.

The raw data that we will eventually receive from the digital customer meters has the following basic structure:

• DevEUI: Unique device identifier.
• Timestamp: Date and time (in UTC) of the transmission.
• Cumulative count: The number of revolutions the water meter makes. Each revolution is a pulse which equates to five litres of water.

Every device will send an hourly data burst which contains the cumulative meter read in pulse counts. The transmitters are set at a random offset from the whole hour, to minimise the risk of congestion at the receivers. The time stamp for each read is set in Coordinated Universal Time (UTC). Using this time zone prevents issues with daylight saving. All analysis will be undertaken in the Australian Eastern (Daylight) Time zone.

This article explains how we simulated test data to assist with developing reporting and analysis. The analysis of digital metering data follows in a future post. The code and the data can be found on GitHub. I have recently converted to using the Tidyverse for all my R coding. It has made my working life much easier and I will use it for all future posts.

## Simulating water consumption

For simplicity, this simulation assumes a standard domestic diurnal curve (average daily usage pattern) for indoor water use. Diurnal curves are an important piece of information in water management. The curve shows water consumption over the course of a day, averaged over a fixed period. The example below is sourced from a journal article. This generic diurnal curve consists of 24 data points based on measured indoor water consumption, shown in the graph below.

Source: Gurung et al. (2014) Smart meters for enhanced water supply network modelling and infrastructure planning. Resources, Conservation and Recycling (90), 34-50.

This diurnal curve only includes indoor water consumption and is assumed to be independent of seasonal variation. This is not a realistic assumption, but the purpose of this simulation is not to accurately model water consumption but to provide a data set to validate the reporting and analyses.

## Simulating water consumption in R

The first code snippet sets the parameters used in this simulation. The unique device identifiers (DevEUI) are simulated as seven-digit random numbers. The timestamps vector consists of hourly date-time variables in UTC. For each individual transmitter, this timestamp is offset by a random number of seconds. Each transmitter is also associated with the number of people living in the house. This number is based on a Poisson distribution, shifted by one so that every house has at least one occupant.

```r
# Libraries
library(tidyverse)
# Boundary conditions
n <- 100 # Number of simulated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC") # Start of simulation

set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # Seven-digit id
offset <- sample(0:3599, n, replace = TRUE) # Random offset in seconds for each RTU

# Number of occupants per connection
occupants <- rpois(n, 1.5) + 1
as.data.frame(occupants) %>%
  ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) +
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")
```

Simulated number of occupants per connection.

The diurnal curve is based on actual data which includes leaks as the night time use shows a consistent flow of about one litre per hour. For that reason, the figures are rounded and reduced by one litre per hour, to show a zero flow when people are usually asleep. The curve is also shifted by eleven hours because the raw data is stored in UTC.

```r
diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1

data.frame(TimeUTC = 0:23, Flow = diurnal) %>%
  ggplot(aes(x = TimeUTC, y = Flow)) +
  geom_area(fill = "dodgerblue2", alpha = 0.5) +
  scale_x_continuous(breaks = 0:23) + ylab("Flow [L/h/p]") +
  ggtitle("Idealised diurnal curve for households")
ggsave("Hydroinformatics/DigitalMetering/diurnal_curve.png", dpi = 300)

# Shift the curve so that the first value corresponds to 00:00 UTC
tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])
```
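The shift at the end of this snippet is a rotation of the 24 values. The sketch below applies the same indexing to the hours of the day instead of the flows, which makes the effect visible; `local_hour` is a stand-in vector for illustration.

```r
# Hours of the day in local time, standing in for the 24 flow values
local_hour <- 0:23

tdiff <- 11  # offset in hours between UTC and local time, as assumed above

# Rotate the vector so that position 1 holds the value for 00:00 UTC
utc_order <- c(local_hour[(tdiff + 1):24], local_hour[1:tdiff])
utc_order
```

The first element is 11, meaning that the read stamped 00:00 UTC receives the flow for 11:00 local time. No values are lost in the rotation; they only change position.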

This simulation only aims to simulate a realistic data set and not to present an accurate depiction of reality. This simulation could be enhanced by using different diurnal curves for various customer segments and to include outdoor watering, temperature dependencies and so on.

## Simulating Water Consumption

A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (θ = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.
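The combination of a binomial draw and a random leak size can be explored on its own. This is a minimal sketch with an arbitrary seed and a larger, hypothetical number of connections (`n_test`) so that the share of leaks is easier to see.

```r
set.seed(42)    # arbitrary seed, only for a repeatable illustration
n_test <- 1000  # hypothetical number of connections

# 1 marks a leaking property (probability 0.1), 0 a sound one
has_leak <- rbinom(n_test, 1, prob = 0.1)

# Leak size in litres per hour, kept only where has_leak equals 1
leak_size <- has_leak * sample(10:50, n_test, replace = TRUE)

mean(leak_size > 0)              # share of leaking properties, close to 0.1
range(leak_size[leak_size > 0])  # leak sizes stay within 10 to 50 L/h
```

Multiplying the two vectors sets the leak to zero for every property without a leak, so a single vector holds both the presence and the size of each leak.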

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defined by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal curve deviates from the model curve by ±10%. Any predetermined leakage is added to each meter read over the whole period of 100 days. The hourly volumes are summed cumulatively to simulate meter reads. The flow is divided by five as each meter revolution indicates five litres.

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.

```r
# Leak simulation
leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE)
data.frame(DevEUI = rtu, Leak = leaks) %>%
  subset(Leak > 0)

# Digital metering data simulation
meter_reads <- matrix(ncol = 3, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC", "Count")

for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
  meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
  meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) *
                                    runif(24 * d, 0.9, 1.1))/5)
}

# Convert the matrix to a tibble and restore the timestamp class
meter_reads <- meter_reads %>%
  as_tibble() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz ="UTC"))
```
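The meter-count calculation inside the loop can be followed more easily for a single connection. The sketch below uses a made-up demand profile, occupancy and leak rather than the values from the simulation.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Hypothetical inputs for one connection
flow_per_person <- c(rep(0, 6), rep(10, 4), rep(5, 8), rep(10, 4), rep(0, 2))  # L/h/p
people <- 3  # occupants
leak <- 20   # constant leak in L/h
days <- 2

# Hourly volume: diurnal demand plus leak, varied randomly by up to 10%
hourly_litres <- (rep(flow_per_person * people, days) + leak) *
  runif(24 * days, 0.9, 1.1)

# Cumulative meter count: one revolution for every five litres
count <- round(cumsum(hourly_litres) / 5)
tail(count, 1)
```

Without the random variation this connection would use 840 litres per day, or 336 counts over two days, so the final count should land within ten percent of that figure. The cumulative sum guarantees that the simulated count never decreases, as with a real meter.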

## Missing Data Points

The data transmission process is not 100% reliable and the base station will not receive some reads. This simulation identifies reads to be removed from the data through the temporary variable remove. This simulation includes two types of failures:

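• Faulty RTUs (2% of RTUs, each missing 95% of its data)
• Randomly missing data points (1% of data)

```r
# Initialise temp variable
meter_reads <- mutate(meter_reads, remove = 0)

# Define faulty RTUs (2% of fleet) and flag 95% of their reads
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
is_faulty <- meter_reads$DevEUI %in% faulty
meter_reads$remove[is_faulty] <- rbinom(sum(is_faulty), 1, prob = 0.95)

# Data loss: short runs of consecutive reads go missing
# (the number of runs is an assumption, chosen to lose roughly 1% of reads)
missing <- sample(1:(nrow(meter_reads) - 5), round(0.0025 * nrow(meter_reads)))
for (m in missing){
  meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1
}

# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%
  select(-remove)

# Visualise (the two RTUs shown are an arbitrary choice)
filter(meter_reads, DevEUI %in% rtu[2:3]) %>%
  mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC,
                                           tz = "Australia/Melbourne"))) %>%
  filter(TimeStampAEST >= as.POSIXct("2020-02-06") &
         TimeStampAEST <= as.POSIXct("2020-02-08")) %>%
  arrange(DevEUI, TimeStampAEST) %>%
  ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI))) +
  geom_line() + geom_point()
```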

The graph shows an example of the cumulative reads and some missing data points.
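One way to confirm that the simulated losses are present is to look at the time between consecutive reads. This is a small base R sketch on made-up timestamps for a single RTU, not a method from the original analysis.

```r
# Hypothetical hourly reads for one RTU, with reads 5 to 7 lost in transmission
all_reads <- seq(as.POSIXct("2020-02-06 00:10", tz = "UTC"),
                 by = "hour", length.out = 12)
received <- all_reads[-(5:7)]

# Time between consecutive received reads, in hours
gap_hours <- as.numeric(difftime(received[-1], received[-length(received)],
                                 units = "hours"))

# A gap longer than one hour means reads are missing in between
missing_reads <- sum(round(gap_hours) - 1)
missing_reads
```

The three removed reads show up as a single gap of four hours, which is the kind of gap that the interpolation function has to bridge.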

## Analysing Digital Metering Data

Data simulation is a good way to develop your analysis algorithms before you have real data. I have also used this technique when I was waiting for survey results during my dissertation. When the data finally arrived, I simply had to plug it into the code and fine-tune the code. R has great capabilities to simulate reality to help you understand the data.

In next week’s article, I will outline how I used R and the Tidyverse package to develop libraries to analyse digital metering data.

# Visualising Water Consumption using a Geographic Bubble Chart

A geographic bubble chart is a straightforward method to visualise quantitative information with a geospatial relationship. Last week I was in Vietnam helping the Phú Thọ Water Supply Joint Stock Company with their data science. They asked me to create a map of a sample of their water consumption data.

In this post, I share this little ditty to explain how to plot a bubble chart over a map using the ggmap package.

## Load and Explore the Data

The sample data contains a list of just over 100 readings from water meters in the city of Việt Trì in Vietnam, plus their geospatial location. This data uses the World Geodetic System of 1984 (WGS84), which is compatible with Google Maps and similar systems.

```r
# Load the data into a data frame named water
# (columns used in this article: lon, lat and Consumption)

# Summarise the data
summary(water$Consumption)
```

The consumption at each connection is between 0 and 529 cubic metres, with a mean consumption of 23.45 cubic metres.

## Visualise the data with a geographic bubble chart

With the ggmap extension of the ggplot2 package, we can visualise any spatial data set on a map. The only condition is that the spatial coordinates are in the WGS84 datum. The ggmap package adds a geographical layer to ggplot by adding a Google Maps or OpenStreetMap canvas.

The first step is to download the map canvas. To do this, you need to know the centre coordinates and the zoom factor. Determining the perfect zoom factor requires some trial and error. The ggmap package provides for various map types, which are described in detail in the documentation.

```r
# Load map library
library(ggmap)

# Find the middle of the points
centre <- c(mean(range(water$lon)), mean(range(water$lat)))

# Download the map canvas and turn it into a plot object
viettri <- get_map(centre, zoom = 17, maptype = "hybrid")
g <- ggmap(viettri)
```
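The centre used here is the midpoint of the bounding box, not the average of the coordinates. The difference matters when the points are clustered, as this sketch with made-up coordinates shows.

```r
# Hypothetical meter locations (WGS84), three of them clustered in the west
pts <- data.frame(lon = c(105.40, 105.40, 105.40, 105.44),
                  lat = c(21.30, 21.30, 21.30, 21.34))

# Centre of the bounding box: midpoint between the extremes
centre_box <- c(lon = mean(range(pts$lon)), lat = mean(range(pts$lat)))

# Mean of the coordinates, which is pulled towards the cluster
centre_mean <- c(lon = mean(pts$lon), lat = mean(pts$lat))

centre_box
centre_mean
```

The bounding-box midpoint keeps the outermost points at equal distance from the centre of the canvas, which is what matters when all points need to fit on the map.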

The ggmap package follows the same conventions as ggplot. We first call the map layer and then add any required geom. The point geom creates a nice bubble chart when used in combination with the `scale_size_area` option. This option scales the area of each point in proportion to its value, up to a maximum size, so that the points are easily visible. The transparency (alpha) minimises problems with overplotting. This last code snippet plots the map with water consumption.

```r
# Add the points
g + geom_point(data = water, aes(x = lon, y = lat, size = Consumption),
               shape = 21, colour = "dodgerblue4", fill = "dodgerblue", alpha = .5) +
  scale_size_area(max_size = 20) + # Size of the biggest point
  ggtitle("Việt Trì sự tiêu thụ nước") # Title: Việt Trì water consumption
```
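Scaling bubbles by area rather than by radius keeps large values from dominating the map. The sketch below shows only the geometric idea with made-up consumption values. It is not how ggplot2 implements the scale internally, and `max_radius` is an illustrative name.

```r
# Hypothetical consumption values in cubic metres
consumption <- c(0, 25, 100, 400)

max_radius <- 20  # illustrative radius for the largest bubble

# Area proportional to value means radius proportional to its square root
radius <- max_radius * sqrt(consumption / max(consumption))
radius
```

A connection that uses four times as much water gets a bubble with twice the radius, and a connection with zero consumption gets no bubble at all.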

You can find the code and data for this article on my GitHub repository. With thanks to Ms Quy and Mr Tuyen of Phu Tho water for their permission to use this data.

This map visualises water consumption in the targeted area of Việt Trì. The larger the bubble, the larger the consumption. It is no surprise that two commercial customers used the most water. Ggplot automatically adds the legend for the consumption variable.