Approximations of Pi: A Random Walk through the Beauty of Pi

Of all conceivable numbers, Pi enjoys celebrity status. It is so famous that it even has its own holiday: mathematicians who use the American way of writing dates recognise March the 14th as Pi Day. Some authors have even assigned a mystical status to the number Pi. In the novel Contact by Carl Sagan, the heroine Ellie discovers a hidden message in the base-11 representation of Pi. In the 1998 film Pi by Darren Aronofsky, the protagonist is driven mad by the idea that a secret pattern hides in random numbers. The string of digits that forms the decimals of Pi might seem random, but it is of course perfectly predictable and calculable. Their apparent randomness can nevertheless create artful visualisations.

This article discusses some approximations of Pi using the R language and visualises the results. The video below is an ode to the aesthetic beauty of the digits of Pi by Numberphile.

Approximations of Pi

My favourite approximation of Pi has served me well through the early part of my engineering career. When using calculators that did not have a button for \pi , I used 22/7. It is only accurate to two decimals, but that is more than enough for dredging projects.
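
A quick check in R shows how close this approximation is:

22 / 7       # 3.142857
22 / 7 - pi  # error of about 0.00126, so correct to two decimal places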

Base R can display Pi to 22 significant digits using options(digits = 22), although a double-precision number is only accurate to about 15 digits. If you need more precision, the Rmpfr package provides a library for multiple-precision floating-point computations. Its Const function can compute Pi to a high level of precision with thousands of digits.

library(Rmpfr)
pi_mpfr <- Const("pi", prec = 200000)

There are also more creative ways to estimate Pi, such as Monte Carlo simulation. Imagine you are terribly bored, sitting in a room with floorboards. You take a needle of length l, while the floorboards are t wide; the needle is half as long as the boards are wide (t/l = 2). You drop the needle at random about ten million times (n = 10,000,000) and record every time it crosses the groove between two boards (o). The number Pi can then be approximated by:

\pi \approx \frac{n \cdot l}{o \cdot t}

This bit of code simulates dropping the needle ten million times, which gives an estimated value of \pi \approx 3.143096. You would have to be extremely bored to throw a needle that many times only to achieve accuracy to two decimals!

library(tidyverse)
t <- 2
l <- 1
n <- 10000000

# Simulate n random needle drops: angle phi and starting point (x1, y1)
needles <- data_frame(phi = runif(n, 0, 2 * pi),
           x1 = runif(n, 0, (l + 3)), 
           y1 = runif(n, 0, (l + 3)),
           x2 = x1 + l * sin(phi),
           y2 = y1 + l * cos(phi), 
           # Does the needle cross one of the grooves at x = 1 or x = 3?
           overlap = (x1 < 1 & x2 > 1) | (x1 < 3 & x2 > 3))

ggplot(needles[1:1000,], aes(x1, y1)) + 
  geom_segment(aes(xend = x2, yend = y2, colour = overlap)) + 
  geom_vline(xintercept = c(1, 3), colour = "red") +
  scale_color_manual(values = c("gray", "black")) + 
  theme_void() 
ggsave("Misc/buffon.png", dpi = 300)

pi_buffon <- (n * l) / (sum(needles$overlap) * t)

Approximations of Pi: A Random Walk through the digits of Pi

Visualising the number Pi

In 1888, John Venn, inventor of the eponymous Venn diagram, wanted to visualise the apparent randomness of the digits of Pi. He obtained the first 707 digits of Pi from amateur mathematician William Shanks. Unfortunately, only the first 527 decimals were correct, but this was not discovered until 1944. Venn assigned a compass point to the digits 0 to 7 and then drew lines to show the path indicated by each digit, ignoring the digits 8 and 9.

Running through the values of Pi this way produces a line that snakes its way through the graph. When you use random numbers with the same method, the graph looks very similar. In this case, I have downloaded the digits of Pi from the On-Line Encyclopedia of Integer Sequences (Nr. A000796).

library(tidyverse)
# Download Pi Digits
pi_digits <- read.csv("http://oeis.org/A000796/b000796.txt", header = FALSE, sep = " ", skip = 1) %>%
  select(digit = V2)

# Venn walk
venn_walk <- function(digits) {
  digits <- digits[digits != 8 & digits != 9]
  l <- length(digits) - 1
  x <- rep(0, l)
  y <- x
  for (i in 1:l) {
    a <- digits[i + 1] * pi / 4
    dx <- round(sin(a))
    dy <- round(cos(a))
    x[i + 1] <- x[i] + dx 
    y[i + 1] <- y[i] + dy
  }
  coords <- data_frame(x = x, y = y) 
  ggplot(coords, aes(x, y)) + geom_path() + 
    geom_point(data = coords[c(1, l + 1), ], aes(x, y), colour = "red", size = 2) + 
    theme_void() 
} 
venn_walk(pi_digits) 
ggsave("Misc/venn_pi_walk.png", dpi = 300) 

# Random Numbers
data.frame(digit = sample(0:7, 20000, replace = TRUE)) %>%
 venn_walk()

Random walk through the digits of Pi - The Venn method

The Numberphile video shows some beautiful visualisations of Pi, one of which I would like to share with you to close this article. Martin Krzywinski created this, and many other, visualisations of Pi.

data_frame(x = rep(1:20, 20),
           y = unlist(lapply(1:20, function(x) rep(x, 20))),
           d = pi_digits$digit[1:400]) %>%
  mutate(d = factor(d)) %>%
  ggplot(aes(x, y, colour = d)) + geom_point(size = 3) +
  theme_void() + 
  theme(plot.background = element_rect(fill = "black"),
        panel.background = element_rect(fill = "black"),
        legend.position="none")
ggsave("Misc/pi_dots.png")

As always, you can find the latest version of this code on GitHub. Feel free to subscribe to this blog if you would like to receive articles in your mailbox.

Approximations of Pi: Art work that visualises the random nature of the decimals of Pi

Tap Water Sentiment Analysis using Tidytext

In developed countries, tap water is safe to drink and available for a meagre price. Despite the fact that high-quality drinking water is almost freely available, the consumption of bottled water is increasing every year. Bottled water companies use sophisticated marketing strategies, while water utilities are mostly passive providers of public service. Australian marketing expert Russell Howcroft even called water utilities “lazy marketers”. Can we use data science to find out more about how people feel about tap water and learn about the reasons behind this loss in trust in the municipal water supply?

This tap water sentiment analysis estimates the attitudes people have towards tap water by analysing tweets. This article explains how to examine tweets about tap water using the R language for statistical computing and the Tidytext package. The most recent version of the code and the raw data set used in this analysis can be viewed on my GitHub page.

Tap Water Sentiment Analysis

Each tweet that contains the words “tap water” contains a message about the attitude the author has towards that topic. Each text expresses a sentiment about the topic it describes. Sentiment analysis is a data science technique that extracts subjective information from a text. The basic method compares a string of words with a set of words with calibrated sentiments. These calibrated sets are created by asking many people how they feel about a certain word. For example, the word “stink” expresses a negative sentiment, while the word “nice” would be a positive sentiment.

This tap water sentiment analysis consists of three steps. The first step extracts 1000 tweets that contain the words “tap water” from Twitter. The second step cleans the data, and the third step undertakes the analysis and visualises the results.

Extracting tweets using the TwitteR package

The TwitteR package by Geoff Gentry makes it very easy to retrieve tweets using search criteria. You will need to register an application with Twitter to receive the API keys and tokens. In the code below, the actual values have been removed. Follow the instructions in this article to obtain these codes for yourself. This code snippet calls a private file to load the API codes, extracts the tweets and creates a data frame with a tweet id number and its text.

# Init
library(tidyverse)
library(tidytext)
library(twitteR)

# Extract tap water tweets
source("twitteR_API.R")
setup_twitter_oauth(api_key, api_secret, token, token_secret)
tapwater_tweets <- searchTwitter("tap water", n = 1000, lang = "en") %>%
  twListToDF() %>%
  select(id, text)
tapwater_tweets <- subset(tapwater_tweets, !duplicated(tapwater_tweets$text))
tapwater_tweets$text <- gsub("’", "'", tapwater_tweets$text)
write_csv(tapwater_tweets, "Hydroinformatics/tapwater_tweets.csv")
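
The twitteR_API.R file itself only assigns the four credentials. A placeholder version (with hypothetical values, not real keys) looks like this:

# twitteR_API.R - replace the placeholders with your own credentials
api_key <- "xxxx"
api_secret <- "xxxx"
token <- "xxxx"
token_secret <- "xxxx"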

When I first extracted these tweets, a tweet by CNN about tap water in Kentucky that smells like diesel was retweeted many times, so I removed all duplicate tweets from the set. Unfortunately, this left fewer than 300 original tweets in the corpus.

Sentiment analysis with Tidytext

Text analysis can be a powerful tool to help make sense of large volumes of text. The R language has an extensive range of packages to help you undertake such a task. The Tidytext package extends the Tidy Data logic promoted by Hadley Wickham and his Tidyverse software collection.

Data Cleaning

The first step in cleaning the data is to create unigrams, which involves splitting the tweets into individual words that can be analysed. After removing stop words, we can look at which words are most commonly used in the tap water tweets and visualise the result.

# Tokenisation
tidy_tweets <- tapwater_tweets %>%
  unnest_tokens(word, text)

data(stop_words)
tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("tap", "water", "rt", "https", "t.co", "gt", "amp", as.character(0:9)))

tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col(fill = "dodgerblue4") +
    xlab(NULL) + coord_flip() + ggtitle("Most common words in tap water tweets")
ggsave("Hydroinformatics/tapwater_words.png", dpi = 300)

Most common words in tap water sentiment analysis

The most common words relate to drinking the water and to bottled water, which makes sense. The recent issues in Kentucky also feature in this list.

Sentiment Analysis

The Tidytext package provides access to three lexicons of thousands of single English words (unigrams) that were manually assessed for their sentiment. The principle of the sentiment analysis is to compare the words in the text with the words in the lexicon and analyse the results. For example, the statement “This tap water tastes horrible” has a sentiment score of -3 in the AFINN lexicon by Finn Årup Nielsen due to the word “horrible”. In this analysis, I have used the “bing” lexicon published by Liu et al. in 2005.
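
As a minimal illustration of how the lexicon method scores this example (the AFINN lexicon may need to be downloaded via the textdata package the first time it is used):

library(tidyverse)
library(tidytext)

data_frame(text = "This tap water tastes horrible") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"))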

# Sentiment analysis
sentiment_bing <- tidy_tweets %>%
  inner_join(get_sentiments("bing"))

sentiment_bing %>%
  summarise(Negative = sum(sentiment == "negative"), 
            Positive = sum(sentiment == "positive"))

sentiment_bing %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  ggplot(aes(word, n, fill = sentiment)) + geom_col(show.legend = FALSE) + 
    coord_flip() + facet_wrap(~sentiment, scales = "free_y") + 
    ggtitle("Contribution to sentiment") + xlab(NULL) + ylab(NULL)
ggsave("Hydroinformatics/tapwater_sentiment.png", dpi = 300)

This tap water sentiment analysis shows that two-thirds of the words that express a sentiment were negative. The most common negative words were “smells” and “scared”. This analysis is not a positive result for water utilities. Unfortunately, most tweets were not spatially located so I couldn’t determine the origin of the sentiment.

Tap Water sentiment analysis

Sentiment analysis is an interesting explorative technique, but it should not be interpreted as absolute truth. This method is not able to detect sarcasm or irony, and words don’t always have the same meaning as described in the dictionary.

The important message for water utilities is that they need to start taking the aesthetic properties of tap water as seriously as the health parameters. A lack of trust will drive consumers to bottled water or to less healthy alternatives such as soft drinks.

If you would like to know more about customer perceptions of tap water, then read my book Customer Experience Management for Water Utilities by IWA Publishing.

Customer Experience Management

Topological Tomfoolery in R: Plotting a Möbius Strip

Topology is, according to Clifford Pickover, the “silly putty of mathematics”. This branch of maths studies the transformation of shapes, knots and other complex geometry problems.  One of the most famous topics in topology is the Möbius strip. This shape has some unusual properties which have inspired many artists, inventors, mathematicians and magicians.

You can make a Möbius strip by taking a strip of paper, giving it a half-twist and gluing the ends together to form a loop. If you now cut this strip lengthwise in half, you don't end up with two separate strips, but with one long one.

The Möbius strip can also be described with the following parametric equations (where 0 \leq u \leq 2\pi, -1 \leq v \leq 1 and R is the radius of the loop):

x(u,v)= \left(R+\frac{v}{2} \cos \frac{u}{2}\right)\cos u
y(u,v)= \left(R+\frac{v}{2} \cos\frac{u}{2}\right)\sin u
z(u,v)= \frac{v}{2}\sin \frac{u}{2}

The mathematics of this set of parametric equations is not as complex as it looks. R is the radius of the ring, u is the polar angle of each point and v indicates the position across the width of the strip. The term u/2 determines the number of half-twists. To make the band twist twice, change the angle to u.

For my data science day job, I have to visualise some three-dimensional spaces, so I thought I had best learn how to do this by visualising a Möbius strip, using these three equations.

Plotting a Möbius Strip

The RGL package provides the perfect functionality to play with virtual Möbius strips. This package produces interactive three-dimensional plots that you can zoom and rotate, with many options to change lighting, colours, shininess and so on. The code for plotting a Möbius strip is straightforward.

The first section defines the parameters and converts the u and v sequences to a mesh, using the mesh function from the plot3D package. This function creates two matrices with every possible combination of u and v, which are used to calculate the x, y and z points.

The last three lines define a 3D window with a white background and plot the 3D surface in blue. You can explore the figure with your mouse by zooming and rotating it. Parametric equations can be a bit of fun, play with the formula to change the shape and see what happens.

# Moebius strip
library(rgl)
library(plot3D)

# Define Parameters
R <- 5
u <- seq(0, 2 * pi, length.out = 100)
v <- seq(-1, 1, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y

# Möbius strip parametric equations
x <- (R + v/2 * cos(u /2)) * cos(u)
y <- (R + v/2 * cos(u /2)) * sin(u)
z <- v/2 * sin(u / 2)  

# Visualise
bg3d(color = "white")
surface3d(x, y, z, color= "blue")

You can find the latest version of this code on GitHub.

Plotting a Möbius Strip

Plotting a Möbius Strip: RGL output.

We can take it to the next level by plotting a three-dimensional Möbius strip, or a Klein bottle. The parametric equations for the bottle are mind-boggling:

x(u,v) = -\frac{2}{15} \cos u (3 \cos{v}-30 \sin{u}+90 \cos^4{u} \sin{u} -60 \cos^6{u} \sin{u} +5 \cos{u} \cos{v} \sin{u})

y(u,v) = -\frac{1}{15} \sin u (3 \cos{v}-3 \cos^2{u} \cos{v}-48 \cos^4{u} \cos{v} + 48 \cos^6{u} \cos{v} - 60 \sin{u}+5 \cos{u} \cos{v} \sin{u}-5 \cos^3{u} \cos{v} \sin{u}-80 \cos^5{u} \cos{v} \sin{u}+80 \cos^7{u} \cos{v} \sin{u})

z(u,v) = \frac{2}{15} (3+5 \cos{u} \sin{u}) \sin{v}

Where 0 \leq u \leq \pi and 0 \leq v \leq 2\pi.

The code to visualise this bottle is essentially the same, just more complex equations.

u <- seq(0, pi, length.out = 100)
v <- seq(0, 2 * pi, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y
x <- (-2 / 15) * cos(u) * (3 * cos(v) - 30 * sin(u) + 90 * cos(u)^4 * sin(u) - 60 * cos(u)^6 * sin(u) + 5 * cos(u) * cos(v) * sin(u))
y <- (-1 / 15) * sin(u) * (3 * cos(v) - 3 * cos(u)^2 * cos(v) - 48 * cos(u)^4 * cos(v) + 48 * cos(u)^6 * cos(v) - 60 * sin(u) + 5 * cos(u) * cos(v) * sin(u) - 5 * cos(u)^3 * cos(v) * sin(u) - 80 * cos(u)^5 * cos(v) * sin(u) + 80 * cos(u)^7 * cos(v) * sin(u))
z <- (+2 / 15) * (3 + 5 * cos(u) * sin(u)) * sin(v)

bg3d(color = "white")
surface3d(x, y, z, color= "blue", alpha = 0.5)
Plotting a Klein Bottle in RGL

Plotting a Klein Bottle in RGL.

The RGL package has some excellent facilities to visualise three-dimensional objects, far beyond simple strips. I am still learning and am working toward using it to visualise bathymetric surveys of water reservoirs. Möbius strips are, however, a lot more fun.

Creating Real Möbius Strips

Even more fun than playing with virtual Möbius strips is to make some paper versions and start cutting, just like August Möbius did in his research. If you would like to create a more durable Möbius strip, you can purchase a large zipper from your local haberdashery shop, add some hook-and-loop fasteners to the ends and start playing. If you would like to know more about the mathematics of this topological curiosity, then I can highly recommend Clifford Pickover’s book on the topic.

Möbius strip zipper

Möbius strip zipper.

The Möbius Strip in Magic

In the first half of the twentieth century, many magicians used the Möbius strip as a magic trick. The great Harry Blackstone performed it regularly in his show.

If you are interested in magic tricks and Möbius strips, then you can read my ebook on the Afghan bands.

The Möbius Strip in Magic. A Treatise on the Afghan Bands

Analysing Digital Water Meter Data using the Tidyverse

In last week’s article, I discussed how to simulate water consumption data to help develop analytics and reporting. This post describes how to create a diurnal curve from standard digital metering data.

Data Source

The simulated data consists of three fields:

  • DevEUI: the unique device identifier.
  • TimeStampUTC: the date and time of the transmission in UTC.
  • Count: the cumulative meter count in pulses.

All analysis is undertaken in the local Australian Eastern Standard Time (AEST). The input to all functions is thus in AEST. The digital water meters send an hourly pulse at a random time within the hour. Each transmitter (RTU) uses a random offset to avoid network congestion. The digital meter counts each time the impeller makes a full turn, and for this analysis, we assume that this equates to a five-litre volume. The ratio between volume and count depends on the meter brand and type. The image below shows a typical data set for an RTU, including some missing data points.

Simulated water consumption.

Simulated water consumption (red: measured points, blue: interpolated points).

To analyse the data we need two auxiliary functions: one to slice the data we need and one to interpolate data for the times we need it. The Tidyverse heavily influences the code in this article. I like the Tidyverse way of doing things because it leads to elegant code that is easy to understand.

library(tidyverse)
library(lubridate)
library(magrittr)
meter_reads <- read.csv("Hydroinformatics/DigitalMetering/meter_reads.csv")
rtu <- unique(meter_reads$DevEUI)
meter_reads$TimeStampUTC <- as.POSIXct(meter_reads$TimeStampUTC, tz = "UTC")

Slicing Digital Water Metering Data

Data analysis is undertaken on slices of the complete data set. This function slices the available data by a vector of RTU ids and a timestamp range in AEST. This function adds a new timestamp variable in AEST. If no date range is provided, all available data for the selected RTUs is provided. The output of this function is a data frame (a tibble in Tidyverse parlance).

slice_reads <- function(rtus, dates = range(meter_reads$TimeStampUTC)) {
 filter(meter_reads, DevEUI %in% rtus) %>%
    mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, tz = "Australia/Melbourne"))) %>%
    filter(TimeStampAEST >= as.POSIXct(dates[1]) & 
             TimeStampAEST <= as.POSIXct(dates[2])) %>%
    arrange(DevEUI, TimeStampAEST)
}

Interpolation of Meter Reads

This function interpolates the cumulative counts for a series of RTUs over a vector of timestamps in AEST. The function creates a list to store the results for each RTU, interpolates the data using the approx function and then flattens the list back to a data frame. The interpolation code contains a different type of pipe because the approx interpolation function does not take a data argument. The %$% exposition pipe from the magrittr package solves that problem.
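
A quick illustration of the exposition pipe, using the built-in mtcars data:

library(magrittr)
mtcars %$% cor(mpg, disp)  # equivalent to cor(mtcars$mpg, mtcars$disp)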

The output is a data frame with DevEUI, the timestamp in AEST and the interpolated cumulative count. The image above shows the counts for two meters over two days, and the graph superimposes interpolated points over the raw data. Although the actual data consists of integer counts, interpolated values are numeric values. The decimals are retained to distinguish them from real reads.

interpolate_count <- function(rtus, timestamps) {
  timestamps <- as.POSIXct(timestamps, tz = "Australia/Melbourne")
  results <- vector("list", length(rtus))
  for (r in seq_along(rtus)) {
    interp <- slice_reads(rtus[r]) %$%
      approx(TimeStampAEST, Count, timestamps)
    results[[r]] <- data_frame(DevEUI = rep(rtus[r], length(timestamps)), TimeStampAEST = timestamps, Count = interp$y) 
  } 
  return(do.call(rbind, results)) 
} 

interpolate_count(rtu[2:3], seq.POSIXt(as.POSIXct("2020-02-01"), as.POSIXct("2020-02-02"), by = "day")) 

slice_reads(rtu[2], c("2020-02-06", "2020-02-08")) %>%
  ggplot(aes(x = TimeStampAEST, y = Count))  + 
  geom_line(col = "grey", size = 1) + 
    geom_point(col = "red") + 
    geom_point(data = interpolate_count(rtu[2], as.POSIXct("2020-02-06") + (0:2)*24*3600), colour = "blue") + 
    ggtitle(paste("DevEUI", rtu[2]))

With these two auxiliary functions, we can start analysing the data.

Daily Consumption

Daily consumption for each connection is a critical metric in managing water resources and billing customers. The daily consumption of any water connection is defined by the difference between the cumulative counts at midnight. The interpolation function makes it easy to determine daily consumption. This function interpolates the midnight reads for each of the RTUs over the period, starting the previous day. The output of the function is a data frame that can be piped into the plotting function to visualise the data. When you group the data by date, you can also determine the total consumption over a group of services.

daily_consumption <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(min(dates)) - 24 * 3600, as.POSIXct(max(dates)), by = "day") 
  interpolate_count(rtus, timestamps) %>%
    group_by(DevEUI) %>%
    mutate(Consumption = c(0, diff(Count)) * 5,
           Date = format(TimeStampAEST, "%F")) %>%
    filter(TimeStampAEST != timestamps[1]) %>%
    select(DevEUI, Date, Consumption)
}

daily_consumption(rtu[32:33], c("2020-02-01", "2020-02-07")) %>%
  ggplot(aes(x = Date, y = Consumption)) + geom_col() + 
  facet_wrap(~DevEUI) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Analysing digital water meter data: Daily consumption.

Diurnal Curves

The diurnal curve is one of the most important pieces of information used in the design of water supply systems. This curve shows the usage of one or more services for each hour in the day. This curve is a reflection of human behaviour, as we use most water in the morning and the evenings.

This function slices data for a vector of RTUs over a period and then plots the average diurnal curve. The data is obtained by interpolating the cumulative counts for each whole hour in the period. The function then calculates the flow in litres per hour and visualises the minimum, mean and maximum value.

plot_diurnal_connections <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour") 
  interpolate_count(rtus, timestamps) %>% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>% 
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    summarise(min = min(Rate), mean = mean(Rate), max = max(Rate)) %>%
    ggplot(aes(x = Hour, ymin = min, ymax = max)) + 
      geom_ribbon(fill = "lightblue", alpha = 0.5) + 
      geom_line(aes(x = Hour, y = mean), col = "orange", size = 1) +
      ggtitle("Connections Diurnal flow") + ylab("Flow rate [L/h]")
}

plot_diurnal_connections(rtu[12:20], c("2020-02-01", "2020-03-01"))

Analysing digital water meter data: Diurnal curve.

Boxplots are also an informative way to visualise this curve. This method provides more statistical information on one page, and the ggplot function performs the statistical analysis.

plot_diurnal_box <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour") 
  interpolate_count(rtus, timestamps) %>% 
    mutate(Rate = c(0, diff(Count * 5)),
           Hour = as.integer(format(TimeStampAEST, "%H"))) %>% 
    filter(Rate >= 0) %>%
    group_by(Hour) %>%
    ggplot(aes(x = factor(Hour), y = Rate)) + 
      geom_boxplot() + 
      ggtitle("Diurnal flow") + ylab("Flow rate [L/h]") + xlab("Time")
}

plot_diurnal_box(rtu[12:20], c("2020-02-01", "2020-03-01"))

Analysing digital water meter data: Diurnal curve.

Further Analysing Digital Water Metering Data

These are only glimpses into what is possible with this type of data. Further algorithms need to be developed to extract additional value from this data. I am working on developing leak detection algorithms, clustering diurnal curves, daily consumption graphs and so on. Any data science enthusiast who is interested in helping me to develop an open-source R library to analyse digital metering data is welcome to get in touch.
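
As a minimal sketch of the direction this could take (an assumption on my part, not a finished algorithm): because the simulated leaks are modelled as a constant flow, a connection whose interpolated hourly flow never drops to zero over a period is a candidate for a leak. The function below reuses the helpers defined above.

# Hypothetical leak indicator: flag RTUs with a non-zero minimum hourly flow
detect_leaks <- function(rtus, dates) {
  timestamps <- seq.POSIXt(as.POSIXct(dates[1]), as.POSIXct(dates[2]), by = "hour")
  interpolate_count(rtus, timestamps) %>%
    group_by(DevEUI) %>%
    mutate(Rate = c(NA, diff(Count)) * 5) %>%   # hourly flow in litres
    filter(!is.na(Rate)) %>%
    summarise(MinRate = min(Rate, na.rm = TRUE)) %>%
    filter(MinRate > 0)
}

detect_leaks(rtu[1:10], c("2020-02-01", "2020-03-01"))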

The code for this article is available on GitHub.


Simulating Water Consumption to Develop Analysis and Reporting

I am currently working on developing analytics for a digital water metering project. Over the next five years, we are enabling 70,000 customer water meters with digital readers and transmitters. The data is not yet available, but we don't want to wait until the data is live to build reporting systems. The R language comes to the rescue as it has magnificent capabilities to simulate data. Simulating data is a useful technique to progress a project while the data is still being collected. Simulated data also has the advantage that the expected outcomes of the analysis are known, which makes it easier to validate the results.

The raw data that we will eventually receive from the digital customer meters has the following basic structure:

  • DevEUI: Unique device identifier.
  • Timestamp: Date and time in (UTC) of the transmission.
  • Cumulative count: The number of revolutions the water meter makes. Each revolution is a pulse which equates to five litres of water.

Every device will send an hourly data burst which contains the cumulative meter read in pulse counts. The transmitters are set at a random offset from the whole hour to minimise the risk of congestion at the receivers. The time stamp for each read is set in Coordinated Universal Time (UTC). Using this time zone prevents issues with daylight savings. All analysis will be undertaken in the Australian Eastern (Daylight) Time zone.

This article explains how we simulated test data to assist with developing reporting and analysis. The analysis of digital metering data follows in a future post. The code and the data can be found on GitHub. I have recently converted to using the Tidyverse for all my R coding. It has made my working life much easier and I will use it for all future posts.

Simulating water consumption

For simplicity, this simulation assumes a standard domestic diurnal curve (average daily usage pattern) for indoor water use. Diurnal curves are an important piece of information in water management. The curve shows water consumption over the course of a day, averaged over a fixed period. The example below is sourced from a journal article. This generic diurnal curve consists of 24 data points based on measured indoor water consumption, shown in the graph below.

Simulating water consumption: diurnal curve example

Source: Gurung et al. (2014) Smart meters for enhanced water supply network modelling and infrastructure planning. Resources, Conservation and Recycling (90), 34-50.

This diurnal curve only includes indoor water consumption and is assumed to be independent of seasonal variation. This is not a realistic assumption, but the purpose of this simulation is not to accurately model water consumption but to provide a data set to validate the reporting and analyses.

Simulating water consumption in R

The first code snippet sets the parameters used in this simulation. The unique device identifiers (DevEUI) are simulated as six-digit random numbers. The timestamps vector consists of hourly date-time variables in UTC. For each individual transmitter, this timestamp is offset by a random time. Each transmitter is also associated with the number of people living in each house. This number is based on a Poisson distribution.

# Libraries
library(tidyverse)
# Boundary conditions
n <- 100 # Number of simulated meters
d <- 100 # Number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC") # Start of simulation

set.seed(1969) # Seed random number generator for reproducibility
rtu <- sample(1E6:2E6, n, replace = FALSE) # 6-digit id
offset <- sample(0:3599, n, replace = TRUE) # Unique Random offset for each RTU

# Number of occupants per connection
occupants <- rpois(n, 1.5) + 1 
as.data.frame(occupants) %>%
  ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) + 
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")

Simulated number of occupants per connection.

The diurnal curve is based on actual data which includes leaks, as the night-time use shows a consistent flow of about one litre per hour. For that reason, the figures are rounded and reduced by one litre per hour to show zero flow when people are usually asleep. The curve is also shifted by eleven hours because the raw data is stored in UTC.

diurnal <- round(c(1.36, 1.085, 0.98, 1.05, 1.58, 3.87, 9.37, 13.3, 12.1, 10.3, 8.44, 7.04, 6.11, 5.68, 5.58, 6.67, 8.32, 10.0, 9.37, 7.73, 6.59, 5.18, 3.55, 2.11)) - 1 

data.frame(TimeUTC = 0:23, Flow = diurnal) %>% 
  ggplot(aes(x = TimeUTC, y = Flow)) + 
  geom_area(fill = "dodgerblue2", alpha = 0.5) +
  scale_x_continuous(breaks = 0:23) + ylab("Flow [L/h/p]") + 
  ggtitle("Idealised diurnal curve for households")
ggsave("Hydroinformatics/DigitalMetering/diurnal_curve.png", dpi = 300)

tdiff <- 11
diurnal <- c(diurnal[(tdiff + 1): 24], diurnal[1:tdiff])

This simulation only aims to simulate a realistic data set and not to present an accurate depiction of reality. This simulation could be enhanced by using different diurnal curves for various customer segments and to include outdoor watering, temperature dependencies and so on.

Simulating Water Consumption

A leak is defined by a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (θ = 0.1) models approximately one in ten properties with a leak. The size of the leak is derived from a random number between 10 and 50 litres per hour.

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defined by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal curve deviates from the model curve by ±10%. Any predetermined leakage is added to each meter read over the whole period of 100 days. The hourly volumes are summed cumulatively to simulate meter reads. The flow is divided by five as each meter revolution indicates five litres.

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.

# Leak simulation
leaks <- rbinom(n, 1, prob = .1) * sample(10:50, n, replace = TRUE)
data.frame(DevEUI = rtu, Leak = leaks) %>%
  subset(Leak > 0)

# Digital metering data simulation
meter_reads <- matrix(ncol = 3, nrow = 24 * n * d)
colnames(meter_reads) <- c("DevEUI", "TimeStampUTC", "Count")

for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  meter_reads[r, 1] <- rep(rtu[i], each = (24 * d))
  meter_reads[r, 2] <- seq.POSIXt(s, by = "hour", length.out = 24 * d) + offset[i]
  meter_reads[r, 3] <- round(cumsum((rep(diurnal * occupants[i], d) + leaks[i]) * 
                                     runif(24 * d, 0.9, 1.1))/5)
} 

meter_reads <- meter_reads %>% 
  as_data_frame() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz ="UTC"))

Missing Data Points

The data transmission process is not 100% reliable and the base station will not receive some reads. This simulation identifies reads to be removed from the data through the temporary variable remove. This simulation includes two types of failures:

  • Faulty RTUs (2% of RTUs, each missing 95% of their data)
  • Randomly missing data points (1% of data)

# Initialise temp variable
meter_reads <- mutate(meter_reads, remove = 0)
# Define faulty RTUs (2% of fleet)
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <- rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = .95)

# Data loss
missing <- sample(1:(nrow(meter_reads) - 5), 0.005 * nrow(meter_reads))
for (m in missing){
  meter_reads[m:(m + sample(1:5, 1)), "remove"] <- 1
}

# Remove data points
meter_reads <- filter(meter_reads, remove == 0) %>%
  select(-remove)

#Visualise
filter(meter_reads, DevEUI %in% rtu[2]) %>%
  mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC, 
                                           tz = "Australia/Melbourne"))) %>%
  filter(TimeStampAEST >= as.POSIXct("2020-02-06") & 
         TimeStampAEST <= as.POSIXct("2020-02-08")) %>%
  arrange(DevEUI, TimeStampAEST) %>% 
  ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI)))  + 
    geom_line() + geom_point() 

The graph shows an example of the cumulative reads and some missing data points.

Simulated water consumption


Analysing Digital Metering Data

Data simulation is a good way to develop your analysis algorithms before you have real data. I have also used this technique when I was waiting for survey results during my dissertation. When the data finally arrived, I simply had to plug it into the code and fine-tune it. R has great capabilities to simulate reality to help you understand the data.

In next week’s article, I will outline how I used R and the Tidyverse package to develop libraries to analyse digital metering data.

Writing academic articles using R Sweave and LaTeX

One of my favourite activities in R is using Markdown to create business reports. Most of my work I export to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays combining R Sweave and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

To illustrate the principles of using R Sweave and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

R Sweave and LaTeX using the APA journal template

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the respondents' current and ideal body shape, and the body shape of the other sex they find most attractive. The results confirm previous research which found that body dissatisfaction is significantly higher for females than for males. The research also found a mild positive correlation between age and ideal body shape for women, and between age and the female body shape that men find most attractive. You can read the full paper on my personal website.


Body shape measurement scale.

R Sweave and LaTeX

The R file for this essay uses the Sweave package to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX. This approach is used in most of the other chunks. The image is created within the document, saved as a PDF file and integrated back into the document as an image with an appropriate label and caption.

<<echo=FALSE, results=tex>>=
body <- read.csv("body_image.csv")
# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99), 
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal

library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents, caption = "Age profile of survey participants", 
                    label = "gender-age", digits = 0)
@
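
A figure chunk follows the same pattern. The sketch below (with a hypothetical file name and plot, not the exact chunk from the paper) saves a plot as a PDF and writes the LaTeX code to include it:

<<echo=FALSE, results=tex>>=
pdf("body_shapes.pdf", width = 6, height = 4)
boxplot(Current_Ideal ~ Gender, data = body, ylab = "Current minus ideal shape")
dev.off()
cat("\\begin{figure}[h]",
    "\\centering",
    "\\includegraphics{body_shapes}",
    "\\caption{Difference between current and ideal body shape.}",
    "\\label{fig:body-shapes}",
    "\\end{figure}", sep = "\n")
@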

Configuration

I created this file in R Studio, using the Sweave and knitr functionality. To knit the R Sweave file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.

Pandigital Products: Euler Problem 32

Euler Problem 32 returns to pandigital numbers, which are numbers that contain one of each digit. Like so many of the Euler Problems, these numbers serve no practical purpose whatsoever, other than some entertainment value. You can find all pandigital numbers in base-10 in the On-Line Encyclopedia of Integer Sequences (A050278).

The Numberphile video explains everything you ever wanted to know about pandigital numbers but were afraid to ask.

Euler Problem 32 Definition

We shall say that an n-digit number is pandigital if it makes use of all the digits 1 to n exactly once; for example, the 5-digit number, 15234, is 1 through 5 pandigital.

The product 7254 is unusual, as the identity, 39 × 186 = 7254, containing multiplicand, multiplier, and product is 1 through 9 pandigital.

Find the sum of all products whose multiplicand/multiplier/product identity can be written as a 1 through 9 pandigital.

HINT: Some products can be obtained in more than one way so be sure to only include it once in your sum.

Proposed Solution

The pandigital.9 function tests whether a vector of digits forms a 9-pandigital number. The pandigital.prod vector stores the products.

The only way to solve this problem is by brute force, trying all multiplications, but we can limit the solution space to a manageable size. The multiplicand, multiplier and product together need to have nine digits. For example, when the first number has two digits, the second number should have three digits so that the product has four digits, e.g. 39 × 186 = 7254. When the first number has only one digit, the second number needs to have four digits.

pandigital.9 <- function(x) # Test if string is 9-pandigital
    (length(x)==9 & sum(duplicated(x))==0 & sum(x==0)==0)

t <- proc.time()
pandigital.prod <- vector()
i <- 1
for (m in 2:100) {
    if (m < 10) n_start <- 1234 else n_start <- 123
    for (n in n_start:round(10000 / m)) {
        # List of digits
        digs <- as.numeric(unlist(strsplit(paste0(m, n, m * n), "")))
        # is Pandigital?
        if (pandigital.9(digs)) {
            pandigital.prod[i] <- m * n
            i <- i + 1
            print(paste(m, "*", n, "=", m * n))
        }
    }
}
answer <- sum(unique(pandigital.prod))
print(answer)

Numbers can also be checked for pandigitality using mathematics instead of strings.
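
One way to do this (a sketch using modulo arithmetic, not part of the original solution) checks each digit with the modulo operator and keeps a tally of the digits seen:

# Numeric 9-pandigital check without string conversion
pandigital_9_num <- function(x) {
  seen <- rep(FALSE, 9)
  while (x > 0) {
    d <- x %% 10
    if (d == 0 || seen[d]) return(FALSE)  # zero or repeated digit
    seen[d] <- TRUE
    x <- x %/% 10
  }
  all(seen)
}

pandigital_9_num(391867254)  # TRUE: digits 1 to 9 appear exactly once
pandigital_9_num(391867255)  # FALSE: the digit 5 appears twice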

You can view the most recent version of this code on GitHub.

Analysing soil moisture data in NetCDF format with the ncdf4 library

The netCDF format is popular in sciences that analyse sequential spatial data. It is a self-describing, machine-independent data format for creating, accessing and sharing array-oriented information. The netCDF format is well suited to spatial time series such as meteorological or environmental data. This article shows how to visualise and analyse this data format by reviewing soil moisture data published by the Australian Bureau of Meteorology.

Soil Moisture data

The Australian Bureau of Meteorology publishes hydrological data in both a simple map grid and in the NetCDF format. The map grid consists of a flat text file that requires a bit of data jujitsu before it can be used. The NetCDF format is much easier to deploy as it provides a three-dimensional matrix of spatial data over time.

We are looking at the possible relationship between sewer main blockages and deep soil moisture levels. You will need to manually download this dataset from the Bureau of Meteorology website. I have not been able to scrape the website automatically. For this analysis, I use the actual deep soil moisture level, aggregated monthly in NetCDF 4 format.

Soil moisture data from the Australian Bureau of Meteorology in netCDF format

Reading, Extracting and Transforming the netCDF format

The ncdf4 library, developed by David W. Pierce, provides the necessary functionality to manage this data. The first step is to load the data, extract the relevant information and transform the data for visualisation and analysis. When the data is read, it essentially forms a complex list that contains the metadata and the measurements.

The ncvar_get function extracts the data from the list. The lon, lat and dates variables are the dimensions of the moisture data. The time data is stored as the number of days since 1 January 1900. The spatial coordinates are stored in decimal degrees with 0.05-decimal degree intervals. The moisture data is a three-dimensional matrix with longitude, latitude and time as dimensions. Storing the data in this way makes it very easy to use.

library(ncdf4)
# Load data
bom <- nc_open("Hydroinformatics/SoilMoisture/sd_pct_Actual_month.nc")
print(bom) # Inspect the data

# Extract data
lon <- ncvar_get(bom, "longitude")
lat <- ncvar_get(bom, "latitude")
dates <- as.Date("1900-01-01") + ncvar_get(bom, "time")
moisture <- ncvar_get(bom, "sd_pct")
dimnames(moisture) <- list(lon, lat, dates)

Visualising the data

The first step is to check the overall data. This first code snippet extracts a matrix from the cube for 31 July 2017 and plots it. This code pipe extracts the date for the end of July 2017 and creates a data frame which is passed to ggplot for visualisation. Although I use the Tidyverse, I still need reshape2 because the gather function does not like matrices.

library(tidyverse)
library(RColorBrewer)
library(reshape2)

d <- "2017-07-31"
m <- moisture[, , which(dates == d)] %>%
       melt(varnames = c("lon", "lat")) %>%
       subset(!is.na(value))

ggplot(m, aes(x = lon, y = lat, fill = value)) + borders("world") + 
    geom_tile() + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
    subtitle = format(as.Date(d), "%d %B %Y")) + 
    xlim(range(lon)) + ylim(range(lat)) + coord_fixed()

Deep soil moisture: Source Bureau of Meteorology, Australia

With the ggmap package we can create a nice map of a local area.

library(ggmap)
loc <- round(geocode("Bendigo") / 0.05) * 0.05 
map_tile <- get_map(loc, zoom = 12, color = "bw") %>% 
    ggmap()

map_tile + 
    geom_tile(data = m, aes(x = lon, y = lat, fill = value), alpha = 0.8) + 
    scale_fill_gradientn(colors = brewer.pal(9, "Blues")) + 
    labs(title = "Total moisture in deep soil layer (100-500 cm)",
        subtitle = format(as.Date(d), "%d %B %Y"))

Analysing the data

For my analysis, I am interested in the time series of moisture data for a specific point on the map. The previous code slices the data horizontally over time. To create a time series we can pierce through the data for a specific coordinate. The purpose of this time series is to investigate the relationship between sewer main blockages and deep soil data, which can be a topic for a future post.

mt <- data.frame(date = dates, 
                 dp = moisture[as.character(loc$lon), as.character(loc$lat), ])
ggplot(mt, aes(x = date, y = dp)) + geom_line() + 
    labs(x = "Month",
         y = "Moisture",
         title = "Total moisture in deep soil layer (100-500 cm)",
         subtitle = paste(as.character(loc), collapse = ", "))

The latest version of this code is available on my GitHub repository.

Deep soil moisture time series.

Pacific Island Hopping using R and iGraph

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean, visiting as many island nations as possible. The Pacific ocean is a massive body of water between Asia and the Americas, which covers almost half the surface of the earth. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

Pacific Island hopping

The Pacific Ocean.

Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publicly available data set of international flights, so unfortunately I created a list manually (if you do know of such a data set, then please leave a comment).

My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
f <- "Geography/airports.csv"
if (file.exists(f)) {
    airports <- read.csv(f)
    } else airports <- data.frame(airport = NA, lat = NA, lon = NA)

# Lookup coordinates for new airports
all_airports <- unique(c(flights$From, flights$To))
new_airports <- all_airports[!(all_airports %in% airports$airport)]
if (length(new_airports) != 0) {
    coords <- geocode(new_airports)
    new_airports <- data.frame(airport = new_airports, coords)
    airports <- rbind(airports, new_airports)
    airports <- subset(airports, !is.na(airport))
    write.csv(airports, "Geography/airports.csv", row.names = FALSE)
}

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="From", by.y="airport")
flights <- merge(flights, airports, by.x="To", by.y="airport")

Create the map

To create a map, I modified the code to create flight maps I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antimeridian are positive, while longitudes east of it are negative.

The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

# Plot flight routes
worldmap <- borders("world2", colour="#efede1", fill="#efede1")
ggplot() + worldmap + 
    geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") + 
    geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), 
      col = "black", size = 2, segment.color = NA) + 
    geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, 
      yend = lat.y, col = Airline), size = .4, curvature = .2) + 
    theme(panel.background = element_rect(fill="white"), 
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()
          ) + 
    xlim(100, 300) + ylim(-40,40)

Pacific island hopping

Pacific Island Hopping

This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_path function can then be used to plan routes. If I would want to travel from Auckland to Saipan in the Northern Mariana Islands, I have to go through Port Vila, Honiara, Port Moresby, Chuuk, Guam and then to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)
g <- graph_from_edgelist(as.matrix(flights[,1:2]), directed = FALSE)
par(mar = rep(0, 4))
plot(g, layout = layout.fruchterman.reingold, vertex.size=0)
V(g)
shortest_paths(g, "Auckland", "Saipan")

View the latest version of this code on GitHub.

Pacific Flight network

Digit fifth powers: Euler Problem 30

Euler problem 30 is another number crunching problem that deals with numbers to the power of five. Two other Euler problems dealt with raising numbers to a power. The previous problem looked at permutations of powers and problem 16 asks for the sum of the digits of 2^{1000}.

Numberphile has a nice video about a trick to quickly calculate the fifth root of a number that makes you look like a mathematical wizard.

Euler Problem 30 Definition

Surprisingly there are only three numbers that can be written as the sum of fourth powers of their digits:

1634 = 1^4 + 6^4 + 3^4 + 4^4

8208 = 8^4 + 2^4 + 0^4 + 8^4

9474 = 9^4 + 4^4 + 7^4 + 4^4

As 1 = 1^4 is not a sum, it is not included.

The sum of these numbers is 1634 + 8208 + 9474 = 19316. Find the sum of all the numbers that can be written as the sum of fifth powers of their digits.

Proposed Solution

The problem asks for a brute-force solution, but we have a halting problem: how far do we need to go before we can be certain there are no more sums of fifth-power digits? The highest digit is 9 and 9^5 = 59049, which has five digits. A six-digit number can sum to at most 6 \times 9^5 = 354294, while a seven-digit number can never reach its own size because 7 \times 9^5 = 413343 has only six digits. This makes 6 \times 9^5 a safe endpoint for the loop. The loop itself cycles through the digits of each number and tests whether the sum of the fifth powers equals the number.

largest <- 6 * 9^5
answer <- 0
for (n in 2:largest) {
    power.sum <- 0
    i <- n
    while (i > 0) {
        d <- i %% 10
        i <- floor(i / 10)
        power.sum <- power.sum + d^5
    }
    if (power.sum == n) {
        print(n)
        answer <- answer + n
    }
}
print(answer)

View the most recent version of this code on GitHub.