# How Virtual Tags have transformed SCADA data analysis

Yesterday, I delivered the International Keynote at the Asset Data & Insights Conference in Auckland, New Zealand (the place where R was originally developed). My talk was about how to create value from SCADA data, using a method I developed with my colleagues called Virtual Tags. My talk started with my views on data science strategy, which I also presented to the R User Group in Melbourne. In this post, I like to explain what Virtual Tags are, and how they can be used to improve the value of SCADA data.

## SCADA Systems at Water Treatment Plants

Water treatment plants are mostly fully automated systems, using analysers and the SCADA system to communicate this data. For those of you not familiar with water treatment plants, this video below gives a cute summary of the process.

Water treatment plants need sensors to measure a broad range of parameters. These instruments record data 24 hours per day to control operations. When the process operates effectively, all values fall within a very narrow band. All these values are stored by the SCADA system for typically a year, after which they are destroyed to save storage space.

Water treatment plants measure turbidity (clarity of the water) to assess the effectiveness of filtration. The code snippet below simulates the measurements from a turbidity instrument at a water treatment plant over five hours. The code simulates measurements from a turbidity instrument at a water treatment plant over a period of five hours. Most water quality data has a log-normal distribution with a narrow standard deviation.

# Simulate measured data
set.seed(1234)
n <- 300
wtp <- data.frame(DateTime = seq.POSIXt(ISOdate(1910, 1, 1), length.out = n, by = 60),
WTP = rlnorm(n, log(.1), .01))
library(ggplot2)
p <- ggplot(wtp, aes(x = DateTime, y = WTP)) + geom_line(colour = "grey") +
ylim(0.09, 0.11) + ylab("Turbidity") + ggtitle("Turbidity simulation")
p


The data generated by the SCADA system is used to take operational decisions. The data is created and structured to make decisions in the present, not to solve problems in the future. SCADA Historian systems archive this information for future analysis. Historian systems only store new values when the new reading is more or less than a certain percentage than the previous one. This method saves storage space without sacrificing much accuracy.

For example, when an instrument reads 0.20 and the limit is set at 5%, new values are only recorded when they are below 0.19 or above 0.21. Any further values are stored when they deviate 5% from the new value, and so on. The code snippet below simulates this behaviour, based on the simulated turbidity readings generated earlier. This Historian only stores the data points marked in black.

# Historise data
threshold <- 0.03
h <- 1 # First historised point
# Starting conditions
wtp$historise <- FALSE wtp$historise[c(1, n)] <- TRUE
# Testing for delta <> threshold
for (i in 2:nrow(wtp)) {
delta <- wtp$WTP[i] / wtp$WTP[h] if (delta > (1 + threshold) | delta < (1 - threshold)) {
wtp$historise[i] <- TRUE h <- i } } historian <- subset(wtp, historise == TRUE) historian$Source <- "Historian"
p <- p + geom_point(data = historian, aes(x = DateTime, y = WTP)) + ggtitle("Historised data")
p


## Virtual Tags

This standard method to generate and store SCADA data works fine to operate systems but does not work so well when using the data for posthoc analysis. The data in Historian is an unequally-spaced time series which makes it harder to analyse the data. The Virtual Tag approach expands these unequal time series to an equally-spaced one, using constant interpolation.

The vt function undertakes the constant interpolation using the approx function. The functionvt is applied to all the DateTime values, using the historised data points. The red line shows how the value is constant until it jumps by more than 5%. This example demonstrates that we have a steady process with some minor spikes, which is the expected outcome of this simulation.

# Create Virtual Ttags
vt <- function(t) approx(historian$DateTime, historian$WTP, xout = t, method = "constant")
turbidity <- lapply(as.data.frame(wtp$DateTime), vt) wtp$VirtualTag <- turbidity[[1]]$y p + geom_line(data = wtp, aes(x = DateTime, y = VirtualTag), colour = "red") + ggtitle("Virtual Tags")  You can find the latest version of this code on GitHub. The next step in Virtual Tags is to combine the tags from different data points. For example, we are only interested in the turbidity readings when the filter was running. We can do this by combining this data with a valve status or flow in the filter. This approach might seem cumbersome but it simplifies analysing data from Historian. Virtual Tags enables several analytical processes that would otherwise be hard to. This system also adds context to the SCADA information by linking tags to each other and the processes they describe. If you are interested in more detail, then please download the technical manual for Virtual Tags and how they are implemented in SQL. ## The Presentation Anf finally, these are the slides of my keynote presentation. # Using the iGraph package to Analyse the Enron Corpus The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting for its sake, but in this post, I am more interested in one of the data sets that the subsequent investigation has provided. This blog post analyses an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the iGraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump. ## The Enron Corpus As part of their inquiries, The Federal Energy Regulatory Commission used an extensive collection of emails from Enron employees. The Enron corpus is one of the only publicly available collections of emails available for research. This dataset also provides a fascinating playground for citizen data scientists. The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content. Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001. The Enron Corpus is a large database of half a million of emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423Mb, tarred and gzipped) and untars it to your working directory. # Enron Email Dataset: https://www.cs.cmu.edu/~./enron/ download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz") untar("enron_mail_20150507.tgz")  ## Preparing the Data The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users. # E-mail corpus consists of nested folders per user with e-mails as text files # Create list of all available e-mails emails <- list.files("maildir/", full.names = T, recursive = T) length(emails) # Filter by inbox only emails <- emails[grep("/inbox", emails)] length(emails)  The bulk of the code creates a list of emails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored, the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus. Only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each email. The data frame contains 2779 emails that meet the criteria. # Create list of sender and receiver (inbox owner) inboxes <- data.frame( from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}), to = emails, stringsAsFactors = F ) # Keep only enron.com and strip all but username library(stringr) # String manipulation inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10) to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3) # Create list of usernames users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA for (i in 1:nrow(users)){ sentmail <- dir(paste0(users$user[i], "/sent_items/"))
name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3] name <- str_sub(name, 7, nchar(name)-10) users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, by.x="to", users, by.y="user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)


## Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The iGraph package is a powerful tool to analyse networks. The graph_from_edgelist function creates a network object that can be analysed using the iGraph package. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spingglass algorithm finds community structure within the data. A community is a group of nodes that are more connected with each other than with any other node.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never email each other, lonely hangers-on and tightly knit cliques within Enron. In the centre of the graph we see w few individuals who are connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only once or twice emailed somebody in the corpus but were still included in the investigation.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g,
vertex.label=NA,
layout = layout.fruchterman.reingold,
vertex.size = 3
)


This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!

# Tic Tac Toe Part 3: The Minimax Algorithm

In two previous posts, I presented code to teach R to play the trivial game of Tic Tac Toe. I thought this was an unbeatable algorithm. Alas, a comment from Alberto shattered my pride as he was able to beat my code.

The reason for the demise of my code was that I didn’t implement a full minimax algorithm, but instead looked only two moves ahead. I thought the common strategy rules (start in the centre and occupy the corners) would make the program unbeatable. When I simulated the game by instructing the computer to play against itself, Alberto’s strategy never arose because the code forces the centre field. Alberto’s code shows that you need to look at least three moves ahead for a perfect game. He has been so kind to share his code and gave me permission to publish it.

Alberto recreated two functions, for completeness I have added the complete working code that merges his improvements with my earlier work. The first two functions are identical to the previous post. These functions draw the game board and process the human player’s move by waiting for a mouse click.

# Draw the game board
draw.board <- function(game) {
xo <- c("X", " ", "O") # Symbols
par(mar = rep(1,4))
plot.new()
plot.window(xlim = c(0,30), ylim = c(0,30))
abline(h = c(10, 20), col="darkgrey", lwd = 4)
abline(v = c(10, 20), col="darkgrey", lwd = 4)
text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 4)
# Identify location of any three in a row
square <- t(matrix(game, nrow = 3))
hor <- abs(rowSums(square))
if (any(hor == 3))
hor <- (4 - which(hor == 3)) * 10 - 5
else
hor <- 0
ver <- abs(colSums(square))
if (any(ver == 3))
ver <- which(ver == 3) * 10 - 5
else
ver <- 0
diag1 <- sum(diag(square))
diag2 <- sum(diag(t(apply(square, 2, rev))))
# Draw winning lines
if (all(hor > 0))
for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col="red")
if (all(ver > 0))
for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col="red")
if (abs(diag1) == 3)
lines(c(2, 28), c(28, 2), lwd = 10, col = "red")
if (abs(diag2) == 3)
lines(c(2, 28), c(2, 28), lwd = 10, col = "red")
}

# Human player enters a move
move.human <- function(game) {
text(4, 0, "Click on screen to move", col = "grey", cex=.7)
empty <- which(game == 0)
move <- 0
while (!move %in% empty) {
coords <- locator(n = 1) # add lines
coords$x <- floor(abs(coords$x) / 10) + 1
coords$y <- floor(abs(coords$y) / 10) + 1
move <- coords$x + 3 * (3 - coords$y)
}
return (move)
}


Alberto rewrote the functions that analyse the board and determine the move of the computer. The ganador (Spanish for winning) function assesses the board condition by assigning -10 or + 10 for a winning game and 0 for any other situation.

ganador <- function(juego, player) {
game <- matrix(juego, nrow = 3, byrow = T)
hor <- rowSums(game)
ver <- colSums(game)
diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev))))
if (-3 %in% c(hor, ver, diag))
return(-10)
if (3 %in% c(hor, ver, diag))
return(10)
else
return(0)
}


The next function is the actual minimax algorithm. If the computer starts then the first move ($9!= 362880$ options to assess) takes a little while. The commented lines can be used to force a corner and make the games faster by forcing a random corner.

The minimax function returns a list with the move and its valuation through the ganador function. The function works recursively until it has filled the board and retains the best scoring move using the minimax method. To avoid the computer always playing the same move in the same situation random variables are added.

minimax <- function(juego, player) {
free <- which(juego == 0)
if(length(free) == 1) {
juego[free] <- player
return(list(move = free, U = ganador(juego, player)))
}
poss.results <- rep(0, 9)
for(i in free) {
game <- juego
game[i] <- player
}
mm <- ifelse(player == -1, "which.min", "which.max")
if(any(poss.results == (player * 10))) {
move <- do.call(mm, list(poss.results))
return(list(move = move, U = poss.results[move]))
}
for(i in free) {
game <- juego
game[i] <- player
poss.results[i] <- minimax(game, -player)$U } random <- runif(9, 0, 0.1) poss.results[-free] <- 100 * -player poss.results <- poss.results + (player * random) move <- do.call(mm, list(poss.results)) return(list(move = move, U = poss.results[move])) }  This final function stitches everything together and lets you play the game. Simply paste all functions in your R console and run them to play a game. The tic.tac.toe function can take two parameters, “human” and/or “computer”. The order of the parameters determines who starts the game. # Main game engine tic.tac.toe <- function(player1 = "human", player2 = "computer") { game <- rep(0, 9) # Empty board winner <- FALSE # Define winner player <- 1 # First player players <- c(player1, player2) draw.board(game) while (0 %in% game & !winner) { # Keep playing until win or full board if (players[(player + 3) %% 3] == "human") # Human player move <- move.human(game) else { # Computer player move <- minimax(game, player) move <- move$move
}
game[move] <- player # Change board
draw.board(game)
winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner?
player <- -player # Change player
}
}

tic.tac.toe()


This is my last word on Tic Tac Toe but now that the minimax conundrum is solved I could start working on other similar games such as Connect Four, Draughts or even the royal game of Chess.

# R-Cade Games: Simulating the Legendary Game of Pong

Pong is one of the earliest arcade games on the market, first released in 1972. From the day I first saw this miracle box, I wanted to know more about computers.

I learnt how to write code from the 1983 book Dr. C. Wacko’s Miracle Guide to Designing and Programming your own Atari Computer Arcade Games. This book explains in a very clear and humorous way how to write computer games in Atari basic. I devoured this book and spent many hours developing silly games. This article is an ode to Dr Wacko, a computer geek’s midlife-crisis and an attempt to replicate the software I developed thirty years ago.

I showed in a previous post that R can be used for board games. The question is whether we create arcade games in R. My challenge is to recreate the look and feel of 1980s arcade games, or R-Cade games, using R? The code shown below simulates the legendary game of pong.

## Playing Pong in R

The code is based on the Wacko’s Boing Program in the above-mentioned book. The R code is fully commented and speaks for itself. Please note that the animation is very clunky when you run it in RStudio. Only the native R Terminal displays the animation correctly.

Perhaps somebody can help me perfect this little ditty. I love to know how to read real-time USB input to control the game, so we get a step closer to the first R-Cade game.

## The Pong Code

# Sound library
library(beepr)

# Game parameters
skill <- 0.87 # Skill (0-1)
score <- 0
high.score <- 0

# Define playing field
par(mar = rep(1,4), bg = "black")
plot.new()
plot.window(xlim = c(0, 30), ylim = c(0, 30))
lines(c(1, 30, 30, 1), c(0, 0, 30, 30), type = "l", lwd = 5, col = "white")

# Playing field boundaries (depend on cex)
xmin <- 0.5
xmax <- 29.4
ymin <- 0.5
ymax <- 29.4

# initial position
x <- sample(5:25, 1)
y <- sample(5:25, 1)
points(x, y, pch = 15, col = "white", cex = 2)

psize <- 4

# Set direction
dx <- runif(1, .5, 1)
dy <- runif(1, .5, 1)

# Game play
while (x > xmin - 1) {
sound <- 0 # Silence
Sys.sleep(.05) # Pause screen. Reduce to increase speed
points(x, y, pch = 15, col = "black", cex = 2) # Erase ball
# Move ball
x <- x + dx
y <- y + dy
# Collision detection
if (x > xmax) {
dx <- -dx * runif(1, .9, 1.1) # Bounce
if (x > xmin) x <- xmax # Boundary
sound <- 10 # Set sound
}
if (y < ymin | y > ymax) {
if (y < ymin) y <- ymin
if (y > ymax) y <- ymax
dy <- -dy * runif(1, .9, 1.1)
sound <- 10
}
if (x < xmin & (y > ypaddle - (psize / 2)) & y < ypaddle + (psize / 2)) {
if (x < xmin) x <- xmin
dx <- -dx * runif(1, .9, 1.1)
sound <- 2
score <- score + 1
}
# Draw ball
points(x, y, pch = 15, col = "white", cex = 2)
if (sound !=0) beep(sound)
if (runif(1, 0, 1) < skill) ypaddle <- ypaddle + dy # Imperfect follow
# Erase back line
lines(c(0, 0), c(0, 30), type = "l", lwd = 8, col = "black")
if (ypaddle > 30 - (psize / 2)) ypaddle <- 30 - (psize / 2)
lines(c(0, 0), c(ypaddle - (psize / 2), ypaddle + (psize / 2)), type = "l", lwd = 8, col = "white")
}
beep(8)
text(15,15, "GAME OVER", cex=5, col = "white")
s <- ifelse(score == 1, "", "s")
text(15,5, paste0(score, " Point", s), cex=3, col = "white")


# Tic Tac Toe War Games: The Intelligent Minimax Algorithm

In a previous post, I shared how to build a randomised Tic Tac Toe simulation. The computer plays against itself playing at random positions. In this post, I will share how to teach the computer to play the game strategically.

I love the 1983 classic movie War Games. In this film, a computer plays Tic Tac Toe against itself to learn that it cannot win the game to prevent a nuclear war.

Back in those days, I devoured the wonderful book Writing Strategy Games on your Atari by John White which contains an algorithm to play Tic Tac Toe War Games. This is my attempt to relive the eighties using R.

You can find the code on my GitHub page.

## Drawing the Board

A previous post describes the function that draws the Tic Tac Toe board. For completeness, the code is replicated below. The game board is a vector of length nine consisting of either -1 (X), 0 (empty field) or 1 (O). The vector indices correspond with locations on the game board:

1 2 3
4 5 6
7 8 9

draw.board &lt;- function(board) { # Draw the board
xo &lt;- c(&quot;X&quot;, &quot; &quot;, &quot;O&quot;) # Symbols
par(mar = rep(0,4))
plot.new()
plot.window(xlim = c(0,30), ylim = c(0,30))
abline(h = c(10, 20), col=&quot;darkgrey&quot;, lwd = 4)
abline(v = c(10, 20), col=&quot;darkgrey&quot;, lwd = 4)
pieces &lt;- xo[board + 2]
text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), pieces, cex = 6)
# Identify location of any three in a row
square &lt;- t(matrix(board, nrow = 3))
hor &lt;- abs(rowSums(square))
if (any(hor == 3))
hor &lt;- (4 - which(hor == 3)) * 10 - 5
else
hor &lt;- 0
ver &lt;- abs(colSums(square))
if (any(ver == 3))
ver &lt;- which(ver == 3) * 10 - 5
else
ver &lt;- 0
diag1 &lt;- sum(diag(square))
diag2 &lt;- sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (hor &gt; 0) lines(c(0, 30), rep(hor, 2), lwd=10, col=&quot;red&quot;)
if (ver &gt; 0) lines(rep(ver, 2), c(0, 30), lwd=10, col=&quot;red&quot;)
if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd=10, col=&quot;red&quot;)
if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd=10, col=&quot;red&quot;)
}


## Human Players

This second code snippet lets a human player move by clicking anywhere on the graphic display using the locator function. The click location is converted to a number to denote the position on the board. The entered field is only accepted if it has not yet been used (the empty variable contains the available fields).

# Human player enters a move
move.human &lt;- function(game) {
text(4, 0, &quot;Click on screen to move&quot;, col = &quot;grey&quot;, cex=.7)
empty &lt;- which(game == 0)
move &lt;- 0
while (!move %in% empty) {
coords &lt;- locator(n = 1) # add lines
coords$x &lt;- floor(abs(coords$x) / 10) + 1
coords$y &lt;- floor(abs(coords$y) / 10) + 1
move &lt;- coords$x + 3 * (3 - coords$y)
}
return (move)
}


## Evaluate the Game

This code snippet defines the eval.game function which assesses the current board and assigns a score. Zero means no outcome, -6 means that the X player has won and +6 implies that the O player has won.

# Evaluate board position
eval.game &lt;- function(game, player) {
# Determine game score
square &lt;- t(matrix(game, nrow = 3))
hor &lt;- rowSums(square)
ver &lt;- colSums(square)
diag1 &lt;- sum(diag(square))
diag2 &lt;- sum(diag(t(apply(square, 2, rev))))
eval &lt;- c(hor, ver, diag1, diag2)
# Determine best score
minimax &lt;- ifelse(player == -1, &quot;min&quot;, &quot;max&quot;)
best.score &lt;- do.call(minimax, list(eval))
if (abs(best.score) == 3) best.score &lt;- best.score * 2
return (best.score)
}


## Computer Moves

The computer uses a modified Minimax Algorithm to determine its next move. This article from the Never Stop Building blog and the video below explain this method in great detail.

The next function determines the computer’s move. I have not used a brute-force minimax algorithm to save running time. I struggled building a fully recursive minimax function. Perhaps somebody can help me with this. This code looks only two steps deep and contains a strategic rule to maximise the score.

The first line stores the value of the players move, the second remainder of the matrix holds the evaluations of all the opponents moves. The code adds a randomised variable, based on the strategic value of a field. The centre has the highest value because it is part of four winning lines. Corners have three winning lines and the rest only two winning lines. This means that the computer will, all things being equal, favour the centre over the corners and favour the other fields least. The randomised variables in the code ensure that the computer does not always pick the same field in a similar situation.

# Determine computer move
move.computer &lt;- function(game, player) {
empty &lt;- which(game == 0)
eval &lt;- matrix(nrow = 10, ncol = 9, data = 0)
for (i in empty) {
game.tmp &lt;- game
game.tmp[i] &lt;- player
eval[1, i] &lt;- eval.game(game.tmp, player)
empty.tmp &lt;- which(game.tmp ==0)
for (j in empty.tmp) {
game.tmp1 &lt;- game.tmp
game.tmp1[j] &lt;- -player
eval[(j + 1), i] &lt;- eval.game(game.tmp1, -player)
}
}
if (!any(abs(eval[1,]) == 6)) { # When winning, play move
# Analyse opponent move
minimax &lt;- ifelse(player == -1, &quot;max&quot;, &quot;min&quot;) # Minimax
best.opponent &lt;- apply(eval[-1,], 1, minimax)
eval[1,] &lt;- eval[1,] * -player * best.opponent
}
# Add randomisation and strategic values
board &lt;- c(3, 2, 3, 2, 4, 2, 3, 2, 3) # Strategic values
board &lt;- sapply(board, function(x) runif(1, 0.1 * x, (0.1 * x) + 0.1)) # Randomise
eval[1, empty] &lt;- eval[1, empty] + player * board[empty] # Randomise moves
# Pick best game
minimax &lt;- ifelse(player == -1, &quot;which.min&quot;, &quot;which.max&quot;) # Minimax
move &lt;- do.call(minimax, list(eval[1,])) # Select best move
return(move)
}


This last code snippet enables computers and humans play each other or themselves. The players vector contains the identity of the two players so that a human can play a computer or vice versa. The human player moves by clicking on the screen.

The loop keeps running until the board is full or a winner has been identified. A previous Tic Tac Toe post explains the draw.board function.

# Main game engine
tic.tac.toe &lt;- function(player1 = &quot;human&quot;, player2 = &quot;computer&quot;) {
game &lt;- rep(0, 9) # Empty board
winner &lt;- FALSE # Define winner
player &lt;- 1 # First player
players &lt;- c(player1, player2)
draw.board(game)
while (0 %in% game &amp; !winner) { # Keep playing until win or full board
if (players[(player + 3) %% 3] == &quot;human&quot;) # Human player
move &lt;- move.human(game)
else # Computer player
move &lt;- move.computer(game, player)
game[move] &lt;- player # Change board
draw.board(game)
winner &lt;- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner?
player &lt;- -player # Change player
}
}


You can play the computer by running all functions and then entering tic.tac.toe().

I am pretty certain this simplified minimax algorithm is unbeatable—why don’t you try to win and let me know when you do.

## Tic Tac Toe War Games

Now that this problem is solved, I can finally recreate the epic scene from the WarGames movie. The Tic Tac Toe War Games code uses the functions explained above and the animation package. Unfortunately, there are not many opportunities to create sound in R.

# WAR GAMES TIC TAC TOE
source(&quot;Tic Tac Toe/Tic Tac Toe.R&quot;)

# Draw the game board
draw.board.wargames &lt;- function(game) {
xo &lt;- c(&quot;X&quot;, &quot; &quot;, &quot;O&quot;) # Symbols
par(mar = rep(1,4), bg = &quot;#050811&quot;)
plot.new()
plot.window(xlim = c(0,30), ylim = c(0,30))
abline(h = c(10, 20), col = &quot;#588fca&quot;, lwd = 20)
abline(v = c(10, 20), col = &quot;#588fca&quot;, lwd = 20)
text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 20, col = &quot;#588fca&quot;)
text(1,0,&quot;r.prevos.net&quot;, col = &quot;#588fca&quot;, cex=2)
# Identify location of any three in a row
square &lt;- t(matrix(game, nrow = 3))
hor &lt;- abs(rowSums(square))
if (any(hor == 3))
hor &lt;- (4 - which(hor == 3)) * 10 - 5
else
hor &lt;- 0
ver &lt;- abs(colSums(square))
if (any(ver == 3))
ver &lt;- which(ver == 3) * 10 - 5
else
ver &lt;- 0
diag1 &lt;- sum(diag(square))
diag2 &lt;- sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (all(hor &gt; 0)) for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col=&quot;#588fca&quot;)
if (all(ver &gt; 0)) for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col=&quot;#588fca&quot;)
if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = &quot;#588fca&quot;)
if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = &quot;#588fca&quot;)
}

library(animation)
player &lt;- -1
games &lt;- 100
saveGIF ({
for (i in 1:games) {
game &lt;- rep(0, 9) # Empty board
winner &lt;- 0 # Define winner
#draw.board.wargames(game)
while (0 %in% game &amp; !winner) { # Keep playing until win or full board
empty &lt;- which(game == 0)
move &lt;- move.computer(game, player)
game[move] &lt;- player
if (i &lt;= 12) draw.board.wargames(game)
winner &lt;- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6
player &lt;- -player } if (i &gt; 12) draw.board.wargames(game)
}
},
interval = c(unlist(lapply(seq(1, 0,-.2), function (x) rep(x, 9))), rep(0,9*94)),
movie.name = &quot;wargames.gif&quot;, ani.width = 1024, ani.height = 1024)


# Data Pseudo-Science: Developing a Biorhythm Calculator

Data science is a serious occupation. Just like any other science, however, it can also be used for spurious topics, such as calculating your biorhythm.

This post provides an example of data Pseudo-Science though a function that calculates and visualises your biorhythm. Based on the graph, I must be having a great day right now.

The broader and more pertinent message in this post is that data pseudo-science is more common than you would think. Is our belief in machine learning justified or are some of these models also a pseudo-science with not much more reliability than a biorhythm?

## Biorhythm Theory

The idea that our physical states follow a predetermined rhythm has been around as long as mathematics. The basic concept of biorhythm is that a regular sinusoid cycle accurately describes our physical, emotional and intellectual states. Each of these three cycles has a different wavelength ($w$):

• physical: $w = 23$ days
• emotional: $w = 28$ days
• intellectual: $w = 33$ days

The cycle is calculated with $\sin (2 \pi t / w)$, where $t$ indicates the number of days since birth. This idea was developed by German surgeon Wilhelm Fliess in the late 19th century and was popularised in the United States in the late 1970s. There is no scientific evidence of the validity of this theory but it is an entertaining way to play with data.

The combination of the 23- and 28-day cycles repeats every 644 days, while the triple combination of 23-, 28-, and 33-day cycles repeats every 21,252 days, 58 years, two months and three weeks. You can, by the way, never reach a point where all cycles are maximised. The best you can achieve is 299.7 our of a maximum 300 which occurs when you are 17,003 days old.

When I was a teenager in the 1980s, several books and magazines described computer code to calculate your biorhythm. I used to play with these functions on my Atari 130XE computer.

Building a biorhythm calculator in R is easy. This function takes two dates as input and plots the biorhythm for the two weeks before and after the date. To calculate your biorhythm, run the function with your date of birth and target date: biorhythm(“yyyy-mm-dd”). The default version uses today as the target.

library(ggplot2)
library(reshape2)
biorhythm <- function(dob, target = Sys.Date()) {
dob <- as.Date(dob)
target <- as.Date(target)
t <- round(as.numeric(difftime(target, dob)))
days <- (t - 14) : (t + 14)
period <- data.frame(Date = seq.Date(from = target - 15, by = 1, length.out = 29),
Physical = sin (2 * pi * days / 23) * 100,
Emotional = sin (2 * pi * days / 28) * 100,
Intellectual = sin (2 * pi * days / 33) * 100)
period <- melt(period, id.vars = "Date", variable.name = "Biorhythm", value.name = "Percentage")
ggplot(period, aes(x = Date, y = Percentage, col = Biorhythm)) + geom_line() +
ggtitle(paste("DoB:", format(dob, "%d %B %Y"))) +
geom_vline(xintercept = as.numeric(target))
}

biorhythm("1969-09-12", "2017-03-30")


Biorhythms are an early attempt for human beings to predict the future. Although there is no relationship between this algorithm and reality, many people believed in its efficacy. Does the same hold true for the hyped capabilities of machine learning?

## Data Pseudo-Science

Data pseudo-science is not only an issue when people use spurious mathematical relationships such as biorhythms or astrology. This post is also written as a warning not to only rely on numerical models to predict qualitative aspects of life.

The recent failures in predicting the results of elections, even days before the event, are a case in point. There are many reasons machine learning methods can go wrong. When machine learning algorithms fail, they are often just as useful as a biorhythm. It would be fun to write a predictive analysis package for R using only pseudoscientific approaches such as I-Ching, astrology or biorhythm.

# Tic Tac Toe Simulation — Random Moves

Tic Tac Toe might be a futile children’s game but it can also teach us about artificial intelligence. Tic Tac Toe, or Naughts and Crosses, is a zero-sum game with perfect information. Both players know exactly what the other did and when nobody makes a mistake, the game will always end in a draw.

Tic Tac Toe is a simple game but also the much more complex game of chess is a zero-sum game with perfect information.

In this two-part post, I will build an unbeatable Tic Tac Toe Simulation. This first part deals with the mechanics of the game. The second post will present an algorithm for a perfect game.

## Drawing the Board

This first code snippet draws the Tic Tac Toe simulation board. The variable xo holds the identity of the pieces and the vector board holds the current game. Player X is denoted with -1 and player O with +1. The first part of the function draws the board and the naughts and crosses. The second part of the code check for three in a row and draws the corresponding line.

draw.board <- function(board) { # Draw the board
xo <- c("X", " ", "O") # Symbols
par(mar = rep(0,4))
plot.new()
plot.window(xlim = c(0,30), ylim = c(0,30))
abline(h = c(10, 20), col="darkgrey", lwd = 4)
abline(v = c(10, 20), col="darkgrey", lwd = 4)
pieces <- xo[board + 2]
text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), pieces, cex = 6)
# Identify location of any three in a row
square <- t(matrix(board, nrow = 3))
hor <- abs(rowSums(square))
if (any(hor == 3))
hor <- (4 - which(hor == 3)) * 10 - 5
else
hor <- 0
ver <- abs(colSums(square))
if (any(ver == 3))
ver <- which(ver == 3) * 10 - 5
else
ver <- 0
diag1 <- sum(diag(square))
diag2 <- sum(diag(t(apply(square, 2, rev))))
# Draw winning lines
if (hor > 0) lines(c(0, 30), rep(hor, 2), lwd=10, col="red")
if (ver > 0) lines(rep(ver, 2), c(0, 30), lwd=10, col="red")
if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd=10, col="red")
if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd=10, col="red")
}


## Random Tic Tac Toe

The second part of the code generates ten random games and creates and animated GIF-file. The code adds random moves until one of the players wins (winner <> 0) or the board is full (no zeroes in the game vector). The eval.winner function checks for three in a row and declares a winner when found.

There are 255,168 possible legal games in Tic Tac Toe, 46,080 of which end in a draw. This implies that these randomised games result in a draw 18% of the time.

eval.winner <- function(board) { # Identify winner
square <- t(matrix(board, nrow = 3))
hor <- rowSums(square)
ver <- colSums(square)
diag1 <- sum(diag(square))
diag2 <- sum(diag(t(apply(square, 2, rev))))
if (3 %in% c(hor, ver, diag1, diag2)) return (1)
else
if (-3 %in% c(hor, ver, diag1, diag2)) return (2)
else
return(0)
}

# Random game
library(animation)
saveGIF ({
for (i in 1:10) {
game <- rep(0, 9) # Empty board
winner <- 0 # Define winner
player <- -1 # First player
draw.board(game)
while (0 %in% game & winner == 0) { # Keep playing until win or full board
empty <- which(game == 0) # Define empty squares
move <- empty[sample(length(empty), 1)] # Random move
game[move] <- player # Change board
draw.board(game)
winner <- eval.winner(game) # Evaulate game
player <- player * -1 # Change player
}
draw.board(game)
}
},
interval = 0.25, movie.name = "ttt.gif", ani.width = 600, ani.height = 600)


## Tic Tac Toe Simulation

In a future post, I will outline how to program the computer to play against itself, just like in the 1983 movie War Games.

# Create Air Travel Route Maps in ggplot: A Visual Travel Diary

I have been lucky to fly to a few countries around the world. Like any other bored traveller, I thumb through the airline magazines and look at the air travel route maps. These maps are beautifully stylised depictions of the world with gently curved lines between the many destinations. I always wanted such a map for my own travel adventures.

## Create Air Travel Route Maps using ggplot2

The first step was to create a list of all the places I have flown between at least once. Paging through my travel photos and diaries, I managed to create a pretty complete list. The structure of this document is simply a list of all routes (From, To) and every flight only gets counted once. The next step finds the spatial coordinates for each airport by searching Google Maps using the geocode function from the ggmap package. In some instances, I had to add the country name to avoid confusion between places.

# Read flight list
flights <- read.csv("flights.csv", stringsAsFactors = FALSE)

# Lookup coordinates
library(ggmap)
airports <- unique(c(flights$From, flights$To))
coords <- geocode(airports)
airports <- data.frame(airport=airports, coords)


We now we have a data frame of airports with their coordinates and can create air travel route maps. The data frames are merged so that we can create air travel route maps using the curve geom. The borders function of ggplot2 creates the map data. The ggrepel package helps to prevent overplotting of text.

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="To", by.y="airport")
flights <- merge(flights, airports, by.x="From", by.y="airport")

# Plot flight routes
library(ggplot2)
library(ggrepel)
worldmap <- borders("world", colour="#efede1", fill="#efede1") # create a layer of borders
ggplot() + worldmap +
geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y), col = "#b29e7d", size = 1, curvature = .2) +
geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") +
geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), col = "black", size = 2, segment.color = NA) +
theme(panel.background = element_rect(fill="white"),
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)


I also tried to use ggmap package to display the maps to get a satellite image background. This did not work because the curve geom struggles with the map projection methods used in ggmap. Another problem is that the flight from Auckland to Los Angeles is drawn the wrong way. I hope no flat-earthers will see this map because they might use it as proof that the world is flat.

You can view the recent version of the code and associated files in GitHub.

# Trumpworld Analysis : Ownership Relations in his Business Network

You do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial.

One of the most discussed aspects of his reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with national and international politics.

Buzzfeed has mapped many of the relationships between businesses and people in what they have dubbed Trumpworld. They provided the data to enable citizens data science into the wheelings and dealings of Donald J. Trump. The raw data set consists of three subsets of connections between:

• Organisations
• People
• People and organisations

## Trumpworld Analysis

This post analyses the connections between organisations using the mighty igraph package in the R language. The code snippet below converts the data to a graph that can be analysed using social network analysis techniques. I have downloaded the table of the raw data file as CSV files. This data is subsetted to contain only ownership relationships.

# Read data
trumpworld.org.ownership <- subset(trumpworld.org, Connection==&quot;Ownership&quot;)[,1:2]

# Create graph of ownerships
library(igraph)
org.ownership <- graph.edgelist(as.matrix(trumpworld.org.ownership))

# Analysis
nrow(trumpworld.org.ownership)
length(unique(c(trumpworld.org.ownership[,1], trumpworld.org.ownership[,2])))

# Plot Graph
par(mar=rep(0,4))
plot(org.ownership,
layout=layout.fruchterman.reingold,
vertex.label=NA,
vertex.size=2,
edge.arrow.size=.1
)


## Network Analysis

This network contains 309 ownership relationships between 322 firms.

When we plot the data, we see that most relationships are between two firms. The plot is organised with the Fruchterman-Reingold algorithm to improve its clarity.

We can also see a large cluster in the centre. The names have been removed for clarity.

The Trumpland analysis continues with this conglomerate. The second code section excises this connected subnetwork so we can analyse it in more detail.

# Find most connected firm
which.max(degree(org.ownership))
# Create subnetworks
org.ownership.d <- decompose(org.ownership)
# Find largest subnetwork
largest <- which.max(sapply(org.ownership.d, diameter))
#Plot largest subnetwork
plot(org.ownership.d[[largest]],
layout=layout.fruchterman.reingold,
vertex.label.cex=.5,
vertex.size=5,
edge.arrow.size=.1
)


## Digging Deeper

The node with the highest degree identifies the business with the most holdings. This analysis shows that DJT Holdings LLC owns 33 other organisations. These organisations own other organisations. We can now use the cluster function to investigate this subnetwork.

This Trumpworld analysis shows that the ownership network is clearly a star network. DJT Holdings LLC centrally controls all organisations. Perhaps this graph visualises the management style of the soon to be president Trump. Trump centrally controls his empire, which is typical for a family business.

Does this chart visualise Trump’s leadership style? Is the star network an expression of his lack of trust and thus desire to oversee everything directly?

View this code and associated files on GitHub.

# Finding antipodes using the globe and ggmap packages

The antipode of each point on the Earth’s surface—the points where the blue and yellow overlap, are the land antipodes.

When I was a kid, I was was fascinated by the conundrum of what happens when you drill a hole straight through the centre of the earth. I always believed that I would turn up in Australia. But is this really the case?

The antipodes of any place on earth is the place that is diametrically opposite to it. A pair of antipodes are connected by a straight line running through the centre of the Earth. These points are as far away from each other as is possible on this planet. Two people are antipodes when they live on opposite sides of the globe. Their feet (πούς/pous in Ancient Greek) are directly opposite each other.

How can we use coding in R to solve this conundrum?

## Using R to find antipodes

We can roughly recreate the antipodean map on Wikipedia with the globe package. This package, written by Adrian Baddeley, plots 2D and 3D views of the earth. The package contains a data file with major coastlines that can be used to create a flipped map of the world.

The package contains a data file with major coastlines that can be used to create a flipped map of the world. To turn a spatial location into its antipode you subtract 180 degrees from the longitude and reverse the sign of the latitude, shown below.

library(globe)
# Create antipodean map
flip <- earth
flip$coords[,1] <- flip$coords[,1]-180
flip$coords[,2] <- -flip$coords[,2]
par(mar=rep(0,4)) # Remove plotting margins
globeearth(eye=c(90,0), col="blue")
globepoints(loc=flip$coords, eye=c(90,0), col="red", pch=".")  We can also use the ggmap package to visualise antipodes. This package, developed by David Kahle antipodean R-guru Hadley Wickham, has a neat geocoding function to obtain a spatial location. The antipode function takes the description of a location and a zoom level to plot a dot on the antipode location. The gridExtra package is used to created a faceted map, which is not otherwise possible in ggmap. library(ggmap) library(gridExtra) antipode <- function(location, zm=6) { # Map location lonlat <- geocode(location) loc1 <- get_map(lonlat, zoom=zm) map1 <- ggmap(loc1) + geom_point(data=lonlat, aes(lon, lat, col="red", size=10)) + theme(legend.position="none") # Define antipode lonlat$lon <- lonlat$lon-180 if (lonlat$lon < -180) lonlat$lon <- 360 + lonlat$lon
lonlat$lat <- -lonlat$lat
loc2 <- get_map(lonlat, zoom=zm)
map2 <- ggmap(loc2) + geom_point(data=lonlat, aes(lon, lat, col="red", size=10)) +
theme(legend.position="none")
grid.arrange(map1, map2, nrow=1)
}

antipode("Rector Nelissenstraat 47 Hoensbroek", 4)


This code solves the problem I was thinking about as a child. Running the code shows that the antipodes location of the home I grew up in is not in Australia, but quite a way south of New Zealand. Another childhood fantasy shattered …

You can also view this code on GitHub.

Antipodes using ggmap and gridExtra.