The post The Sierpinski Triangle: Visualising infinity in R appeared first on The Devil is in the Data.

]]>The original Sierpinski triangle will eventually disappear into Cantor dust, a cloud of ever shrinking triangles of infinitesimal size. The triangle is self-similar, no matter how far you zoom in, the basic geometry remains the same.

A fascinating method to create a Sierpinski Triangle is a chaos game. This method uses random numbers and some simple arithmetic rules. Sierpinski Triangles can be created using the following six steps:

- Define three points in a plane to form a triangle.
- Randomly select any point on the plane.
- Randomly select any one of the three triangle points.
- Move half the distance from your current position to the selected vertex.
- Plot the current position.
- Repeat from step 3.

This fractal is an implementation of chaos theory as this random process attracts to a complex ordered geometry. The game only works with random numbers and when selecting random vertices of the triangle.

This code implements the six rules in R. The code first initializes the triangle, defines a random starting point and then runs a loop to place random dots. The R plot engine does not draw pixels but uses characters, which implies that the diagram is not as accurate as it could be but the general principle is clear. The x(11) and Sys.sleep() commands are used to plot during the for-loop.

# Sierpinsky Triangle # Initialise triangle p <- c(0, 500, 1000) q <- c(0, 1000, 0) x11() par(mar = rep(0, 4)) plot(p, q, col= "red", pch = 15, cex = 1, axes = FALSE) # Random starting point x <- sample(0:1000, 1) y <- sample(0:1000, 1) # Chaos game for (i in 1:10000) { Sys.sleep(.001) n <- sample(1:3, 1) x <- floor(x + (p[n] - x) / 2) y <- floor(y + (q[n] - y) / 2) points(x, y, pch = 15, cex = 0.5) }

This algorithm demonstrates how a seemingly chaotic process can result in order. Many other versions of chaos games exist, which I leave to the reader to play with. If you create your own versions then please share the code in the comment box below.

The post The Sierpinski Triangle: Visualising infinity in R appeared first on The Devil is in the Data.

]]>The post Euler Problem 22 : Names Scores appeared first on The Devil is in the Data.

]]>Euler problem 22 is another trivial one that takes us to the realm of ASCII codes. ASCII is a method to convert symbols into numbers, originally invented for telegraphs.

Back in the 8-bit days, ASCII art was a method to create images without using lots of memory. Each image consists of a collection of text characters that give the illusion of an image. Euler problem 22 is, unfortunately, a bit less poetic.

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938^{th} name in the list. So, COLIN would obtain a score of 938 × 53 = 49,714.

What is the total of all the name scores in the file?

This code reads and cleans the file and sorts the names alphabetically. The charToRaw function determines the numerical value of each character in each name. This output of this function is the hex ASCII code for each character. The letter A is number 65, so we subtract 64 from each value to get the sum total.

# ETL: reads the file and converts it to an ordered vector. names <- readLines("https://projecteuler.net/project/resources/p022_names.txt", warn = F) names <- unlist(strsplit(names, ",")) names <- gsub("[[:punct:]]", "", names) names <- sort(names) # Total Name scores answer <- 0 for (i in names) { value <- sum(sapply(unlist(strsplit(i, "")), function(x) as.numeric(charToRaw(x)) - 64)) value <- value * which(names==i) answer <- answer + value } print(answer)

We can have a bit more fun with this problem by comparing this list with the most popular baby names in 2016. The first section of the code extracts the list of popular names from the website. The rest of the code counts the number of matches between the lists.

# Most popular baby names library(rvest) url <- "https://www.babycenter.com/top-baby-names-2016.htm" babynames <- url %>% read_html() %>% html_nodes(xpath = '//*[@id="babyNameList"]/table') %>% html_table() babynames <- babynames[[1]] # Convert Project Euler list and test for matches proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2))) names <- proper(names) sum(babynames$GIRLS %in% names) sum(babynames$BOYS %in% names)

The post Euler Problem 22 : Names Scores appeared first on The Devil is in the Data.

]]>The post Euler Problem 21: Amicable Numbers appeared first on The Devil is in the Data.

]]>A related concept is a perfect number, which is a number that equals the sum of its proper divisors. Mathematicians have also defined sociable numbers and betrothed numbers which are similar to amicable numbers. But perhaps these are for another Euler problem.

Let be defined as the sum of proper divisors of n (numbers less than n which divide evenly into n). If and , where , then and are an amicable pair and each of and are called amicable numbers.

For example, the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110; therefore . The proper divisors of 284 are 1, 2, 4, 71 and 142; so, .

Evaluate the sum of all the amicable numbers under 10000.

The first part of the code provides for a function to list all proper divisors for a given integer *x*. The loop determines the divisors for the numbers 220 to 10,000, calculates their sum and then checks if these numbers are amicable. When the code finds an amicable number, the counter jumps to the sum of the divisors to check for the next one.

proper.divisors <- function(x) { divisors <- vector() d <- 1 for (i in 1:floor(sqrt(x))) { if (x %% i == 0) { divisors[d] <- i if (i != x/i) { d <- d + 1 divisors[d] <- x / i } d <- d + 1 } } return(divisors) } answer <- 0 n <- 220 while (n <= 10000) { div.sum <- sum(proper.divisors(n)) - n if (n == sum(proper.divisors(div.sum)) - div.sum & n != div.sum) { answer <- answer + n + div.sum print(paste0("(", n, ",", div.sum, ")")) n <- div.sum } n <- n + 1 } print(answer)

Amicable numbers were known to the Pythagoreans, who credited them with many mystical properties. Before we had access to computers, finding amicable numbers was a task that required a lot of patience. No algorithm can systematically generate all amicable numbers, and until 1946 only 390 pairs were known. Medieval Muslim mathematicians developed several formulas to create amicable numbers, but the only way to be complete is using brute force.

The post Euler Problem 21: Amicable Numbers appeared first on The Devil is in the Data.

]]>The post Euler Problem 20: Large Integer Factorials appeared first on The Devil is in the Data.

]]>The problem with computing factorials is that they are mostly very large numbers, beyond the generic capabilities of computers to process. This problem can be solved using a specialised R package and using only base-R code.

.

For example: .

The sum of the digits in the number is .

Find the sum of the digits in the number 100!

The factorial of the number 100 contains 158 digits, which is a lot more digits than a 64-bit operating system can accurately produce. Using the standard function: factorial(100) = 9.332622e+157. Without using a specialised algorithm, we cannot determine the sum of all digits. We need to deploy arbitrary-precision arithmetic to solve this problem.

Many computer languages, including R, have special libraries to deal with such large numbers. The GMP Multiple Precision Arithmetic package renders this problem almost trivial.

library(gmp) digits <- factorialZ(100) digits <- as.character(digits) answer <- sum(as.numeric(unlist(strsplit(digits, ""))))

The problem becomes more interesting when only using basic R code. I developed the big.add function to solve Euler Problem 13 through the addition of very large integers. We can extend this function to also calculate factorials. A factorial can be replaced by a series of additions, for example:

This can be mimicked in R using the Reduce function. This function reduces a vector to a single value by recursively calling a function. Reduce(“+”, rep(4, 5)) is the same as:

Using a loop, we can use the Reduce function to calculate a factorial, using only additions:

fact <- 1 x <- 100 for (i in 2:x) { fact <- Reduce("+", rep(fact, i)) } print(fact)

The *big.factorial* function below implements this idea by combining the *big.add* and *Reduce* functions to calculate large integer factorials. The function returns a value of 1 when the factorial of 0 or 1 is requested. This function does not calculate the Gamma-function for fractions. For all other values, it goes through a loop from 2 to the requested factorial. The temporary values are stored in the *bf* variable. The code loops through the factorials by using the result of the previous *Reduce* call into the current one.

big.factorial <- function(x) { x <- floor(x) bf <- 1 if (x > 1) { for (i in 2:x) { bf <- Reduce(big.add, rep(bf,i)) } } return (bf) } digits <- big.factorial(100) answer <- sum(as.numeric(unlist(strsplit(as.character(digits), "")))) print(answer)

This function is most certainly not as fast as the GMP package but it was fun to write and to learn about the mechanics behind arbitrary precision arithmetic at work.

If you like to know how factorials can be used to determine the number of ways a deck can be shuffled the watch this video.

The post Euler Problem 20: Large Integer Factorials appeared first on The Devil is in the Data.

]]>The post Euler Problem 19: Counting Sundays — When does the week start? appeared first on The Devil is in the Data.

]]>- 1 Jan 1900 was a Monday.
- Thirty days has September, April, June and November.
- All the rest have thirty-one,
- Saving February alone, Which has twenty-eight, rain or shine. And on leap years, twenty-nine.
- A leap year occurs on any year evenly divisible by 4, but not on a century unless it is divisible by 400.

How many Sundays fell on the first of the month during the twentieth century (1 Jan 1901 to 31 Dec 2000)?

The problem can be quickly solved with R base code and a tiny bit faster when using the lubridate package.

# Base R-code dates <- seq.Date(as.Date("1901/01/01"), as.Date("2000/12/31"), "days") days <- rep(1:7, length.out = length(dates)) answer <- sum(days[substr(dates, 9, 10) == "01"] == 1) print(answer) #Using Lubridate library(lubridate, quietly = TRUE) answer <- sum(wday(dates[substr(dates, 9, 10) == "01"]) == 1) print(answer)

To draw out this post a little bit further I wrote some code to solve the problem without using the calendar functions in R.

week.day <- 0 answer <- 0 for (y in 1901:2000) { for (m in 1:12) { max.day <- 31 if (m %in% c(4, 6, 9, 11)) max.day <- 30 # Leap years if (m == 2) { if (y %% 4 == 0 & y %% 100 != 0 | y %% 400 == 0) max.day <- 29 else max.day <- 28 } for (d in 1:max.day) { week.day <- week.day + 1 if (week.day == 8) week.day <- 1 if (week.day == 1 & d == 1) answer <- answer + 1 } } } print(answer)

The only aspect remotely interesting about this problem is the conversion from weekdays to numbers. In R, Monday is considered day one, which makes sense in the Christian context of Western culture. Saturday and Sunday are the weekend, the two last days of the week so they are day 6 and 7. According to international standard ISO 8601, Monday is the first day of the week. Although this is the international standard, several countries, including the United States and Canada, consider Sunday to be the first day of the week.

The international standard is biased towards Christianity. The Christian or Western world marks Sunday as their day of rest and worship. Muslims refer to Friday as their day of rest and prayer. The Jewish calendar counts Saturday—the Sabbath—as the day of rest and worship. This idea is also shared by Seventh-Day Adventists.

this example shows that data science is not only about data: it is always about people and how they interpret the world.

via chartsbin.com

The post Euler Problem 19: Counting Sundays — When does the week start? appeared first on The Devil is in the Data.

]]>The post Euler Problem 18 & 67: Maximum Path Sums appeared first on The Devil is in the Data.

]]>Euler Problem 18 and 67 are exactly the same besides that the data set in the second version is larger than in the first one. In this post, I kill two Eulers with one code.

These problems deal with binary trees, which is a data structure where each node has two children. A practical example of a binary tree is a pedigree chart, where each person or animal has two parents, four grandparents and so on.

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

3

7 4

2 4 6

8 5 9 3

That is, 3 + 7 + 4 + 9 = 23. Find the maximum total from top to bottom of the triangle below:

75

95 64

17 47 82

18 35 87 10

20 04 82 47 65

19 01 23 75 03 34

88 02 77 73 07 63 67

99 65 04 28 06 16 70 92

41 41 26 56 83 40 80 70 33

41 48 72 33 47 32 37 16 94 29

53 71 44 65 25 43 91 52 97 51 14

70 11 33 28 77 73 17 78 39 68 17 57

91 71 52 38 17 14 91 43 58 50 27 29 48

63 66 04 68 89 53 67 30 73 16 69 87 40 31

04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

As there are only 16,384 routes, it is possible to solve this problem by trying every route. However, Problem 67, is the same challenge with a triangle containing one-hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)

This problem seeks a maximum path sum in a binary tree. The brute force method, as indicated in the problem definition, is a very inefficient way to solve this problem. The video visualises the quest for the maximum path, which takes eleven minutes of hypnotic animation.

A more efficient method is to define the maximum path layer by layer, starting at the bottom. The maximum sum of 2+8 or 2+5 is 10, the maximum sum of 4+5 or 4+9 is 13 and the last maximum sum is 15. These numbers are now placed in the next row. This process cycles until only one number is left. This algorithm solves the sample triangle in four steps:

Step 1:

3

7 4

2 4 6

8 5 9 3

Step 2:

3

7 4

10 13 15

Step 3:

3

20 19

Step 4:

23

In the code below, the data is triangle matrix. The variables *rij* (row) and *kol* (column) drive the search for the maximum path. The triangle for Euler Problem 18 is manually created and the triangle for Euler Problem 67 is read from the website.

path.sum <- function(triangle) { for (rij in nrow(triangle):2) { for (kol in 1:(ncol(triangle)-1)) { triangle[rij - 1,kol] <- max(triangle[rij,kol:(kol + 1)]) + triangle[rij - 1, kol] } triangle[rij,] <- NA } return(max(triangle, na.rm = TRUE)) } # Euler Problem 18 triangle <- matrix(ncol = 15, nrow = 15) triangle[1,1] <- 75 triangle[2,1:2] <- c(95, 64) triangle[3,1:3] <- c(17, 47, 82) triangle[4,1:4] <- c(18, 35, 87, 10) triangle[5,1:5] <- c(20, 04, 82, 47, 65) triangle[6,1:6] <- c(19, 01, 23, 75, 03, 34) triangle[7,1:7] <- c(88, 02, 77, 73, 07, 63, 67) triangle[8,1:8] <- c(99, 65, 04, 28, 06, 16, 70, 92) triangle[9,1:9] <- c(41, 41, 26, 56, 83, 40, 80, 70, 33) triangle[10,1:10] <- c(41, 48, 72, 33, 47, 32, 37, 16, 94, 29) triangle[11,1:11] <- c(53, 71, 44, 65, 25, 43, 91, 52, 97, 51, 14) triangle[12,1:12] <- c(70, 11, 33, 28, 77, 73, 17, 78, 39, 68, 17, 57) triangle[13,1:13] <- c(91, 71, 52, 38, 17, 14, 91, 43, 58, 50, 27, 29, 48) triangle[14,1:14] <- c(63, 66, 04, 68, 89, 53, 67, 30, 73, 16, 69, 87, 40, 31) triangle[15,1:15] <- c(04, 62, 98, 27, 23, 09, 70, 98, 73, 93, 38, 53, 60, 04, 23) answer <- path.sum(triangle) print(answer)

The solution for problem number 67 is exactly the same. The data is read directly from the Project Euler website.

# Euler Problem 67 triangle.file <- read.delim("https://projecteuler.net/project/resources/p067_triangle.txt", stringsAsFactors = F, header = F) triangle.67 <- matrix(nrow = 100, ncol = 100) for (i in 1:100) { triangle.67[i,1:i] <- as.numeric(unlist(strsplit(triangle.file[i,], " "))) } answer <- path.sum(triangle.67) print(answer)

The post Euler Problem 18 & 67: Maximum Path Sums appeared first on The Devil is in the Data.

]]>The post R-Cade Games: Simulating the Legendary Game of Pong appeared first on The Devil is in the Data.

]]>I learnt how to write code from the 1983 book Dr. C. Wacko’s Miracle Guide to Designing and Programming your own Atari Computer Arcade Games. This book explains in a very clear and humorous way how to write computer games in Atari basic. I devoured this book and spent many hours developing silly games. This article is an ode to Dr Wacko, a computer geek’s midlife-crisis and an attempt to replicate the software I developed thirty years ago.

I showed in a previous post that R can be used for board games. The question is whether we create arcade games in R. My challenge is to recreate the look and feel of 1980s arcade games, or *R-Cade* games, using R? The code shown below simulates the legendary game of pong.

The code is based on the Wacko’s* Boing Program* in the above-mentioned book. The R code is fully commented and speaks for itself. Please note that the animation is very clunky when you run it in RStudio. Only the native R Terminal displays the animation correctly.

Perhaps somebody can help me perfect this little ditty. I love to know how to read real-time USB input to control the game, so we get a step closer to the first R-Cade game.

# Sound library library(beepr) # Game parameters skill <- 0.87 # Skill (0-1) score <- 0 high.score <- 0 # Define playing field par(mar = rep(1,4), bg = "black") plot.new() plot.window(xlim = c(0, 30), ylim = c(0, 30)) lines(c(1, 30, 30, 1), c(0, 0, 30, 30), type = "l", lwd = 5, col = "white") # Playing field boundaries (depend on cex) xmin <- 0.5 xmax <- 29.4 ymin <- 0.5 ymax <- 29.4 # initial position x <- sample(5:25, 1) y <- sample(5:25, 1) points(x, y, pch = 15, col = "white", cex = 2) # Paddle control psize <- 4 ypaddle <- y # Set direction dx <- runif(1, .5, 1) dy <- runif(1, .5, 1) # Game play while (x > xmin - 1) { sound <- 0 # Silence Sys.sleep(.05) # Pause screen. Reduce to increase speed points(x, y, pch = 15, col = "black", cex = 2) # Erase ball # Move ball x <- x + dx y <- y + dy # Collision detection if (x > xmax) { dx <- -dx * runif(1, .9, 1.1) # Bounce if (x > xmin) x <- xmax # Boundary sound <- 10 # Set sound } if (y < ymin | y > ymax) { if (y < ymin) y <- ymin if (y > ymax) y <- ymax dy <- -dy * runif(1, .9, 1.1) sound <- 10 } # Caught by paddle? if (x < xmin & (y > ypaddle - (psize / 2)) & y < ypaddle + (psize / 2)) { if (x < xmin) x <- xmin dx <- -dx * runif(1, .9, 1.1) sound <- 2 score <- score + 1 } # Draw ball points(x, y, pch = 15, col = "white", cex = 2) if (sound !=0) beep(sound) # Move paddle if (runif(1, 0, 1) < skill) ypaddle <- ypaddle + dy # Imperfect follow # Draw paddle # Erase back line lines(c(0, 0), c(0, 30), type = "l", lwd = 8, col = "black") # Keep paddle inside window if (ypaddle < (psize / 2)) ypaddle <- (psize / 2) if (ypaddle > 30 - (psize / 2)) ypaddle <- 30 - (psize / 2) # Draw paddle lines(c(0, 0), c(ypaddle - (psize / 2), ypaddle + (psize / 2)), type = "l", lwd = 8, col = "white") } beep(8) text(15,15, "GAME OVER", cex=5, col = "white") s <- ifelse(score == 1, "", "s") text(15,5, paste0(score, " Point", s), cex=3, col = "white")

The post R-Cade Games: Simulating the Legendary Game of Pong appeared first on The Devil is in the Data.

]]>The post Data Science from a Strategic Business Perspective appeared first on The Devil is in the Data.

]]>My presentation outlined the strategy that I developed and am implementing for my employer Coliban Water. This strategy is based on a common-sense approach that leverages our existing strengths. This strategy was also outlined in an article for the Source journal.

Water utilities are, pardon the pun, awash with data. For decades we have been using a rudimentary version of the Internet of Things called SCADA (Supervisory Control and Data Aquisition). This system controls our assets and provides operators and analysts with the needed data. All this data is used to control our systems and stored for future reference.

There is no such thing as ‘dark data’. All data has been used for its designated purpose when it was created. My job at Coliban Water is to create value from this information.

In this presentation, I explained how Coliban Water is creating more value from data by following a systematic strategy,

The post Data Science from a Strategic Business Perspective appeared first on The Devil is in the Data.

]]>The post Tic Tac Toe War Games: The Intelligent Minimax Algorithm appeared first on The Devil is in the Data.

]]>I love the 1983 classic movie War Games. In this film, a computer plays Tic Tac Toe against itself to learn that it cannot win the game to prevent a nuclear war.

Back in those days, I devoured the wonderful book *Writing Strategy Games on your Atari* by John White which contains an algorithm to play Tic Tac Toe War Games. This is my attempt to relive the eighties using R.

You can find the code on my GitHub page.

A previous post describes the function that draws the Tic Tac Toe board. For completeness, the code is replicated below. The game board is a vector of length nine consisting of either -1 (X), 0 (empty field) or 1 (O). The vector indices correspond with locations on the game board:

1 2 3

4 5 6

7 8 9

draw.board <- function(board) { # Draw the board xo <- c("X", " ", "O") # Symbols par(mar = rep(0,4)) plot.new() plot.window(xlim = c(0,30), ylim = c(0,30)) abline(h = c(10, 20), col="darkgrey", lwd = 4) abline(v = c(10, 20), col="darkgrey", lwd = 4) pieces <- xo[board + 2] text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), pieces, cex = 6) # Identify location of any three in a row square <- t(matrix(board, nrow = 3)) hor <- abs(rowSums(square)) if (any(hor == 3)) hor <- (4 - which(hor == 3)) * 10 - 5 else hor <- 0 ver <- abs(colSums(square)) if (any(ver == 3)) ver <- which(ver == 3) * 10 - 5 else ver <- 0 diag1 <- sum(diag(square)) diag2 <- sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (hor > 0) lines(c(0, 30), rep(hor, 2), lwd=10, col="red") if (ver > 0) lines(rep(ver, 2), c(0, 30), lwd=10, col="red") if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd=10, col="red") if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd=10, col="red") }

This second code snippet lets a human player move by clicking anywhere on the graphic display using the locator function. The click location is converted to a number to denote the position on the board. The entered field is only accepted if it has not yet been used (the *empty* variable contains the available fields).

# Human player enters a move move.human <- function(game) { text(4, 0, "Click on screen to move", col = "grey", cex=.7) empty <- which(game == 0) move <- 0 while (!move %in% empty) { coords <- locator(n = 1) # add lines coords$x <- floor(abs(coords$x) / 10) + 1 coords$y <- floor(abs(coords$y) / 10) + 1 move <- coords$x + 3 * (3 - coords$y) } return (move) }

This code snippet defines the *eval.game* function which assesses the current board and assigns a score. Zero means no outcome, -6 means that the *X* player has won and +6 implies that the *O* player has won.

# Evaluate board position eval.game <- function(game, player) { # Determine game score square <- t(matrix(game, nrow = 3)) hor <- rowSums(square) ver <- colSums(square) diag1 <- sum(diag(square)) diag2 <- sum(diag(t(apply(square, 2, rev)))) eval <- c(hor, ver, diag1, diag2) # Determine best score minimax <- ifelse(player == -1, "min", "max") best.score <- do.call(minimax, list(eval)) if (abs(best.score) == 3) best.score <- best.score * 2 return (best.score) }

The computer uses a modified Minimax Algorithm to determine its next move. This article from the Never Stop Building blog and the video below explain this method in great detail.

The next function determines the computer’s move. I have not used a brute-force minimax algorithm to save running time. I struggled building a fully recursive minimax function. Perhaps somebody can help me with this. This code looks only two steps deep and contains a strategic rule to maximise the score.

The first line stores the value of the players move, the second remainder of the matrix holds the evaluations of all the opponents moves. The code adds a randomised variable, based on the strategic value of a field. The centre has the highest value because it is part of four winning lines. Corners have three winning lines and the rest only two winning lines. This means that the computer will, all things being equal, favour the centre over the corners and favour the other fields least. The randomised variables in the code ensure that the computer does not always pick the same field in a similar situation.

# Determine computer move move.computer <- function(game, player) { empty <- which(game == 0) eval <- matrix(nrow = 10, ncol = 9, data = 0) for (i in empty) { game.tmp <- game game.tmp[i] <- player eval[1, i] <- eval.game(game.tmp, player) empty.tmp <- which(game.tmp ==0) for (j in empty.tmp) { game.tmp1 <- game.tmp game.tmp1[j] <- -player eval[(j + 1), i] <- eval.game(game.tmp1, -player) } } if (!any(abs(eval[1,]) == 6)) { # When winning, play move # Analyse opponent move minimax <- ifelse(player == -1, "max", "min") # Minimax best.opponent <- apply(eval[-1,], 1, minimax) eval[1,] <- eval[1,] * -player * best.opponent } # Add randomisation and strategic values board <- c(3, 2, 3, 2, 4, 2, 3, 2, 3) # Strategic values board <- sapply(board, function(x) runif(1, 0.1 * x, (0.1 * x) + 0.1)) # Randomise eval[1, empty] <- eval[1, empty] + player * board[empty] # Randomise moves # Pick best game minimax <- ifelse(player == -1, "which.min", "which.max") # Minimax move <- do.call(minimax, list(eval[1,])) # Select best move return(move) }

This last code snippet enables computers and humans play each other or themselves. The *players* vector contains the identity of the two players so that a human can play a computer or vice versa. The human player moves by clicking on the screen.

The loop keeps running until the board is full or a winner has been identified. A previous Tic Tac Toe post explains the *draw.board* function.

# Main game engine tic.tac.toe <- function(player1 = "human", player2 = "computer") { game <- rep(0, 9) # Empty board winner <- FALSE # Define winner player <- 1 # First player players <- c(player1, player2) draw.board(game) while (0 %in% game & !winner) { # Keep playing until win or full board if (players[(player + 3) %% 3] == "human") # Human player move <- move.human(game) else # Computer player move <- move.computer(game, player) game[move] <- player # Change board draw.board(game) winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner? player <- -player # Change player } }

You can play the computer by running all functions and then entering *tic.tac.toe()*.

I am pretty certain this simplified minimax algorithm is unbeatable—why don’t you try to win and let me know when you do.

Now that this problem is solved, I can finally recreate the epic scene from the WarGames movie. The Tic Tac Toe War Games code uses the functions explained above and the animation package. Unfortunately, there are not many opportunities to create sound in R.

# WAR GAMES TIC TAC TOE source("Tic Tac Toe/Tic Tac Toe.R") # Draw the game board draw.board.wargames <- function(game) { xo <- c("X", " ", "O") # Symbols par(mar = rep(1,4), bg = "#050811") plot.new() plot.window(xlim = c(0,30), ylim = c(0,30)) abline(h = c(10, 20), col = "#588fca", lwd = 20) abline(v = c(10, 20), col = "#588fca", lwd = 20) text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 20, col = "#588fca") text(1,0,"r.prevos.net", col = "#588fca", cex=2) # Identify location of any three in a row square <- t(matrix(game, nrow = 3)) hor <- abs(rowSums(square)) if (any(hor == 3)) hor <- (4 - which(hor == 3)) * 10 - 5 else hor <- 0 ver <- abs(colSums(square)) if (any(ver == 3)) ver <- which(ver == 3) * 10 - 5 else ver <- 0 diag1 <- sum(diag(square)) diag2 <- sum(diag(t(apply(square, 2, rev)))) # Draw winning lines if (all(hor > 0)) for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col="#588fca") if (all(ver > 0)) for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col="#588fca") if (abs(diag1) == 3) lines(c(2, 28), c(28, 2), lwd = 10, col = "#588fca") if (abs(diag2) == 3) lines(c(2, 28), c(2, 28), lwd = 10, col = "#588fca") } library(animation) player <- -1 games <- 100 saveGIF ({ for (i in 1:games) { game <- rep(0, 9) # Empty board winner <- 0 # Define winner #draw.board.wargames(game) while (0 %in% game & !winner) { # Keep playing until win or full board empty <- which(game == 0) move <- move.computer(game, player) game[move] <- player if (i <= 12) draw.board.wargames(game) winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 player <- -player } if (i > 12) draw.board.wargames(game) } }, interval = c(unlist(lapply(seq(1, 0,-.2), function (x) rep(x, 9))), rep(0,9*94)), movie.name = "wargames.gif", ani.width = 1024, ani.height = 1024)

The post Tic Tac Toe War Games: The Intelligent Minimax Algorithm appeared first on The Devil is in the Data.

]]>The post Data Pseudo-Science: Developing a Biorhythm Calculator appeared first on The Devil is in the Data.

]]>This post provides an example of data Pseudo-Science though a function that calculates and visualises your biorhythm. Based on the graph, I must be having a great day right now.

The broader and more pertinent message in this post is that data pseudo-science is more common than you would think. Is our belief in machine learning justified or are some of these models also a pseudo-science with not much more reliability than a biorhythm?

The idea that our physical states follow a predetermined rhythm has been around as long as mathematics. The basic concept of biorhythm is that a regular sinusoid cycle accurately describes our physical, emotional and intellectual states. Each of these three cycles has a different wavelength ():

- physical: days
- emotional: days
- intellectual: days

The cycle is calculated with , where indicates the number of days since birth. This idea was developed by German surgeon Wilhelm Fliess in the late 19^{th} century and was popularised in the United States in the late 1970s. There is no scientific evidence of the validity of this theory but it is an entertaining way to play with data.

The combination of the 23- and 28-day cycles repeats every 644 days, while the triple combination of 23-, 28-, and 33-day cycles repeats every 21,252 days, 58 years, two months and three weeks. You can, by the way, never reach a point where all cycles are maximised. The best you can achieve is 299.7 our of a maximum 300 which occurs when you are 17,003 days old.

When I was a teenager in the 1980s, several books and magazines described computer code to calculate your biorhythm. I used to play with these functions on my Atari 130XE computer.

Building a biorhythm calculator in R is easy. This function takes two dates as input and plots the biorhythm for the two weeks before and after the date. To calculate your biorhythm, run the function with your date of birth and target date: biorhythm(“yyyy-mm-dd”). The default version uses today as the target.

library(ggplot2) library(reshape2) biorhythm <- function(dob, target = Sys.Date()) { dob <- as.Date(dob) target <- as.Date(target) t <- round(as.numeric(difftime(target, dob))) days <- (t - 14) : (t + 14) period <- data.frame(Date = seq.Date(from = target - 15, by = 1, length.out = 29), Physical = sin (2 * pi * days / 23) * 100, Emotional = sin (2 * pi * days / 28) * 100, Intellectual = sin (2 * pi * days / 33) * 100) period <- melt(period, id.vars = "Date", variable.name = "Biorhythm", value.name = "Percentage") ggplot(period, aes(x = Date, y = Percentage, col = Biorhythm)) + geom_line() + ggtitle(paste("DoB:", format(dob, "%d %B %Y"))) + geom_vline(xintercept = as.numeric(target)) } biorhythm("1969-09-12", "2017-03-30")

Biorhythms are an early attempt for human beings to predict the future. Although there is no relationship between this algorithm and reality, many people believed in its efficacy. Does the same hold true for the hyped capabilities of machine learning?

Data pseudo-science is not only an issue when people use spurious mathematical relationships such as biorhythms or astrology. This post is also written as a warning not to only rely on numerical models to predict qualitative aspects of life.

The recent failures in predicting the results of elections, even days before the event, are a case in point. There are many reasons machine learning methods can go wrong. When machine learning algorithms fail, they are often just as useful as a biorhythm. It would be fun to write a predictive analysis package for R using only pseudoscientific approaches such as I-Ching, astrology or biorhythm.

The post Data Pseudo-Science: Developing a Biorhythm Calculator appeared first on The Devil is in the Data.

]]>