Using the iGraph package to Analyse the Enron Corpus

The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting for its sake, but in this post, I am more interested in one of the data sets that the subsequent investigation has provided.

This blog post analyses an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the iGraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump.

The Enron Corpus

As part of their inquiries, The Federal Energy Regulatory Commission used an extensive collection of emails from Enron employees. The Enron corpus is one of the only publicly available collections of emails available for research. This dataset also provides a fascinating playground for citizen data scientists.

The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content.

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001 - Enron corpus analysis

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001.

The Enron Corpus is a large database of half a million of emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423Mb, tarred and gzipped) and untars it to your working directory.

# Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz")
untar("enron_mail_20150507.tgz")

Preparing the Data

The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users.

# E-mail corpus consists of nested folders per user with e-mails as text files
# Create list of all available e-mails
emails <- list.files("maildir/", full.names = T, recursive = T)
length(emails)
# Filter by inbox only
emails <- emails[grep("/inbox", emails)]
length(emails)

The bulk of the code creates a list of emails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored, the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus. Only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each email. The data frame contains 2779 emails that meet the criteria.

# Create list of sender and receiver (inbox owner)
inboxes <- data.frame(
    from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}),
    to = emails,
    stringsAsFactors = F
)

# Keep only enron.com and strip all but username
library(stringr) # String manipulation
inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10)
to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3)

# Create list of usernames
users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA
for (i in 1:nrow(users)){
sentmail <- dir(paste0(users$user[i], "/sent_items/"))
name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3]
name <- str_sub(name, 7, nchar(name)-10)
users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, by.x="to", users, by.y="user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove no.address
inboxes <- subset(inboxes, from != "no.address" & to != "no.address")

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)

Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The iGraph package is a powerful tool to analyse networks. The graph_from_edgelist function creates a network object that can be analysed using the iGraph package. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spingglass algorithm finds community structure within the data. A community is a group of nodes that are more connected with each other than with any other node.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never email each other, lonely hangers-on and tightly knit cliques within Enron. In the centre of the graph we see w few individuals who are connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only once or twice emailed somebody in the corpus but were still included in the investigation.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g, 
     vertex.label=NA, 
     layout = layout.fruchterman.reingold,
     vertex.size = 3
)

View the most recent version of the code on GitHub.

This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!

Enron corpus network with communities.

 

Generating Quadratic Primes: Euler Problem 27

Prime numbers are the engine of the Internet economy. One of the reasons prime numbers are so useful is that they cannot be generated through an algorithm. This impossibility has not stopped mathematicians from trying to find formulas to generate prime numbers.

Euler problem 27 deals with quadratic formulas that can be used to generate sets of prime numbers. We have already discussed this in the post about the Ulam Spiral. This Numerphile video discusses quadratic primes.

Euler Problem 27 Definition

Euler discovered the remarkable quadratic formula:

n^2+n+41

It turns out that the formula will produce 40 primes for the consecutive integer values 0 \leq n \leq 39. However, when n=40, 40^2+40+41=40(40+1)+41 is divisible by 41 , and certainly when n=41, 41^2+41+41 is clearly divisible by 41.

The incredible formula n^2-79n+1601 was discovered, which produces 80 primes for the consecutive values 0 \leq n \leq 79 . The product of the coefficients, -79 and 1601, is -126479.

Considering quadratics of the form: n^2+an+bn^2+an+b ,

where |a| < 1000 and |b| \leq 1000 , where |n| is the modulus/absolute value of n , e.g. |11|=11 and |-4|=4.

Find the product of the coefficients, a and b , for the quadratic expression that produces the maximum number of primes for consecutive values of n, starting with n=0 .

Proposed Solution

The only way to solve this problem is through brute force and reduce the solution space to optimise it for speed (source: mathblog.dk). Because the outcome of the equation must be prime for n = 0, b also has to be prime. We can use the prime sieve from Euler Problem 3, which reduces it from 2000 to 168 options. When we insert n = 1 it follows that a has to be an even number. If 1+a+b has to be prime and b has to be a prime number, then a can only be an odd number. However, when b = 2, a has to be even.

Euler Problem 27 code

seq.a <- seq(-999, 1001, 2) # a has to be odd
seq.b <- (esieve(1000)) # b has to be prime
max.count <- 0
for (a in seq.a) {
    if (a == 2) 
        seq.a <- seq(-1000, 1000, 2) # a has to be even
    for (b in seq.b) {
        n <- 0 # Find sequence of primes for a and b
        while (is.prime(n^2 + a * n + b)) {
            n <- n + 1 } # Store maximum values if (n > max.count) {
            max.count <- n
            max.a <- a
            max.b <- b
        }
    }
}
answer <- max.a * max.b
print(answer)

View the latest version of this code on GitHub.

The Viral Recurring Decimal: Euler Problem 26

A few years ago a fraction broke the internet. What happens when you divide 1 by 998001?

\frac{1}{998001} = 0.000001002003004005006007008009010011012013014015 \ldots

What is special about this fraction is that it lists every three-decimal number except for 998. Look carefully at the sequence to see that is 000, 001, 0002, 003, 004, 005 and so on. After it has reached 999, the sequence continues from the start. This fraction thus has 2997 recurring decimals. James Grime from Numberphile explains this mathematical oddity with his usual enthusiasm.

The decimal fraction of 1/998001 is a recurring decimal. These are decimal numbers with periodic digits (repeating its values at regular intervals). Euler problem 26 asks us to analyse recurring decimals (reciprocal cycles).

Euler Problem 26 Definition

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

1/2 = 0.5
1/3 = 0.(3)
1/4 = 0.25
1/5 = 0.2
1/6 = 0.1(6)
1/7 = 0.(142857)
1/8 = 0.125
1/9 = 0.(1)
1/10 = 0.1

Where 0.1(6) means 0.166666…, and has a 1-digit recurring cycle. It can be seen that 1/7 has a 6-digit recurring cycle. Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

Solution

A051626 describes the length of the recurring numbers in 1/n in the On-Line Encyclopaedia of Integer Sequences. To solve Euler Problem 26, we need to generate the first 1000 numbers of this sequence and find out which number has the longest recurring cycle.

R can only display up to 22 decimals by using options(digits=22). The base R capability is unsuitable for solving this problem, so I wrote some code to perform long division the old-fashioned way.

The recur function divides 1 by any arbitrary integer. The code continues until the decimal terminates, for example 1/4 = 0.25, or when a recurring pattern emerges, e.g. 1/7 = 0.(142857).

The function has two arguments: n is the input number. The output argument determines the outcome of the function: “len” for the length of the recurring decimals. Any other value shows the result of the calculation. The output of the function is a string. Using the European notation, the recurring part of the decimals is shown between brackets, e.g. 1/14 = 0.0(714285).

recur <- function(x, output = "") {
    # Prepare variable
    if (x == 0) return(NaN)
    if (x == 1) return(0)
    x <- floor(abs(x))
    # Initiate vectors to store decimals and remainders
    dec <- vector()
    rem <- vector()
    # Initiate values
    i <- 1
    r <- 10
    rem <- r
    # Long division
    repeat {
        dec[i] <- floor(r / x)
        r <- 10 * (r %% x)
        # Test wether the number is terminating or repeating
        if (r == 0 | r %in% rem) break
        rem[i + 1] <- r
        i <- i + 1 
    }
    # Determine number of recurring digits
    rep <- ifelse(r != 0, length(rem) - which(r == rem) + 1, 0)
    # Output
    if (output == "len")
        return(rep)
    else {
        if (rep != 0) {
            if (rep == length(dec)) 
                l <- "("
            else
                l <- c(dec[1:(length(dec) - rep)], "(")
            dec <- c(l, dec[(length(dec) - rep + 1):length(dec)], ")")
        }
        return(paste0("0.", paste0(dec, collapse = "", sep = "")))
        }
}

A051626 <- sapply(1:1000, recur, "len")
answer <- which.max(A051626)
print(answer)

recur(998001)

You can view the latest version of this code on GitHub.

Large integers in R: Fibonacci number with 1000 digits, Euler Problem 25

The Fibonacci Sequence in nature: The nautilus shell. Euler problem 25

The Fibonacci Sequence occurs in nature: The nautilus shell.

Euler Problem 25 takes us back to the Fibonacci sequence and the problems related to working with very large integers.

The Fibonacci sequence follows a simple mathematical rule but it can create things of great beauty. This pattern occurs quite often in nature, like to nautilus shell shown in the image. The video by Arthur Benjamin at the end of this post illustrates some of the magic of this sequence.

Large Integers in R

By default, numbers with more than 7 digits are shown in scientific notation in R, which reduces the accuracy of the calculation. You can change the precision of large integers with the options function but R struggles with integers with more than 22 digits. This example illustrates this issue.

factorial(24)
[1] 6.204484e+23
> options(digits=22)
> factorial(24)
[1] 620448401733239409999872

However, finding a number of 1000 digits is a problem with using special functions.

Euler Problem 25 Definition

The Fibonacci sequence is defined by the recurrence relation:

F_n = F_{n-1} + F_{n-2} , where F_1 = 1 and F_2 = 1 . The 12th term, F_{12} , is the first term to contain three digits.

What is the index of the first term in the Fibonacci sequence to contain 1000 digits?

Proposed Solutions

The first solution uses the GMP library to manage very large integers. This library also contains a function to generate Fibonacci numbers. This solution cycles through the Fibonacci sequence until it finds a number with 1000 digits.

library(gmp) # GNU Multiple Precision Arithmetic Library
n <- 1
fib <- 1
while (nchar(as.character(fib)) < 1000) {
   fib <- fibnum(n) # Determine next Fibonacci number
   n <- n + 1
}
answer <- n
print(answer)

This is a very fast solution but my aim is to solve the first 100 Project Euler problems using only base-R code. The big.add function I developed to solve Euler Problem 13.

t <- proc.time()
fib <- 1 # First Fibonaci number
cur <- 1 # Current number in sequence
pre <- 1 # Previous number in sequence
index <- 2
while (nchar(fib) < 1000) {
    fib <- big.add(cur, pre) # Determine next Fibonacci number
    pre <- cur
    cur <- fib
    index <- index + 1
}
answer <- index
print(answer)

This code is much slower than the GMP library but it was fun to develop.

View the latest version of this code on GitHub.

The Magic of the Fibonacci Numbers

Lexicographic Permutations: Euler Problem 24

Euler Problem 24 asks to develop lexicographic permutations which are ordered arrangements of objects in lexicographic order. Tushar Roy of Coding Made Simple has shared a great introduction on how to generate lexicographic permutations.

Euler Problem 24 Definition

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are:

012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

Brute Force Solution

The digits 0 to 9 have 10! = 3628800 permutations (including combinations that start with 0). Most of these permutations are, however, not in lexicographic order. A brute-force way to solve the problem is to determine the next lexicographic permutation of a number string and repeat this one million times.

nextPerm <- function(a) {
    # Find longest non-increasing suffix
    i <- length(a) while (i > 1 && a[i - 1] >= a[i])
        i <- i - 1
    # i is the head index of the suffix
    # Are we at the last permutation?
    if (i <= 1) return (NA)
    # a[i - 1] is the pivot
    # Find rightmost element that exceeds the pivot
    j <- length(a)
    while (a[j] <= a[i - 1]) 
        j <- j - 1
    # Swap pivot with j
    temp <- a[i - 1]
    a[i - 1] <- a[j]
    a[j] <- temp
    # Reverse the suffix
    a[i:length(a)] <- rev(a[i:length(a)])
    return(a)
    }

numbers <- 0:9
for (i in 1:(1E6 - 1)) numbers <- nextPerm(numbers)
answer <- numbers
print(answer)

This code takes the following steps:

  1. Find largest index i such that a_{i-1} < a_i.
    1. If no such index exists, then this is already the last permutation.
  2. Find largest index j such that j \geq i and a_j > a_{i-1}.
  3. Swap a_j and a_{i-1}.
  4. Reverse the suffix starting at a_i.

Combinatorics

A more efficient solution is to use combinatorics, thanks to MathBlog. The last nine digits can be ordered in 9! = 362880 ways. So the first 9! permutations start with a 0. By extending this thought, it follows that the millionth permutation must start with a 2.

\lfloor (1000000 - 1) / 9! \rfloor  = 2

From this rule, it follows that the 725761st permutation is 2013456789. We now need 274239 more lexicographic permutations:

(1000000 - 1) - (2 \times 9!) = 274239

We can repeat this logic to find the next digit. The last 8 digits can be ordered in 40320 ways. The second digit is the 6th digit in the remaining numbers, which is 7 (2013456789).

\lfloor 274239 / 8! \rfloor  = 6

274239 - (6 \times 7!) = 32319

This process is repeated until all digits have been used.

numbers <- 0:9
n <- length(numbers)
answer <- vector(length = 10)
remain <- 1E6 - 1
for (i in 1:n) {
    j <- floor(remain / factorial(n - i))
    answer[i] <- numbers[j + 1]
    remain <- remain %% factorial(n - i)
    numbers <- numbers[-(j + 1)]
}
answer <- paste(answer, collapse = "")
print(answer)

View the latest version of this code on GitHub.

R blogger Tony’s Bubble Universe created a generalised function to solve this problem a few years ago.

Tic Tac Toe Part 3: The Minimax Algorithm

Tic Tac Toe minimax alogrithmIn two previous posts, I presented code to teach R to play the trivial game of Tic Tac Toe. I thought this was an unbeatable algorithm. Alas, a comment from Alberto shattered my pride as he was able to beat my code.

The reason for the demise of my code was that I didn’t implement a full minimax algorithm, but instead looked only two moves ahead. I thought the common strategy rules (start in the centre and occupy the corners) would make the program unbeatable. When I simulated the game by instructing the computer to play against itself, Alberto’s strategy never arose because the code forces the centre field. Alberto’s code shows that you need to look at least three moves ahead for a perfect game. He has been so kind to share his code and gave me permission to publish it.

Alberto recreated two functions, for completeness I have added the complete working code that merges his improvements with my earlier work. The first two functions are identical to the previous post. These functions draw the game board and process the human player’s move by waiting for a mouse click.

# Draw the game board
draw.board <- function(game) {
    xo <- c("X", " ", "O") # Symbols
    par(mar = rep(1,4))
    plot.new()
    plot.window(xlim = c(0,30), ylim = c(0,30))
    abline(h = c(10, 20), col="darkgrey", lwd = 4)
    abline(v = c(10, 20), col="darkgrey", lwd = 4)
    text(rep(c(5, 15, 25), 3), c(rep(25, 3), rep(15,3), rep(5, 3)), xo[game + 2], cex = 4)
    # Identify location of any three in a row
    square <- t(matrix(game, nrow = 3))
    hor <- abs(rowSums(square))
    if (any(hor == 3)) 
        hor <- (4 - which(hor == 3)) * 10 - 5 
    else 
        hor <- 0
    ver <- abs(colSums(square))
    if (any(ver == 3)) 
        ver <- which(ver == 3) * 10 - 5 
    else
        ver <- 0
    diag1 <- sum(diag(square))
    diag2 <- sum(diag(t(apply(square, 2, rev)))) 
    # Draw winning lines
    if (all(hor > 0)) 
        for (i in hor) lines(c(0, 30), rep(i, 2), lwd = 10, col="red")
    if (all(ver > 0)) 
        for (i in ver) lines(rep(i, 2), c(0, 30), lwd = 10, col="red")
    if (abs(diag1) == 3) 
        lines(c(2, 28), c(28, 2), lwd = 10, col = "red")
    if (abs(diag2) == 3) 
        lines(c(2, 28), c(2, 28), lwd = 10, col = "red")
}

# Human player enters a move
move.human <- function(game) {
    text(4, 0, "Click on screen to move", col = "grey", cex=.7)
    empty <- which(game == 0)
    move <- 0
    while (!move %in% empty) {
        coords <- locator(n = 1) # add lines
        coords$x <- floor(abs(coords$x) / 10) + 1
        coords$y <- floor(abs(coords$y) / 10) + 1
        move <- coords$x + 3 * (3 - coords$y)
    }
    return (move)
}

Alberto rewrote the functions that analyse the board and determine the move of the computer. The ganador (Spanish for winning) function assesses the board condition by assigning -10 or + 10 for a winning game and 0 for any other situation.

ganador <- function(juego, player) {
    game <- matrix(juego, nrow = 3, byrow = T)
    hor <- rowSums(game)
    ver <- colSums(game)
    diag <- c(sum(diag(game)), sum(diag(apply(game, 1, rev))))
    if (-3 %in% c(hor, ver, diag))
        return(-10)
    if (3 %in% c(hor, ver, diag))
        return(10)
    else
        return(0)
}

The next function is the actual minimax algorithm. If the computer starts then the first move (9!= 362880 options to assess) takes a little while. The commented lines can be used to force a corner and make the games faster by forcing a random corner.

The minimax function returns a list with the move and its valuation through the ganador function. The function works recursively until it has filled the board and retains the best scoring move using the minimax method. To avoid the computer always playing the same move in the same situation random variables are added.

minimax <- function(juego, player) {
    free <- which(juego == 0)
    if(length(free) == 1) {
        juego[free] <- player
        return(list(move = free, U = ganador(juego, player)))
    }
    poss.results <- rep(0, 9)
    for(i in free) {
        game <- juego
        game[i] <- player
        poss.results[i] <- ganador(game, player)
    }
    mm <- ifelse(player == -1, "which.min", "which.max")
    if(any(poss.results == (player * 10))) {
        move <- do.call(mm, list(poss.results))
        return(list(move = move, U = poss.results[move]))
    }
    for(i in free) {
        game <- juego
        game[i] <- player
        poss.results[i] <- minimax(game, -player)$U
    }
    random <- runif(9, 0, 0.1)
    poss.results[-free] <- 100 * -player
    poss.results <- poss.results + (player * random)
    move <- do.call(mm, list(poss.results))
    return(list(move = move, U = poss.results[move]))
}

This final function stitches everything together and lets you play the game. Simply paste all functions in your R console and run them to play a game. The tic.tac.toe function can take two parameters, “human” and/or “computer”. The order of the parameters determines who starts the game.

# Main game engine
tic.tac.toe <- function(player1 = "human", player2 = "computer") {
    game <- rep(0, 9) # Empty board
    winner <- FALSE # Define winner
    player <- 1 # First player
    players <- c(player1, player2)
    draw.board(game)
    while (0 %in% game & !winner) { # Keep playing until win or full board
        if (players[(player + 3) %% 3] == "human") # Human player
            move <- move.human(game)
        else { # Computer player
            move <- minimax(game, player)
            move <- move$move
            }
        game[move] <- player # Change board
        draw.board(game)
        winner <- max(eval.game(game, 1), abs(eval.game(game, -1))) == 6 # Winner, winner, chicken dinner?
        player <- -player # Change player
    }
}

tic.tac.toe()

This is my last word on Tic Tac Toe but now that the minimax conundrum is solved I could start working on other similar games such as Connect Four, Draughts or even the royal game of Chess.

View the latest version of this code on GitHub.

Euler Problem 23: Non-Abundant Sums

A demonstration of the abundance of the number 12 using Cuisenaire rods (Euler Problem 23)

A demonstration of the abundance of the number 12 using Cuisenaire rods (Wikipedia).

Euler problem 23 asks us to solve a problem with abundant or excessive numbers.

These are numbers for which the sum of its proper divisors is greater than the number itself.

12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).

All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.

Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Solution

This solution repurposes the divisors function that determines the proper divisors for a number, introduced for Euler Problem 21. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one where its aliquot sum is larger than n.

# Generate abundant numbers (OEIS A005101)
A005101 <- function(x){
    abundant <- vector()
    a <- 1
    for (n in 1:x) {
        aliquot.sum <- sum(proper.divisors(n)) - n 
        if (aliquot.sum > n) {
            abundant[a] <- n
            a <- a + 1
        }
    }
    return(abundant)
}

abundant <- A005101(28123)

The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). This page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.

The second section of code creates a list of all potential numbers not the sum of two abundant numbers. The next bit of code sieves any sum of two abundant numbers from the list. The answer is determined by adding remaining numbers in the sequence.

# Create a list of potential numbers that are not the sum of two abundant numbers
A048242 <- 1:20161

# Remove any number that is the sum of two abundant numbers
for (i in 1:length(abundant)) {
    for (j in i:length(abundant)) {
        if (abundant[i] + abundant[j] <= 20161) {
            A048242[abundant[i] + abundant[j]] <- NA
        }
    }
}
A048242 <- A048242[!is.na(A048242)]
answer <- sum(A048242)
print(answer)

View the latest version of this code on GitHub.

The Sierpinski Triangle: Visualising infinity in R

Sierpinski triangleWacław Sierpiński was a mathematical genius who developed several of the earliest fractals. The Sierpiński triangle is an easy to conceptualise geometrical figure but it hides a fascinating mathematical complexity. Start by drawing an equilateral triangle and draw another one in its centre. Then draw equilateral triangles in the four resulting triangles, and so on, ad infinitum.

The original Sierpinski triangle will eventually disappear into Cantor dust, a cloud of ever shrinking triangles of infinitesimal size. The triangle is self-similar, no matter how far you zoom in, the basic geometry remains the same.

The Chaos Game

A fascinating method to create a Sierpinski Triangle is a chaos game. This method uses random numbers and some simple arithmetic rules. Sierpinski Triangles can be created using the following six steps:

  1. Define three points in a plane to form a triangle.
  2. Randomly select any point on the plane.
  3. Randomly select any one of the three triangle points.
  4. Move half the distance from your current position to the selected vertex.
  5. Plot the current position.
  6. Repeat from step 3.

This fractal is an implementation of chaos theory as this random process attracts to a complex ordered geometry. The game only works with random numbers and when selecting random vertices of the triangle.

Sierpinski Triangle Code

This code implements the six rules in R. The code first initializes the triangle, defines a random starting point and then runs a loop to place random dots. The R plot engine does not draw pixels but uses characters, which implies that the diagram is not as accurate as it could be but the general principle is clear. The x(11) and Sys.sleep() commands are used to plot during the for-loop.

# Sierpinsky Triangle

# Initialise triangle
p <- c(0, 500, 1000)
q <- c(0, 1000, 0)
x11()
par(mar = rep(0, 4))
plot(p, q, col= "red", pch = 15, cex = 1, axes = FALSE)

# Random starting point
x <- sample(0:1000, 1)
y <- sample(0:1000, 1)

# Chaos game
for (i in 1:10000) {
    Sys.sleep(.001)
    n <- sample(1:3, 1)
    x <- floor(x + (p[n] - x) / 2)
    y <- floor(y + (q[n] - y) / 2)
    points(x, y, pch = 15, cex = 0.5)
}

This algorithm demonstrates how a seemingly chaotic process can result in order. Many other versions of chaos games exist, which I leave to the reader to play with. If you create your own versions then please share the code in the comment box below.

View the latest version of this code on GitHub.

Euler Problem 22 : Names Scores

Euler Problem 22

R logo in ASCII art by picascii.com

Euler problem 22 is another trivial one that takes us to the realm of ASCII codes. ASCII is a method to convert symbols into numbers, originally invented for telegraphs.

Back in the 8-bit days, ASCII art was a method to create images without using lots of memory. Each image consists of a collection of text characters that give the illusion of an image. Euler problem 22 is, unfortunately, a bit less poetic.

Euler Problem 22 Definition

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49,714.

What is the total of all the name scores in the file?

Solution

This code reads and cleans the file and sorts the names alphabetically. The charToRaw function determines the numerical value of each character in each name. This output of this function is the hex ASCII code for each character. The letter A is number 65, so we subtract 64 from each value to get the sum total.

# ETL: reads the file and converts it to an ordered vector.
names <- readLines("https://projecteuler.net/project/resources/p022_names.txt", warn = F)
names <- unlist(strsplit(names, ","))
names <- gsub("[[:punct:]]", "", names)
names <- sort(names)

# Total Name scores
answer <- 0
for (i in names) {
    value <- sum(sapply(unlist(strsplit(i, "")), function(x) as.numeric(charToRaw(x)) - 64))
    value <- value * which(names==i)
    answer <- answer + value
}
print(answer)

We can have a bit more fun with this problem by comparing this list with the most popular baby names in 2016. The first section of the code extracts the list of popular names from the website. The rest of the code counts the number of matches between the lists.

# Most popular baby names
library(rvest)
url <- "https://www.babycenter.com/top-baby-names-2016.htm"
babynames <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="babyNameList"]/table') %>%
    html_table()
babynames <- babynames[[1]]

# Convert Project Euler list and test for matches
proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
names <- proper(names)

sum(babynames$GIRLS %in% names)
sum(babynames$BOYS %in% names)

View the latest version of this code on GitHub.

Euler Problem 21: Amicable Numbers

Euler Problem 21: Amicable numbers key chainEuler problem 21 takes us to the realm of amicable numbers, which are listed in sequence A259180 in the OEIS. Amicable, or friendly, numbers are the most romantic numbers known to maths. Amicable numbers serve absolutely no practical purpose, other than mathematical entertainment.

A related concept is a perfect number, which is a number that equals the sum of its proper divisors. Mathematicians have also defined sociable numbers and betrothed numbers which are similar to amicable numbers. But perhaps these are for another Euler problem.

Euler Problem 21 Definition

Let d(n) be defined as the sum of proper divisors of n (numbers less than n which divide evenly into n). If d(a) = b and d(b) = a, where a \neq b, then a and b are an amicable pair and each of a and b are called amicable numbers.

For example, the proper divisors of 220 are 1, 2, 4, 5, 10, 11, 20, 22, 44, 55 and 110; therefore d(220) = 284. The proper divisors of 284 are 1, 2, 4, 71 and 142; so, d(284) = 220.

Evaluate the sum of all the amicable numbers under 10000.

Solution

The first part of the code provides for a function to list all proper divisors for a given integer x. The loop determines the divisors for the numbers 220 to 10,000, calculates their sum and then checks if these numbers are amicable. When the code finds an amicable number, the counter jumps to the sum of the divisors to check for the next one.

proper.divisors <- function(x) {
    divisors <- vector()
    d <- 1
    for (i in 1:floor(sqrt(x))) {
        if (x %% i == 0) {
            divisors[d] <- i
            if (i != x/i) {
                d <- d + 1
                divisors[d] <- x / i
            }
            d <- d + 1
        }
    }
    return(divisors)
}

answer <- 0
n <- 220
while (n <= 10000) {
    div.sum <- sum(proper.divisors(n)) - n
    if (n == sum(proper.divisors(div.sum)) - div.sum & n != div.sum) {
        answer <- answer + n + div.sum
        print(paste0("(", n, ",", div.sum, ")"))
        n <- div.sum
    }
    n <- n + 1
}
print(answer)

View the latest version of this code on GitHub.

Mystical Amicable Numbers

Amicable numbers were known to the Pythagoreans, who credited them with many mystical properties. Before we had access to computers, finding amicable numbers was a task that required a lot of patience. No algorithm can systematically generate all amicable numbers, and until 1946 only 390 pairs were known. Medieval Muslim mathematicians developed several formulas to create amicable numbers, but the only way to be complete is using brute force.