Pacific Island Hopping using R and iGraph

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean visiting as many island nations as possible. The Pacific ocean is a massive body of water between, Asia and the Americas, which covers almost half the surface of the earth. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

Pacific Island hopping

The Pacific Ocean.

Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publically available data set of international flights so unfortunately, I created a list manually (if you do know of such data set, then please leave a comment).

My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
f <- "Geography/airports.csv"
if (file.exists(f)) {
    airports <- read.csv(f)
    } else airports <- data.frame(airport = NA, lat = NA, lon = NA)

# Lookup coordinates for new airports
all_airports <- unique(c(flights$From, flights$To))
new_airports <- all_airports[!(all_airports %in% airports$airport)]
if (length(new_airports) != 0) {
    coords <- geocode(new_airports)
    new_airports <- data.frame(airport = new_airports, coords)
    airports <- rbind(airports, new_airports)
    airports <- subset(airports, !is.na(airport))
    write.csv(airports, "Geography/airports.csv", row.names = FALSE)
}

# Add coordinates to flight list
flights <- merge(flights, airports, by.x="From", by.y="airport")
flights <- merge(flights, airports, by.x="To", by.y="airport")

Create the map

To create a map, I modified the code to create flight maps I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antemeridian are positive, while longitudes east are negative.

The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

# Plot flight routes
worldmap <- borders("world2", colour="#efede1", fill="#efede1")
ggplot() + worldmap + 
    geom_point(data=airports, aes(x = lon, y = lat), col = "#970027") + 
    geom_text_repel(data=airports, aes(x = lon, y = lat, label = airport), 
      col = "black", size = 2, segment.color = NA) + 
    geom_curve(data=flights, aes(x = lon.x, y = lat.x, xend = lon.y, 
      yend = lat.y, col = Airline), size = .4, curvature = .2) + 
    theme(panel.background = element_rect(fill="white"), 
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank()
          ) + 
    xlim(100, 300) + ylim(-40,40)

Pacific island hopping

Pacific Island Hopping

This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_path function can then be used to plan routes. If I would want to travel from Auckland to Saipan in the Northern Mariana Islands, I have to go through Port Vila, Honiara, Port Moresby, Chuuk, Guam and then to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)
g <- graph_from_edgelist(as.matrix(flights[,1:2]), directed = FALSE)
par(mar = rep(0, 4))
plot(g, layout = layout.fruchterman.reingold, vertex.size=0)
V(g)
shortest_paths(g, "Auckland", "Saipan")

View the latest version of this code on GitHub.

Pacific Flight network

Using the iGraph package to Analyse the Enron Corpus

The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting for its sake, but in this post, I am more interested in one of the data sets that the subsequent investigation has provided.

This blog post analyses an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the iGraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump.

The Enron Corpus

As part of their inquiries, The Federal Energy Regulatory Commission used an extensive collection of emails from Enron employees. The Enron corpus is one of the only publicly available collections of emails available for research. This dataset also provides a fascinating playground for citizen data scientists.

The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content.

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001 - Enron corpus analysis

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001.

The Enron Corpus is a large database of half a million of emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423Mb, tarred and gzipped) and untars it to your working directory.

# Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz")
untar("enron_mail_20150507.tgz")

Preparing the Data

The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users.

# E-mail corpus consists of nested folders per user with e-mails as text files
# Create list of all available e-mails
emails <- list.files("maildir/", full.names = T, recursive = T)
length(emails)
# Filter by inbox only
emails <- emails[grep("/inbox", emails)]
length(emails)

The bulk of the code creates a list of emails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored, the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus. Only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each email. The data frame contains 2779 emails that meet the criteria.

# Create list of sender and receiver (inbox owner)
inboxes <- data.frame(
    from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}),
    to = emails,
    stringsAsFactors = F
)

# Keep only enron.com and strip all but username
library(stringr) # String manipulation
inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10)
to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3)

# Create list of usernames
users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA
for (i in 1:nrow(users)){
sentmail <- dir(paste0(users$user[i], "/sent_items/"))
name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3]
name <- str_sub(name, 7, nchar(name)-10)
users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, by.x="to", users, by.y="user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove no.address
inboxes <- subset(inboxes, from != "no.address" & to != "no.address")

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)

Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The iGraph package is a powerful tool to analyse networks. The graph_from_edgelist function creates a network object that can be analysed using the iGraph package. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spingglass algorithm finds community structure within the data. A community is a group of nodes that are more connected with each other than with any other node.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never email each other, lonely hangers-on and tightly knit cliques within Enron. In the centre of the graph we see w few individuals who are connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only once or twice emailed somebody in the corpus but were still included in the investigation.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g, 
     vertex.label=NA, 
     layout = layout.fruchterman.reingold,
     vertex.size = 3
)

View the most recent version of the code on GitHub.

This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!

Enron corpus network with communities.

 

Euler Problem 18 & 67: Maximum Path Sums

A pedigree is an example of a binary tree: Euler Problem 18

An example of a pedigree. Source: Wikimedia.

Euler Problem 18 and 67 are exactly the same besides that the data set in the second version is larger than in the first one. In this post, I kill two Eulers with one code.

These problems deal with binary trees, which is a data structure where each node has two children. A practical example of a binary tree is a pedigree chart, where each person or animal has two parents, four grandparents and so on.

Euler Problem 18 Definition

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.

3
7 4
2 4 6
8 5 9 3

That is, 3 + 7 + 4 + 9 = 23. Find the maximum total from top to bottom of the triangle below:

75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23

As there are only 16,384 routes, it is possible to solve this problem by trying every route. However, Problem 67, is the same challenge with a triangle containing one-hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)

Solution

This problem seeks a maximum path sum in a binary tree. The brute force method, as indicated in the problem definition, is a very inefficient way to solve this problem. The video visualises the quest for the maximum path, which takes eleven minutes of hypnotic animation.

A more efficient method is to define the maximum path layer by layer, starting at the bottom. The maximum sum of 2+8 or 2+5 is 10, the maximum sum of 4+5 or 4+9 is 13 and the last maximum sum is 15. These numbers are now placed in the next row. This process cycles until only one number is left. This algorithm solves the sample triangle in four steps:

Step 1:

3
7 4
2 4 6
8 5 9 3

Step 2:

3
7 4
10 13 15

Step 3:

3
20 19

Step 4:

23

In the code below, the data is triangle matrix. The variables rij (row) and kol (column) drive the search for the maximum path. The triangle for Euler Problem 18 is manually created and the triangle for Euler Problem 67 is read from the website.

path.sum <- function(triangle) {
    for (rij in nrow(triangle):2) {
        for (kol in 1:(ncol(triangle)-1)) {
            triangle[rij - 1,kol] <- max(triangle[rij,kol:(kol + 1)]) + triangle[rij - 1, kol]
        }
        triangle[rij,] <- NA
    }
    return(max(triangle, na.rm = TRUE))
}

# Euler Problem 18
triangle <- matrix(ncol = 15, nrow = 15)
triangle[1,1] <- 75
triangle[2,1:2] <- c(95, 64)
triangle[3,1:3] <- c(17, 47, 82)
triangle[4,1:4] <- c(18, 35, 87, 10)
triangle[5,1:5] <- c(20, 04, 82, 47, 65)
triangle[6,1:6] <- c(19, 01, 23, 75, 03, 34)
triangle[7,1:7] <- c(88, 02, 77, 73, 07, 63, 67)
triangle[8,1:8] <- c(99, 65, 04, 28, 06, 16, 70, 92)
triangle[9,1:9] <- c(41, 41, 26, 56, 83, 40, 80, 70, 33)
triangle[10,1:10] <- c(41, 48, 72, 33, 47, 32, 37, 16, 94, 29)
triangle[11,1:11] <- c(53, 71, 44, 65, 25, 43, 91, 52, 97, 51, 14)
triangle[12,1:12] <- c(70, 11, 33, 28, 77, 73, 17, 78, 39, 68, 17, 57)
triangle[13,1:13] <- c(91, 71, 52, 38, 17, 14, 91, 43, 58, 50, 27, 29, 48)
triangle[14,1:14] <- c(63, 66, 04, 68, 89, 53, 67, 30, 73, 16, 69, 87, 40, 31)
triangle[15,1:15] <- c(04, 62, 98, 27, 23, 09, 70, 98, 73, 93, 38, 53, 60, 04, 23)

answer <- path.sum(triangle)
print(answer)

Euler Problem 67

The solution for problem number 67 is exactly the same. The data is read directly from the Project Euler website.

# Euler Problem 67
triangle.file <- read.delim("https://projecteuler.net/project/resources/p067_triangle.txt", stringsAsFactors = F, header = F)
triangle.67 <- matrix(nrow = 100, ncol = 100)
for (i in 1:100) {
    triangle.67[i,1:i] <- as.numeric(unlist(strsplit(triangle.file[i,], " ")))
}
answer <- path.sum(triangle.67)
print(answer)

View the latest version of this code on GitHub.

Trumpworld Analysis : Ownership Relations in his Business Network

Trumpworld by BuzzfeedYou do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial.

One of the most discussed aspects of his reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with national and international politics.

Buzzfeed has mapped many of the relationships between businesses and people in what they have dubbed Trumpworld. They provided the data to enable citizens data science into the wheelings and dealings of Donald J. Trump. The raw data set consists of three subsets of connections between:

  • Organisations
  • People
  • People and organisations

Trumpworld Analysis

This post analyses the connections between organisations using the mighty igraph package in the R language. The code snippet below converts the data to a graph that can be analysed using social network analysis techniques. I have downloaded the table of the raw data file as CSV files. This data is subsetted to contain only ownership relationships.

# Read data
trumpworld.org <- read.csv("TrumpWorld DataOrg.csv")
trumpworld.org.ownership <- subset(trumpworld.org, Connection==&quot;Ownership&quot;)[,1:2]

# Create graph of ownerships
library(igraph)
org.ownership <- graph.edgelist(as.matrix(trumpworld.org.ownership))

# Analysis
nrow(trumpworld.org.ownership)
length(unique(c(trumpworld.org.ownership[,1], trumpworld.org.ownership[,2])))

# Plot Graph
par(mar=rep(0,4))
plot(org.ownership,
 layout=layout.fruchterman.reingold,
 vertex.label=NA,
 vertex.size=2,
 edge.arrow.size=.1
)

Trumpworld analysis - business ownership networkNetwork Analysis

This network contains 309 ownership relationships between 322 firms.

When we plot the data, we see that most relationships are between two firms. The plot is organised with the Fruchterman-Reingold algorithm to improve its clarity.

We can also see a large cluster in the centre. The names have been removed for clarity.

The Trumpland analysis continues with this conglomerate. The second code section excises this connected subnetwork so we can analyse it in more detail.

# Find most connected firm
which.max(degree(org.ownership))
# Create subnetworks
org.ownership.d <- decompose(org.ownership)
# Find largest subnetwork
largest <- which.max(sapply(org.ownership.d, diameter))
#Plot largest subnetwork
plot(org.ownership.d[[largest]],
 layout=layout.fruchterman.reingold,
 vertex.label.cex=.5,
 vertex.size=5,
 edge.arrow.size=.1
)

Digging Deeper

The node with the highest degree identifies the business with the most holdings. This analysis shows that DJT Holdings LLC owns 33 other organisations. These organisations own other organisations. We can now use the cluster function to investigate this subnetwork.

Trumpworld holdings

This Trumpworld analysis shows that the ownership network is clearly a star network. DJT Holdings LLC centrally controls all organisations. Perhaps this graph visualises the management style of the soon to be president Trump. Trump centrally controls his empire, which is typical for a family business.

Does this chart visualise Trump’s leadership style? Is the star network an expression of his lack of trust and thus desire to oversee everything directly?

View this code and associated files on GitHub.