Euler Problem 14: Longest Collatz Sequence

Euler Problem 14 looks at the Collatz Conjecture. These playful sequences, named after German mathematician Lothar Collatz (1910–1990), cause mathematicians a lot of headaches. This video introduces the problem much better than I can describe it.

Euler Problem 14 Definition

The following iterative sequence is defined for the set of positive integers:

n \rightarrow n/2 (n is even)
n \rightarrow 3n + 1 (n is odd)

Using the rule above and starting with 13, we generate the following sequence:

13 \rightarrow 40 \rightarrow 20 \rightarrow 10 \rightarrow 5 \rightarrow 16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1

This sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1. Which starting number, under one million, produces the longest chain? Note: Once the chain starts the terms are allowed to go above one million.

Solution

This problem is computationally intensive and highlights R's lack of raw speed. Generating one million Collatz sequences and finding the longest one requires much more than the one minute of processing time that Project Euler allows.

# Generate the Collatz chain for starting number n
collatz.chain <- function(n) {
    chain <- vector()
    i <- 1
    while (n != 1) {
        if (n %% 2 == 0)
            n <- n / 2
        else
            n <- 3 * n + 1
        chain[i] <- n
        i <- i + 1
    }
    return(chain)
}
answer <- 0
collatz.max <- 0
for (n in 1:1E6) {
    collatz.length <- length(collatz.chain(n)) 
    if (collatz.length > collatz.max) {
        answer <- n
        collatz.max <- collatz.length
    }
}
print(answer)
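Note that the chain vector only stores the terms after the starting number, so the chain for 13 contains nine terms rather than the ten quoted in the problem definition. Since every chain is short by the same single term, this does not change which starting number produces the longest sequence:

> collatz.chain(13)
[1] 40 20 10  5 16  8  4  2  1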

The second version of the code contains some optimisations. The code stores the length of every sequence in a vector. When the code generates a sequence and lands on a number that has already been analysed, it adds the stored length for that number to the current count and moves on. This approach requires more memory but saves a lot of computation time. A minor tweak optimises the rule for odd numbers: tripling an odd number and adding one always results in an even number, so we can combine that step with the subsequent halving. This solution is more than twice as fast as the first version.

collatz.length <- vector(length = 1e6)
collatz.length[1] <- 0
for (n in 2:1e6) {
    x <- n
    count <- 0
    while (x != 1 & x >= n) {
        if (x %% 2 == 0) {
            x <- x / 2
            count <- count + 1
        } else {
            x <- (3 * x + 1) / 2
            count <- count + 2
        }
    }
    count <- count + collatz.length[x]
    collatz.length[n] <- count
}
answer <- which.max(collatz.length)
print(answer)
answer <- which.max(collatz.length)
print(answer)

Visualising Collatz Sequences

The Collatz sequence is an example of a simple mathematical rule that can create an unpredictable pattern. The number of steps required to reach 1 is listed in A006577 of the Online Encyclopedia of Integer Sequences.

The image below visualises the number of steps for the first 1000 positive numbers. The scatterplot shows some interesting patterns. Does this visualisation show that the Collatz Sequence does have a pattern after all?

Euler Problem 14: Number of halving and tripling steps to reach 1 in the Collatz problem.
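The plot can be reproduced with a minimal sketch along these lines, assuming the collatz.length vector computed in the second solution above:

# Plot the number of Collatz steps for the first 1000 starting numbers
library(ggplot2)
steps <- data.frame(n = 1:1000, steps = collatz.length[1:1000])
ggplot(steps, aes(n, steps)) +
    geom_point(size = 1, colour = "dodgerblue") +
    labs(x = "Starting number", y = "Steps to reach 1")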

Collatz Chains

Collatz sequences can also be visualised using networks. Each step between two numbers is an edge and the numbers are the vertices. For example, the network for the Collatz sequence of the number 10 is 10–5, 5–16, 16–8, 8–4, 4–2, 2–1. When generating subsequent sequences, the networks start to overlap and a tree of sequences appears. The tree below combines the Collatz sequences for the numbers 2 to 26. Number 27 has a very long sequence, which would make the tree much harder to read.

Network of Collatz sequences, n = 2–26

edgelist <- data.frame(a = 2, b = 1)
for (n in 3:26) {
   chain <- as.character(c(n, collatz.chain(n)))
   chain <- data.frame(a = chain[-length(chain)], b = chain[-1])
   edgelist <- rbind(edgelist, chain)
}
library(igraph)
g <- graph.edgelist(as.matrix(edgelist))
g <- simplify(g)
par(mar=rep(0,4))
V(g)$color <- degree(g, mode = "out") + 1
plot(g, 
     layout=layout.kamada.kawai,
     vertex.color=V(g)$color, 
     vertex.size=6,
     vertex.label.cex=.7,
     vertex.label.color="black",
     edge.arrow.size=.1,
     edge.color="black"
     )


Euler Problem 13: Large Sum of 1000 Digits

Euler Problem 13 asks us to add one hundred numbers of fifty digits each. This seems like a simple problem, were it not that most computers are not designed to deal with numbers that have this many digits. For example:

 2^{64} = 18446744073709551616

When we ask R to compute this value, we get 1.844674e+19, losing most of the digits and limiting the accuracy of the result. Computers solve this problem using arbitrary-precision arithmetic. Many software libraries can process long integers without losing accuracy. Euler Problem 13 requires this type of approach.
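A minimal demonstration of the problem, assuming the gmp package discussed below is installed:

> 2^64
[1] 1.844674e+19
> 2^64 + 1 == 2^64  # double precision cannot distinguish these numbers
[1] TRUE
> library(gmp)
> as.bigz(2)^64 + 1 == as.bigz(2)^64  # arbitrary precision is exact
[1] FALSE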

Euler Problem 13 Definition

Work out the first ten digits of the sum of the following one-hundred 50-digit numbers.

Solution

The easy way to solve this problem is to use the gmp package for working with very large integers. This package provides special number types, such as Big Rational and Big Integer. The number of digits in these number types is only limited by the size of the memory.

library(gmp)
numbers <- readLines("Euler/p013_numbers.txt")
digits <- sum(as.bigz(numbers))
answer <- substr(as.character(digits),1,10)

Using Base-R

To find the solution to this problem using only base R, I wrote a function that adds numbers using strings instead of integers. The function adds leading zeros to the smaller number to make both the same length and then adds the digits column by column, the same way we were taught in primary school. This function can, in principle, be used for several other Euler Problems involving large integers.

# Add numbers with many digits
big.add <- function(a, b) {
    # Add leading zeros to the smaller number
    if (nchar(a) < nchar(b)) 
        a <- paste0(paste(rep(0, nchar(b) - nchar(a)), collapse = ""), a) 
    if (nchar(a) > nchar(b)) 
        b <- paste0(paste(rep(0, nchar(a) - nchar(b)), collapse = ""), b)
    solution <- vector()
    remainder <- 0
    for (i in nchar(b):1) {
        p <- as.numeric(substr(a, i, i))
        q <- as.numeric(substr(b, i, i))
        r <- p + q + remainder 
        if (r >= 10 & i != 1) {
            solution <- c(solution, r %% 10)
            remainder <- (r - (r %% 10))/10
        } else {
            solution <- c(solution, r)
            remainder <- 0
        }
    }
    return(paste(rev(solution), collapse = ""))
}
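Two quick checks of the carrying logic, including a final carry that lengthens the result:

> big.add("999", "1")
[1] "1000"
> big.add("123456789", "987654321")
[1] "1111111110"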

With this function, the problem is easy to solve. The second part of the code runs this function over the one hundred numbers provided on the Euler Problem page and calculates the answer.

numbers <- readLines("Euler/p013_numbers.txt")
answer <- "0"
for (i in numbers) {
    answer <- big.add(answer, i)
}
answer <- substr(answer, 1, 10)

Multiplying Big Numbers

You can expand this function to multiply a very large number by a smaller number using the Reduce function. big.mult adds the number a to itself, using the big.add function, until this has been repeated b times. The number b needs to be a 'low' number because the function creates a vector of length b.

big.mult <- function(a, b) {
    Reduce(big.add, rep(a, as.numeric(b)))
}
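For example, multiplying 2^{64} by three with this sketch:

> big.mult("18446744073709551616", "3")
[1] "55340232221128654848"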

Euler Problem 12: Highly Divisible Triangular Number

Euler Problem 12: Divisors of triangular numbers.

The divisors of 10 illustrated with Cuisenaire rods: 1, 2, 5, and 10 (Wikipedia).

Euler Problem 12 takes us to the realm of triangular numbers and their divisors.

The image on the left shows a hands-on method to visualise the number of divisors of an integer. Cuisenaire rods are learning aids that provide a tangible way to explore mathematics.

Euler Problem 12 Definition

The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28 . The first ten terms would be: 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, \ldots Let us list the factors of the first seven triangle numbers:

1: 1

3: 1, 3

6: 1, 2, 3, 6

10: 1, 2, 5, 10

15: 1, 3, 5, 15

21: 1, 3, 7, 21

28: 1, 2, 4, 7, 14, 28

We can see that 28 is the first triangle number to have over five divisors. What is the value of the first triangle number to have over five hundred divisors?

Solution

Vishal Kataria explains a simple method to determine the number of divisors using prime factorisation in his video below. The prime factorisation of n is given by:

n = p_1^{\alpha_1} \times p_2^{\alpha_2} \times \ldots \times p_k^{\alpha_k}

The number of divisors is then:

d = (\alpha_1 + 1) (\alpha_2 + 1) \ldots (\alpha_k + 1)

The code reuses the prime factorisation function developed for Euler Problem 3. This function results in a vector of all prime factors, e.g. the prime factors of 28 are 2, 2 and 7.

The code to solve this problem determines the values for alpha using the run length function (rle). This function counts the number of times each element in a sequence is repeated and returns a vector of the unique values and their run lengths. The prime factors of 28 are 2 and 7, with run lengths of 2 and 1. The number of divisors can now be determined:

28 = 2^2 \times 7^1

d = (2+1)(1+1) = 6
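The prime.factors function itself is not repeated in this post. A minimal trial-division sketch with the same behaviour (an assumption about its shape, not the original code from Problem 3) could look like this:

# Prime factorisation by trial division
prime.factors <- function(n) {
    factors <- vector()
    d <- 2
    while (n > 1) {
        while (n %% d == 0) {
            factors <- c(factors, d)
            n <- n / d
        }
        d <- d + 1
    }
    return(factors)
}

> prime.factors(28)
[1] 2 2 7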

The code to solve Euler Problem 12 is shown below. The loop continues until it finds a triangular number with at least 500 divisors. The first two lines increment the index and create the next triangular number. The third line in the loop determines the number of times each prime factor is repeated (the run lengths). The last line calculates the number of divisors using the formula above.

i <- 0
divisors <- 0
while (divisors < 500) {
    i <- i + 1
    triangle <- (i * (i+1)) / 2
    pf <- prime.factors(triangle)
    alpha <- rle(pf)
    divisors <- prod(alpha$lengths+1)
}
answer <- triangle
print(answer)

Percentile Calculations in Water Quality Regulations

Percentile calculations can be trickier than first meets the eye. A percentile indicates the value below which a given percentage of observations falls. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceptively simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates semi-random data, plots the histogram and visualises the third quartile.

set.seed(1969)
test.data <- rnorm(n=10000, mean=100, sd=15)
library(ggplot2)
ggplot(as.data.frame(test.data), aes(test.data)) + 
 geom_histogram(binwidth=1, aes(y=..density..), fill="dodgerblue") + 
 geom_line(stat="function", fun=dnorm, args=list(mean=100, sd=15), colour="red", size=1) + 
 geom_area(stat="function", fun=dnorm, args=list(mean=100, sd=15), 
 colour="red", fill="red", alpha=0.5, xlim=quantile(test.data, c(0.5, 0.75))) + 
 theme(text=element_text(size=16))

The default quantile function and the 95th percentile give the following results:

> quantile(test.data)
       0%       25%       50%       75%      100% 
 39.91964  89.68041 100.16437 110.01910 153.50195 

> quantile(test.data, probs=0.95)
     95% 
124.7775 

Methods of percentile calculations

The quantile function in R provides for nine different ways to calculate percentiles. Each of these options uses a different method to interpolate between observed values. I will not discuss the mathematical nuances between these methods. Hyndman and Fan (1996) provide a detailed overview of these methods.

The differences between the nine available methods only matter in skewed distributions, such as water quality data. For the large normal sample simulated above, the outcome of all methods is virtually the same, as illustrated by the following code.

> sapply(1:9, function(m) quantile(test.data, 0.95, type=m))

     95%      95%      95%      95%      95%      95%      95%      95%      95% 
124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 

Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement, requiring instead that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the Weibull Method. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing, as this is not a name statisticians use. In Hyndman & Fan (1996), this method has the less poetic name \hat{Q}_6(p). Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the weibull.quantile function shown below calculates a quantile p for a vector x using this method. This function gives the same result as quantile(x, p, type=6).

weibull.quantile <- function(x, p) {
    # Order samples from small to large
    x <- x[order(x, decreasing=FALSE)]
    # Determine ranking of percentile according to Weibull (1939)
    r <- p * (length(x) + 1)
    # Linear interpolation
    rfrac <- (r - floor(r))
    return((1 - rfrac) * x[floor(r)] + rfrac * x[floor(r) + 1])
}
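A quick check with the simulated data from earlier confirms that both calculations agree:

> weibull.quantile(test.data, 0.95)
[1] 124.7775
> quantile(test.data, 0.95, type = 6)
     95% 
124.7775 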

Turbidity Data Example

Turbidity data is not normally distributed as it is always greater than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla illustrate the percentile calculations. The weekly turbidity measurements range between 0.05 NTU and 0.8 NTU, well below the aesthetic limit.

Turbidity at customer tap for each zone in the Tarnagulla system in 2016 (n=53).

When we calculate the 95th percentiles for all nine methods available in the base-R quantile function (columns R1 to R9 correspond to types 1 to 9), we see that the so-called Weibull method (R6) generally provides the most conservative result.

Zone          R1        R2        R3        R4        R5        R6        R7        R8        R9
Bealiba       0.300     0.300     0.200     0.240     0.290     0.300     0.245     0.300     0.300
Dunolly       0.40000   0.40000   0.30000   0.34000   0.39000   0.43500   0.34500   0.40500   0.40125
Laanecoorie   0.50000   0.50000   0.40000   0.44000   0.49000   0.53500   0.44500   0.50500   0.50125
Tarnagulla    0.4       0.4       0.4       0.4       0.4       0.4       0.4       0.4       0.4

The graph and the table were created with the following code snippet:

ggplot(turbidity, aes(Result)) + 
 geom_histogram(binwidth=.05, fill="dodgerblue", aes(y=..density..)) + 
 facet_wrap(~Zone) + 
 theme(text=element_text(size=16))

tapply(turbidity$Result, turbidity$Zone, 
 function(x) sapply(1:9, function(m) quantile(x, 0.95, type=m)))

Euler Problem 11: Largest Product in a Grid

Euler Problem 11 Definition

In the 20×20 grid below, four numbers along a diagonal line have been marked in red.

Euler Problem 11: the 20×20 grid

The product of these numbers is 26 × 63 × 78 × 14 = 1,788,696. What is the greatest product of four adjacent numbers in the same direction (up, down, left, right, or diagonally) in the 20 by 20 grid?

Solution

The solution applies straightforward vector arithmetic. The product of all verticals is a matrix with the products of rows 1 to 4, rows 2 to 5, and so on, for every column. The code uses the same logic for the horizontals and the two diagonal directions.

# Read and convert data
square <- readLines("Euler/p011_matrix.txt")
square <- as.numeric(unlist(strsplit(square, " ")))
square <- matrix(square, ncol = 20)

# Define products
prod.vert <- square[1:17, ] * square[2:18, ] * square[3:19, ] * square[4:20, ]
prod.hori <- square[,1:17] * square[,2:18] * square[,3:19] * square[,4:20]
prod.dia1 <- square[1:17, 1:17] * square[2:18, 2:18] * square[3:19, 3:19] * square[4:20, 4:20]
prod.dia2 <- square[4:20, 1:17] * square[3:19, 2:18] * square[2:18, 3:19] * square[1:17, 4:20]

answer <- max(prod.vert, prod.hori, prod.dia1, prod.dia2)


Lifting the Big Data Veil: Data Science Strategy for Water Utilities

The Data Science Venn Diagram (Conway, 2010).

In my job as manager data science for a medium-sized water utility in Australia, I have developed a strategy to increase the amount of value we extract from data.

Many businesses that seek the promised benefits of Big Data don't achieve them because they don't start with the basics.

The most important data science strategy advice is to spend a lot of time getting to know your data and improving its quality.

Good data science needs to comply with these four basic principles:

  • Utility: The analysis needs to be able to improve reality, otherwise we end up with ‘analysis-paralysis’. Although we speak of data science, it is really data engineering because we are not seeking the truth, we seek improvement of reality.
  • Soundness: The analysis needs to be scientifically valid so that managers can make reliable decisions.
  • Aesthetics: Visualisations need to be pleasing to the eye, not as a beautification but to ensure users draw correct conclusions.
  • Reproducibility: Analysts need to be able to repeat the work of other people to ensure quality control. This is where the science comes into data analytics.

I have recently published a paper about data science strategy for water utilities to share some of my thoughts on this topic.

Data Science Strategy for Water Utilities

Abstract: Big Data promises future benefits by using smart algorithms to improve the customer experience. Many organisations struggle leveraging the benefits of the data revolution. This paper summarises how water utilities can use the emerging field of data science to create value from information. The paper explains the principles of data science and illustrates these using examples from water utilities. This paper closes with recommendations on how to implement data science projects to maximise value from data. These benefits are realised using existing investments in information technology infrastructure and existing competencies.

You can read an extract of the paper on the Australian Water Association website. The full version is behind their paywall.

Furthermore, I am interested in creating an alliance with other water utility professionals that write code in R. Feel free to comment below to discuss any thoughts you might have on this issue.

Euler Problem 10: Summation of Primes

Euler Problem 10 asks for the summation of primes. Computationally this is a simple problem because we can re-use the prime sieve developed for problem 3.

When generating a large number of primes, the erratic pattern in which they occur is much more interesting than their sum. Mathematicians consider primes the basic building blocks of number theory. No matter how hard we look, however, they do not seem to obey any logical sequence. This Euler Problem is simply an exercise in adding a long list of numbers.

Euler Problem 10 Definition

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.

Solution

The Sieve of Eratosthenes function used in Euler Problem 3 can be reused once again to generate the prime numbers between two and two million. An interesting issue occurs when running the code: when I sum all the primes without the as.numeric conversion, R throws an integer overflow error and recommends the conversion.
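The esieve function lives in the Problem 3 post; for completeness, a minimal Sieve of Eratosthenes of the same shape (an assumed reconstruction, not the original code) could look like this:

# Sieve of Eratosthenes: returns all primes up to n
# Note: which() returns integers, hence the as.numeric conversion below
esieve <- function(n) {
    sieve <- rep(TRUE, n)
    sieve[1] <- FALSE
    for (i in 2:floor(sqrt(n))) {
        if (sieve[i])
            sieve[seq(i * i, n, by = i)] <- FALSE
    }
    which(sieve)
}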

primes <- esieve(2e6)
answer <- (sum(as.numeric(primes)))
print(answer)

The sieve results in a vector containing the 148,933 prime numbers below two million. The largest prime gap in this range is 132, and sexy primes (prime pairs that differ by six) appear to be more common than twin or cousin primes (note the spikes at multiples of six in the bar chart).

Euler Problem 10: Prime gaps for all primes up to two million

The summing of primes reveals an interesting problem in mathematics. Goldbach’s conjecture is one of the oldest and best-known unsolved problems in number theory and states that:

Every even integer greater than 2 can be expressed as the sum of two primes.

Note that this conjecture is only about even numbers. Goldbach also theorised that every odd composite number can be written as the sum of a prime and twice a square. This conjecture is the subject of Euler Problem 46, which I will work on soon.

SCADA spikes in Water Treatment Data

SCADA spikes are short-lived peaks in the data stream of water treatment plants or similar installations. These spikes can indicate problems with the treatment process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure of the cloudiness of a fluid, caused by large numbers of individual particles that are otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language.

Simulation

To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing, while the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU.

# Simulate data
set.seed(1234)
turbidity <- data.frame(DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"), by = "min", length.out=24*60),
                        Turbidity = rnorm(n = 24*60, mean = 0.1, sd = 0.01)
 )

The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
   time <- sample(turbidity$DateTime, 1)
   duration <- sample(10:30, 1)
   value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
   start <- which(turbidity$DateTime == time)
   turbidity$Turbidity[start:(start+duration - 1)] <- rnorm(duration, value, value/10)
}

The image below visualises the simulated data using the mighty ggplot. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

library(ggplot2)
ggplot(turbidity, aes(x = DateTime, y = Turbidity)) + geom_line(size = 0.2) + 
   geom_hline(yintercept = 0.5, col = "red") + ylim(0,max(turbidity$Turbidity)) + 
   ggtitle("Simulated SCADA data")

Simulated SCADA data with spikes

SCADA Spikes Detection

The following code searches for all spikes over 0.50 NTU using the run length function. This function transforms a vector into a vector of values and lengths. For example, the run length of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

  • lengths: int [1:5] 2 3 4 2 1
  • values : num [1:5] 1 2 3 5 6

The value 1 has a run length of 2, the value 2 a run length of 3, and so on. The spike detection code computes the run lengths for turbidity levels greater than 0.5 NTU, which results in a logical vector. The cumsum function converts the run lengths into the index at which each run ends, which allows us to calculate the duration of each spike.
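A quick illustration of these mechanics with the example vector above:

> x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6)
> cumsum(rle(x)$lengths)  # index at which each run ends
[1]  2  5  9 11 12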

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other spikes were either lower than 0.50 NTU or, like the first high spike, lasted less than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
    # Run lengths of the periods above and below the height threshold
    runlength <- rle(Value > Height)
    spikes <- data.frame(Spike = runlength$values,
                         times = cumsum(runlength$lengths))
    # Time stamp at the end of each run
    spikes$Times <- DateTime[spikes$times]
    # Duration of each run in minutes (one-minute sampling interval)
    spikes$Event <- c(0, spikes$Times[-1] - spikes$Times[-nrow(spikes)])
    spikes <- subset(spikes, Spike == TRUE & Event > Duration)
    return(spikes)
}
spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)

This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

Euler Problem 9: Special Pythagorean Triple

Euler Problem 9 Definition

Euler Problem 9: Pythagorean Triples

Scatter plot of the legs (a, b) of the first Pythagorean triples with a and b less than 6000. Negative values are included to illustrate the parabolic patterns. By Dearjean13 (own work), CC BY-SA 4.0.

A Pythagorean triple is a set of three natural numbers, a < b < c , for which, a^2 + b^2 = c^2 . For example:

3^2 + 4^2 = 9 + 16 = 25 = 5^2 .

There exists exactly one Pythagorean triplet for which a + b + c = 1000 .

Find the product of a, b and c.

Brute Force Solution

This solution uses brute force and checks all combinations of a, b and c. To limit the solution space, I used the fact that a < b < c, which implies that a < s/3 and a < b < s/2, where s is the sum of the three sides.

a <- 0
b <- 0
c <- 0
s <- 1000
found <- FALSE
for (a in 1:floor((s/3))) {
    for (b in a:(s/2)) {
        c <- s - a - b
        if (a^2 + b^2 == c^2) {
            found <- TRUE
            break
        }
    }
    if (found) 
        break
}
answer <- a * b * c


Trumpworld Analysis : Ownership Relations in his Business Network

You do not need a machine learning algorithm to predict that the presidency of Donald Trump will be controversial.

One of the most discussed aspects of his reign is the massive potential for conflicts of interest. Trump’s complex business empire is entangled with national and international politics.

Buzzfeed has mapped many of the relationships between businesses and people in what they have dubbed Trumpworld. They provided the data to enable citizen data scientists to look into the wheelings and dealings of Donald J. Trump. The raw data set consists of three subsets of connections between:

  • Organisations
  • People
  • People and organisations

Trumpworld Analysis

This post analyses the connections between organisations using the mighty igraph package in the R language. The code snippet below converts the data into a graph that can be analysed with social network analysis techniques. I downloaded the raw data as CSV files and subset it to contain only ownership relationships.

# Read data
trumpworld.org <- read.csv("TrumpWorld DataOrg.csv")
trumpworld.org.ownership <- subset(trumpworld.org, Connection == "Ownership")[, 1:2]

# Create graph of ownerships
library(igraph)
org.ownership <- graph.edgelist(as.matrix(trumpworld.org.ownership))

# Analysis
nrow(trumpworld.org.ownership)
length(unique(c(trumpworld.org.ownership[,1], trumpworld.org.ownership[,2])))

# Plot Graph
par(mar=rep(0,4))
plot(org.ownership,
 layout=layout.fruchterman.reingold,
 vertex.label=NA,
 vertex.size=2,
 edge.arrow.size=.1
)

Trumpworld analysis: business ownership network

Network Analysis

This network contains 309 ownership relationships between 322 firms.

When we plot the data, we see that most relationships are between two firms. The plot is organised with the Fruchterman-Reingold algorithm to improve its clarity.

We can also see a large cluster in the centre. The names have been removed for clarity.

The Trumpworld analysis continues with this conglomerate. The second code section extracts this connected subnetwork so we can analyse it in more detail.

# Find most connected firm
which.max(degree(org.ownership))
# Create subnetworks
org.ownership.d <- decompose(org.ownership)
# Find the subnetwork with the largest diameter
largest <- which.max(sapply(org.ownership.d, diameter))
# Plot the largest subnetwork
plot(org.ownership.d[[largest]],
 layout=layout.fruchterman.reingold,
 vertex.label.cex=.5,
 vertex.size=5,
 edge.arrow.size=.1
)

Digging Deeper

The node with the highest degree identifies the business with the most holdings. This analysis shows that DJT Holdings LLC owns 33 other organisations. These organisations own other organisations. We can now use the cluster function to investigate this subnetwork.

Trumpworld holdings

This Trumpworld analysis shows that the ownership network is clearly a star network. DJT Holdings LLC centrally controls all organisations. Perhaps this graph visualises the management style of the soon-to-be president. Trump centrally controls his empire, which is typical of a family business.

Does this chart visualise Trump’s leadership style? Is the star network an expression of his lack of trust and thus desire to oversee everything directly?