Tap Water Sentiment Analysis using Tidytext

In developed countries, tap water is safe to drink and available at a meagre price. Although high-quality drinking water is almost freely available, the consumption of bottled water increases every year. Bottled water companies use sophisticated marketing strategies, while water utilities are mostly passive providers of a public service. Australian marketing expert Russell Howcroft even called water utilities “lazy marketers”. Can we use data science to find out how people feel about tap water and to learn about the reasons behind this loss of trust in the municipal water supply?

This tap water sentiment analysis estimates the attitudes people have towards tap water by analysing tweets. This article explains how to examine tweets about tap water using the R language for statistical computing and the Tidytext package. The most recent version of the code and the raw data set used in this analysis can be viewed on my GitHub page.

Tap Water Sentiment Analysis

Every tweet that contains the words “tap water” carries a message about the author’s attitude towards that topic. Each text expresses a sentiment about the topic it describes. Sentiment analysis is a data science technique that extracts subjective information from a text. The basic method compares a string of words with a set of words with calibrated sentiments. These calibrated sets are created by asking many people how they feel about a certain word. For example, the word “stink” expresses a negative sentiment, while the word “nice” expresses a positive sentiment.
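
As a minimal sketch of this principle (a hypothetical example, not part of the analysis below), the following snippet splits a sentence into single words and joins them with the calibrated “bing” lexicon included with Tidytext. Words without a calibrated sentiment are dropped, and the exact matches depend on the lexicon version installed on your system.

# Minimal sketch: compare the words in a sentence with a calibrated lexicon
library(tibble)
library(dplyr)
library(tidytext)

tibble(text = "This tap water stinks but that bottle looks nice") %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments("bing"), by = "word")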

This tap water sentiment analysis consists of three steps. The first step extracts 1000 tweets that contain the words “tap water” from Twitter. The second step cleans the data, and the third step undertakes the analysis and visualises the results.

Extracting tweets using the TwitteR package

The TwitteR package by Jeff Gentry makes it very easy to retrieve tweets using search criteria. You will need to register an application with Twitter to receive the API keys and tokens. In the code below, the actual values have been removed; follow the instructions in this article to obtain these codes for yourself. This code snippet calls a private file to load the API codes, extracts the tweets and creates a data frame with a tweet id number and its text.
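
The private twitteR_API.R file is assumed to look something like the sketch below, with placeholders instead of actual values (the variable names match the call to setup_twitter_oauth in the next snippet).

# twitteR_API.R: keep this file out of version control
api_key <- "your API key"
api_secret <- "your API secret"
token <- "your access token"
token_secret <- "your access token secret"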

# Init
library(tidyverse)
library(tidytext)
library(twitteR)

# Extract tap water tweets
source("twitteR_API.R")
setup_twitter_oauth(api_key, api_secret, token, token_secret)
tapwater_tweets <- searchTwitter("tap water", n = 1000, lang = "en") %>%
  twListToDF() %>%
  select(id, text)
tapwater_tweets <- subset(tapwater_tweets, !duplicated(tapwater_tweets$text))
tapwater_tweets$text <- gsub("’", "'", tapwater_tweets$text)
write_csv(tapwater_tweets, "Hydroinformatics/tapwater_tweets.csv")

When I first extracted these tweets, a tweet by CNN about tap water in Kentucky that smells like diesel had been retweeted many times, so I removed all duplicate tweets from the set. Unfortunately, this left fewer than 300 original tweets in the corpus.

Sentiment analysis with Tidytext

Text analysis can be a powerful tool to extract meaning from large amounts of unstructured text. The R language has an extensive range of packages to help you undertake such a task. The Tidytext package extends the Tidy Data logic promoted by Hadley Wickham and his Tidyverse software collection.

Data Cleaning

The first step in cleaning the data is to create unigrams, which involves splitting the tweets into individual words that can be analysed. The next step is to look at which words are most commonly used in the tap water tweets and to visualise the result.

# Tokenisation
tidy_tweets <- tapwater_tweets %>%
  unnest_tokens(word, text)

data(stop_words)
tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("tap", "water", "rt", "https", "t.co", "gt", "amp", as.character(0:9)))

tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col(fill = "dodgerblue4") +
    xlab(NULL) + coord_flip() + ggtitle("Most common words in tap water tweets")
ggsave("Hydroinformatics/tapwater_words.png", dpi = 300)

Most common words in tap water sentiment analysis

The most common words relate to drinking the water and to bottled water, which makes sense. The recent issues in Kentucky also feature in this list.

Sentiment Analysis

The Tidytext package contains three lexicons of thousands of single English words (unigrams) that were manually assessed for their sentiment. The principle of the sentiment analysis is to compare the words in the text with the words in the lexicon and analyse the results. For example, the statement “This tap water tastes horrible” has a sentiment score of -3 in the AFINN lexicon by Finn Årup Nielsen, due to the word “horrible”. In this analysis, I have used the “bing” lexicon published by Liu et al. in 2005.
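
You can reproduce this example with a few lines of code. This sketch assumes the AFINN lexicon is available through the get_sentiments function; in recent versions of Tidytext, the column that holds the score is called value.

# Sketch: score the example sentence with the AFINN lexicon
# Only "horrible" matches an entry, giving a total score of -3
tibble(text = "This tap water tastes horrible") %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%
    summarise(score = sum(value))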

# Sentiment analysis
sentiment_bing <- tidy_tweets %>%
  inner_join(get_sentiments("bing"))

sentiment_bing %>%
  summarise(negative = sum(sentiment == "negative"),
            positive = sum(sentiment == "positive"))

sentiment_bing %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  ggplot(aes(word, n, fill = sentiment)) + geom_col(show.legend = FALSE) + 
    coord_flip() + facet_wrap(~sentiment, scales = "free_y") + 
    ggtitle("Contribution to sentiment") + xlab(NULL) + ylab(NULL)
ggsave("Hydroinformatics/tapwater_sentiment.png", dpi = 300)

This tap water sentiment analysis shows that two-thirds of the words that express a sentiment were negative. The most common negative words were “smells” and “scared”. This analysis is not a positive result for water utilities. Unfortunately, most tweets were not geotagged, so I could not determine the origin of the sentiment.

Tap water sentiment analysis

Sentiment analysis is an interesting explorative technique, but it should not be interpreted as absolute truth. This method is not able to detect sarcasm or irony, and words don’t always have the same meaning as described in the dictionary.
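
For example (a hypothetical sentence, scored as a sketch with the bing lexicon used above), a word-by-word comparison misses the sarcasm: “great” matches a positive entry even though the tweet as a whole is clearly negative.

# Sketch: a word-by-word lexicon cannot detect sarcasm
tibble(text = "Oh great, my tap water smells like diesel again") %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments("bing"), by = "word")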

The important message for water utilities is that they need to start taking the aesthetic properties of tap water as seriously as the health parameters. A lack of trust will drive consumers to bottled water or to less healthy alternatives such as soft drinks.

If you would like to know more about customer perceptions of tap water, then read my book Customer Experience Management for Water Utilities, published by IWA Publishing.


Using the iGraph package to Analyse the Enron Corpus

The Enron scandal is one of the most famous corporate governance failures in the history of capitalism. The case itself is interesting in its own right, but in this post I am more interested in one of the data sets that the subsequent investigation provided.

This blog post analyses an extensive collection of e-mails from former Enron employees. The Enron corpus is analysed using network analysis tools provided by the iGraph package. Network analysis is a versatile technique that can be used to add value to a lot of different data sets, including the complex corporate relationships of Donald Trump.

The Enron Corpus

As part of their inquiries, the Federal Energy Regulatory Commission used an extensive collection of emails from Enron employees. The Enron corpus is one of the few large collections of emails publicly available for research. This dataset also provides a fascinating playground for citizen data scientists.

The set has privacy issues as it contains messages from living people. When analysing this data set, we need to keep in mind that the majority of former Enron employees were innocent people who lost their jobs due to the greed of their overlords. The code in this post only analyses the e-mail headers, ignoring the content.

Laid-off Enron employees outside Enron headquarters as the company collapsed in 2001.

The Enron Corpus is a large database of half a million emails generated by more than 100 Enron employees. You can download the corpus from the Carnegie Mellon School of Computer Science. The first code snippet downloads the 7 May 2015 version of the dataset (about 423Mb, tarred and gzipped) and untars it to your working directory.

# Enron Email Dataset: https://www.cs.cmu.edu/~./enron/
download.file("https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz", destfile = "enron_mail_20150507.tgz")
untar("enron_mail_20150507.tgz")

Preparing the Data

The main folder is maildir, which holds all the personal accounts. Our first task is to load the required libraries and create a list of available emails. This code results in 517,401 e-mail files with 44,859 emails in the inboxes of users.

# E-mail corpus consists of nested folders per user with e-mails as text files
# Create list of all available e-mails
emails <- list.files("maildir/", full.names = T, recursive = T)
length(emails)
# Filter by inbox only
emails <- emails[grep("/inbox", emails)]
length(emails)

The bulk of the code creates a list of e-mails between Enron employees. The code performs a lot of string manipulations to extract the information from the text files. The content of the e-mails is ignored; the code only retrieves the sender and the receiver. The analysis is limited to e-mails between employees in the corpus: only those receivers whose inbox forms part of the analysis are included. The result of this code is a data frame with the usernames of the sender and receiver for each e-mail. The data frame contains 2779 e-mails that meet the criteria.

# Create list of sender and receiver (inbox owner)
inboxes <- data.frame(
    from = apply(as.data.frame(emails), 1, function(x){readLines(x, warn = F)[3]}),
    to = emails,
    stringsAsFactors = F
)

# Keep only enron.com and strip all but username
library(stringr) # String manipulation
inboxes <- inboxes[grepl("@enron.com", inboxes$from),]
inboxes$from <- str_sub(inboxes$from, 7, nchar(inboxes$from) - 10)
to <- str_split(inboxes$to, "/")
inboxes$to <- sapply(to, "[", 3)

# Create list of usernames
users <- data.frame(user = paste0("maildir/", unique(inboxes$to)))

# Remove those without sent mails
sent <- apply(users, 1, function(x){sum(grepl("sent", dir(x)))})
users <- subset(users, sent != 0)

# Replace username with e-mail name
users$mailname <- NA
for (i in 1:nrow(users)) {
    sentmail <- dir(paste0(users$user[i], "/sent_items/"))
    name <- readLines(paste0(users$user[i], "/sent_items/", sentmail[1]), warn = F)[3]
    name <- str_sub(name, 7, nchar(name) - 10)
    users$mailname[i] <- name
}
users$user <- str_sub(users$user, 9)
inboxes <- merge(inboxes, users, by.x = "to", by.y = "user")
inboxes <- data.frame(from = inboxes$from, to = inboxes$mailname)

inboxes$from <- as.character(inboxes$from)
inboxes$to <- as.character(inboxes$to)

# Only e-mails between inbox users
inboxes <- inboxes[inboxes$from %in% inboxes$to,]

# Remove no.address
inboxes <- subset(inboxes, from != "no.address" & to != "no.address")

# Remove emails to self
inboxes <- subset(inboxes, inboxes$from != inboxes$to)

Network Analysis

The last code snippet defines a graph from the table of e-mails. Each employee is a node in the network, and each e-mail is an edge (line). The iGraph package is a powerful tool to analyse networks. The graph_from_edgelist function creates a network object that can be analysed using the iGraph package. The graph is directed because the information flows from the sender to the receiver.

In the next step, the Spinglass algorithm finds the community structure within the data. A community is a group of nodes that are more connected with each other than with any other node.

The last step visualises the network. The diagram is drawn using the Fruchterman-Reingold algorithm, which places the most connected nodes at the centre of the picture. The background colours in the diagram indicate the eight communities.

The graph tells us a lot about the group of employees in the Enron corpus and how they relate to each other. Each of the communities represents a tightly connected group of employees that mainly e-mail each other. Any connections between communities are shown in red. When the vertex.label = NA line is removed, the usernames are displayed at each node.

We can see groups that never email each other, lonely hangers-on and tightly knit cliques within Enron. In the centre of the graph, we see a few individuals who are connectors between groups because they send a lot of emails to people outside their community. On the fringes of the graph are the hangers-on who only emailed somebody in the corpus once or twice but were still included in the investigation. The sketch after the code below puts numbers on these roles.

library(igraph)
g <- graph_from_edgelist(as.matrix(inboxes), directed = T)
coms <- spinglass.community(g)

# Plot network
par(mar = c(0,0,2,0))
plot(coms, g, 
     vertex.label=NA, 
     layout = layout.fruchterman.reingold,
     vertex.size = 3
)
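
As a sketch of how to quantify these roles (an addition, not part of the original analysis), betweenness centrality ranks the connectors that bridge communities, while the degree identifies the hangers-on with only a few e-mails.

# Sketch: quantify connectors and hangers-on in the graph defined above
head(sort(betweenness(g), decreasing = TRUE), 10) # connectors bridging communities
head(sort(degree(g)), 10) # hangers-on with only one or two e-mails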

View the most recent version of the code on GitHub.

This analysis provides only a quick glimpse into the knowledge contained in the Enron corpus. An extensive range of tools is available to analyse such networks. An interesting exercise would be to overlap this network with the organisation chart to see the relationships between teams. Have fun playing with this fantastic data set!
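
A minimal sketch of that exercise, assuming a hypothetical file enron_roles.csv with a user column that matches the vertex names and a team column with the organisational unit:

# Sketch: colour the network by organisational unit
# enron_roles.csv is a hypothetical file with columns user and team
roles <- read.csv("enron_roles.csv", stringsAsFactors = FALSE)
V(g)$team <- roles$team[match(V(g)$name, roles$user)]
plot(g,
     vertex.color = as.numeric(as.factor(V(g)$team)),
     vertex.label = NA,
     vertex.size = 3,
     layout = layout.fruchterman.reingold)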

Enron corpus network with communities.


Euler Problem 22: Names Scores


R logo in ASCII art by picascii.com

Euler problem 22 is another trivial one that takes us to the realm of ASCII codes. ASCII is a standard that converts symbols into numbers; it has its origins in telegraph codes.

Back in the 8-bit days, ASCII art was a method to create images without using lots of memory. Each image consists of a collection of text characters that give the illusion of an image. Euler problem 22 is, unfortunately, a bit less poetic.

Euler Problem 22 Definition

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49,714.

What is the total of all the name scores in the file?

Solution

This code reads and cleans the file and sorts the names alphabetically. The charToRaw function determines the numerical value of each character in each name. The output of this function is the ASCII code of each character in raw (hexadecimal) form. The letter A is number 65, so we subtract 64 from each value to calculate the name scores.

# ETL: reads the file and converts it to an ordered vector.
names <- readLines("https://projecteuler.net/project/resources/p022_names.txt", warn = F)
names <- unlist(strsplit(names, ","))
names <- gsub("[[:punct:]]", "", names)
names <- sort(names)

# Total Name scores
answer <- 0
for (i in names) {
    value <- sum(sapply(unlist(strsplit(i, "")), function(x) as.numeric(charToRaw(x)) - 64))
    value <- value * which(names == i)
    answer <- answer + value
}
print(answer)
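
As a quick check of the charToRaw logic, we can reproduce the COLIN example from the problem definition.

# Verify the COLIN example: C = 3, O = 15, L = 12, I = 9, N = 14
as.numeric(charToRaw("COLIN")) - 64 # 3 15 12 9 14
sum(as.numeric(charToRaw("COLIN")) - 64) # 53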

We can have a bit more fun with this problem by comparing this list with the most popular baby names in 2016. The first section of the code extracts the list of popular names from the website. The rest of the code counts the number of matches between the lists.

# Most popular baby names
library(rvest)
url <- "https://www.babycenter.com/top-baby-names-2016.htm"
babynames <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="babyNameList"]/table') %>%
    html_table()
babynames <- babynames[[1]]

# Convert Project Euler list and test for matches
proper <- function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
names <- proper(names)

sum(babynames$GIRLS %in% names)
sum(babynames$BOYS %in% names)

View the latest version of this code on GitHub.

Euler Problem 17: Number Letter Counts

Euler Problem 17: written numbers

Euler Problem 17 asks us to count the letters in numbers written as words. This is a skill we all learnt in primary school, mainly useful when writing cheques, for those who still use them.

Each language has its own rules for writing numbers. My native language, Dutch, follows a very different logic to English. Both Dutch and English use compound words for numbers after twelve.

Linguists have theorised that this is evidence that early Germanic numbers were duodecimal. This theory is supported by the importance of a “dozen” as a counting word and the twelve hours on a clock face. There is even a Dozenal Society that promotes the use of a number system based on 12.

The English language changes the rules when reaching the number 21. While we say “eighteen” in English, with the single digit first, we do not say “one-twenty” for 21. Dutch stays consistent: the last digit is always spoken first. For example, 37 in English is “thirty-seven”, while in Dutch it is written as “zevenendertig” (seven and thirty).

Euler Problem 17 Definition

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total. If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.

Solution

The first piece of code provides a function that generates the words for numbers 1 to 999,999. This is more than the problem asks for, but it might be a useful function for another application. The last line concatenates all words together and removes the spaces.

numword.en <- function(x) {
    if (x > 999999) return("Error: Outside my vocabulary")
    # Vocabulary 
    single <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
    teens <- c( "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
    tens <- c("ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
    # Translation
    numword.10 <- function (y) {
        a <- y %% 100
        if (a != 0) {
            and <- ifelse(y > 100, "and", "")
            if (a < 20)
                return (c(and, c(single, teens)[a]))
            else
                return (c(and, tens[floor(a / 10)], single[a %% 10]))
        }
    }
    numword.100 <- function (y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a != 0)
            return (c(single[a], "hundred"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a != 0)
            return (c(numword.100(a), numword.10(a), "thousand"))
    }
    numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse=" ")
    return (trimws(numword))
}

answer <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.en), collapse="")))
print(answer)

Writing Numbers in Dutch

I went beyond Euler Problem 17 by translating the code to spell numbers in Dutch. An interesting bit of trivia: it takes 307 fewer characters to spell the numbers 1 to 1000 in Dutch than it does in English.

It would be good if readers could submit functions for other languages in the comment section. Perhaps we can create an R package with a multilingual function for spelling numbers.

numword.nl <- function(x) {
    if (x > 999999) return("Error: Getal te hoog.")
    single <- c("een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen")
    teens <- c( "tien", "elf", "twaalf", "dertien", "veertien", "fifteen", "zestien", "zeventien", "achtien", "negentien")
    tens <- c("tien", "twintig", "dertig", "veertig", "vijftig", "zestig", "zeventig", "tachtig", "negengtig")
    numword.10 <- function(y) {
        a <- y %% 100
        if (a != 0) {
            if (a < 20)
                return(c(single, teens)[a])
            if (a %% 10 == 0) # no "en" for multiples of ten
                return(tens[floor(a / 10)])
            return(c(single[a %% 10], "en", tens[floor(a / 10)]))
        }
    }
    numword.100 <- function(y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a == 1)
            return ("honderd")
        if (a > 1) 
            return (c(single[a], "honderd"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a == 1)
            return ("duizend ")
        if (a > 0)
            return (c(numword.100(a), numword.10(a), "duizend "))
    }
    numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse = "")
    return (trimws(numword))
}

antwoord <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.nl), collapse="")))
print(antwoord)

print(answer - antwoord)

View the latest version of this code on GitHub.