Tap Water Sentiment Analysis using Tidytext

In developed countries, tap water is safe to drink and available for a meagre price. Despite the fact that high-quality drinking water is almost freely available, the consumption of bottled water is increasing every year. Bottled water companies use sophisticated marketing strategies, while water utilities are mostly passive providers of public service. Australian marketing expert Russell Howcroft even called water utilities “lazy marketers”. Can we use data science to find out more about how people feel about tap water and learn about the reasons behind this loss in trust in the municipal water supply?

This tap water sentiment analysis estimates the attitudes people have towards tap water by analysing tweets. This article explains how to examine tweets about tap water using the R language for statistical computing and the Tidytext package. The most recent version of the code and the raw data set used in this analysis can be viewed on my GitHub page.

Tap Water Sentiment Analysis

Each tweet that contains the words “tap water” carries a message about the author’s attitude towards that topic. Every text expresses a sentiment about the topic it describes. Sentiment analysis is a data science technique that extracts subjective information from a text. The basic method compares a string of words with a set of words with calibrated sentiments. These calibrated sets are created by asking many people how they feel about a certain word. For example, the word “stink” expresses a negative sentiment, while the word “nice” expresses a positive sentiment.
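
The following minimal sketch illustrates this lookup idea with the Bing lexicon that ships with the Tidytext package (introduced later in this article); it assumes the tidytext and dplyr packages are installed.

# Sketch: look up two example words in a calibrated sentiment lexicon
library(dplyr)
library(tidytext)

tibble(word = c("stink", "nice")) %>%
  inner_join(get_sentiments("bing"), by = "word")
# Words found in the lexicon are returned with their calibrated sentiment,
# e.g. "stink" as negative and "nice" as positive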

This tap water sentiment analysis consists of three steps. The first step extracts 1000 tweets that contain the words “tap water” from Twitter. The second step cleans the data, and the third step undertakes the analysis and visualises the results.

Extracting tweets using the TwitteR package

The TwitteR package by Jeff Gentry makes it very easy to retrieve tweets using search criteria. You will need to register an application with Twitter to receive the API keys and tokens. In the code below, the actual values have been removed. Follow the instructions in this article to obtain these codes for yourself. This code snippet calls a private file to load the API codes, extracts the tweets and creates a data frame with a tweet id number and its text.

# Init
library(tidyverse)
library(tidytext)
library(twitteR)

# Extract tap water tweets
source("twitteR_API.R")
setup_twitter_oauth(api_key, api_secret, token, token_secret)
tapwater_tweets <- searchTwitter("tap water", n = 1000, lang = "en") %>%
  twListToDF() %>%
  select(id, text)
tapwater_tweets <- subset(tapwater_tweets, !duplicated(tapwater_tweets$text))
tapwater_tweets$text <- gsub("’", "'", tapwater_tweets$text)
write_csv(tapwater_tweets, "Hydroinformatics/tapwater_tweets.csv")

When I first extracted these tweets, a tweet by CNN about tap water in Kentucky that smells like diesel had been retweeted many times, so I removed all duplicate tweets from the set. Unfortunately, this left fewer than 300 original tweets in the corpus.

Sentiment analysis with Tidytext

Text analysis can be a powerful tool to help analyse large amounts of text. The R language has an extensive range of packages to help you undertake such a task. The Tidytext package extends the Tidy Data logic promoted by Hadley Wickham and his Tidyverse software collection.

Data Cleaning

The first step in cleaning the data is to create unigrams, which involves splitting the tweets into individual words that can be analysed. The next step is to look at which words are most commonly used in the tap water tweets and to visualise the result.

# Tokenisation
tidy_tweets <- tapwater_tweets %>%
  unnest_tokens(word, text)

data(stop_words)
tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words) %>%
  filter(!word %in% c("tap", "water", "rt", "https", "t.co", "gt", "amp", as.character(0:9)))

tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col(fill = "dodgerblue4") +
    xlab(NULL) + coord_flip() + ggtitle("Most common words in tap water tweets")
ggsave("Hydroinformatics/tapwater_words.png", dpi = 300)

Most common words in tap water sentiment analysis

The most common words relate to drinking the water and to bottled water, which makes sense. The recent issues in Kentucky also feature in this list.

Sentiment Analysis

The Tidytext package gives access to three lexicons of thousands of single English words (unigrams) that were manually assessed for their sentiment. The principle of the sentiment analysis is to compare the words in the text with the words in the lexicon and analyse the results. For example, the statement “This tap water tastes horrible” has a sentiment score of -3 in the AFINN lexicon by Finn Årup Nielsen due to the word “horrible”. In this analysis, I have used the Bing lexicon developed by Bing Liu and collaborators (Liu et al., 2005).
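
The AFINN example can be checked with a few lines of code. This is only a sketch: the AFINN lexicon is distributed via the textdata package, so get_sentiments("afinn") prompts a one-off download, and the packages loaded at the start of this article are reused.

# Sketch: score the example sentence with the AFINN lexicon
tibble(text = "This tap water tastes horrible") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarise(sentiment = sum(value))  # numeric column is named value in current releases
# Only "horrible" matches the lexicon, giving a total score of -3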

# Sentiment analysis
sentiment_bing <- tidy_tweets %>%
  inner_join(get_sentiments("bing"))

sentiment_bing %>%
  summarise(negative = sum(sentiment == "negative"),
            positive = sum(sentiment == "positive"))

sentiment_bing %>%
  group_by(sentiment) %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  ggplot(aes(word, n, fill = sentiment)) + geom_col(show.legend = FALSE) + 
    coord_flip() + facet_wrap(~sentiment, scales = "free_y") + 
    ggtitle("Contribution to sentiment") + xlab(NULL) + ylab(NULL)
ggsave("Hydroinformatics/tapwater_sentiment.png", dpi = 300)

This tap water sentiment analysis shows that two-thirds of the words that express a sentiment were negative. The most common negative words were “smells” and “scared”. This analysis is not a positive result for water utilities. Unfortunately, most tweets were not spatially located so I couldn’t determine the origin of the sentiment.
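
The negative share quoted above can be computed directly from the sentiment_bing data frame. This is a small sketch that reuses the objects created in the code above.

# Sketch: proportion of sentiment words that are negative versus positive
sentiment_bing %>%
  count(sentiment) %>%
  mutate(share = n / sum(n))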

Tap Water sentiment analysis

Sentiment analysis is an interesting explorative technique, but it should not be interpreted as absolute truth. This method is not able to detect sarcasm or irony, and words don’t always have the same meaning as described in the dictionary.

The important message for water utilities is that they need to start taking the aesthetic properties of tap water as seriously as the health parameters. A lack of trust will drive consumers to bottled water or to less healthy alternatives such as soft drinks.

If you would like to know more about customer perceptions of tap water, then read my book Customer Experience Management for Water Utilities, published by IWA Publishing.


Euler Problem 8: Largest Product in a Series

Euler Problem 8 is a combination of mathematics and text analysis. The problem is defined as follows:

Euler Problem 8 Definition

The four adjacent digits in the 1,000-digit number below that have the greatest product are 9 × 9 × 8 × 9 = 5832.

73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450

Find the thirteen adjacent digits in the 1,000-digit number that have the greatest product. What is the value of this product?

Solution

The first step is to define the digits as a character string. The answer is found by cycling through all 13-character n-grams to find the highest product using the prod function.

# Define digits
digits <- "7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450"

ngram <- 13  # Define length
answer <- 0
# Cycle through the digits
for (i in 1:(nchar(digits) - ngram + 1)) {
    # Pick 13 consecutive digits
    adjacent <- substr(digits, i, i + ngram - 1)
    # Multiply the individual digits
    mult <- prod(as.numeric(unlist(strsplit(adjacent, ""))))
    # Keep the largest product found so far
    if (mult > answer)
        answer <- mult
}
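
For comparison, the same search can be written in a more vectorised style. This sketch assumes the digits string and ngram length defined above.

# Vectorised sketch: convert the string to a numeric vector once,
# then compute the product of every window of 13 digits
digit_vec <- as.numeric(strsplit(digits, "")[[1]])
products <- sapply(seq_len(length(digit_vec) - ngram + 1),
                   function(i) prod(digit_vec[i:(i + ngram - 1)]))
max(products)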

Finding n-grams is a basic task in text analysis called tokenisation. The Google Ngram Viewer provides the ability to search for word strings (n-grams) in the massive Google Books collection. This type of data can be used to define writing styles and analyse the evolution of language.
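
As a short illustration of the same idea applied to words, Tidytext extracts n-grams through the token argument of unnest_tokens(). This sketch uses a made-up sentence and assumes the tidytext and dplyr packages are installed.

# Sketch: word bigrams with tidytext
library(dplyr)
library(tidytext)

tibble(text = "tap water is safe to drink") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Returns "tap water", "water is", "is safe", "safe to", "to drink"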

You can also view this code on GitHub.