Percentile calculations can be trickier than they first appear. A percentile is the value below which a given percentage of observations fall. Some percentiles have special names, such as the quartile or the decile, both of which are quantiles. This deceptively simple definition hides the various ways to determine this number. Unfortunately, there is no standard definition for percentiles, so which method do you use?

The quantile function in R generates sample percentiles corresponding to the given probabilities. By default, the quantile function provides the quartiles and the minimum and maximum values. The code snippet below generates pseudo-random data, plots the histogram and shades the area between the median and the third quartile.

    set.seed(1969)
    test.data <- rnorm(n = 10000, mean = 100, sd = 15)
    library(ggplot2)
    ggplot(as.data.frame(test.data), aes(test.data)) +
        geom_histogram(binwidth = 1, aes(y = ..density..), fill = "dodgerblue") +
        geom_line(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15),
                  colour = "red", size = 1) +
        geom_area(stat = "function", fun = dnorm, args = list(mean = 100, sd = 15),
                  colour = "red", fill = "red", alpha = 0.5,
                  xlim = quantile(test.data, c(0.5, 0.75))) +
        theme(text = element_text(size = 16))

Calling the quantile function with its defaults, and then asking for the 95th percentile, gives the following results:

    > quantile(test.data)
           0%       25%       50%       75%      100%
     39.91964  89.68041 100.16437 110.01910 153.50195
    > quantile(test.data, probs=0.95)
         95%
    124.7775
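The `probs` argument also accepts a vector, so other named quantiles, such as the deciles mentioned earlier, take a single call. A minimal sketch, regenerating the same simulated data so that it runs on its own (the variable names `deciles` and `median.check` are only for illustration):

```r
# Regenerate the simulated data from the snippet above
set.seed(1969)
test.data <- rnorm(n = 10000, mean = 100, sd = 15)

# Deciles: the 10th, 20th, ..., 90th percentiles
deciles <- quantile(test.data, probs = seq(0.1, 0.9, by = 0.1))
print(deciles)

# The fifth decile should match the median
median.check <- all.equal(unname(deciles[5]), median(test.data))
print(median.check)
```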

## Methods of percentile calculations

The quantile function in R provides nine different ways to calculate percentiles. Types 1 to 3 are discontinuous, while types 4 to 9 each interpolate between observed values in a different way. I will not discuss the mathematical nuances between these methods; Hyndman and Fan (1996) provide a detailed overview.

The differences between the nine available methods mainly matter for small samples and skewed distributions, such as water quality data. For the large normal sample simulated above, all methods give the same outcome at the displayed precision, as illustrated by the following code.

    > sapply(1:9, function(m) quantile(test.data, 0.95, type = m))
         95%      95%      95%      95%      95%      95%      95%      95%      95%
    124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775 124.7775
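With a small, skewed sample the picture changes. The sketch below draws 52 values (one per week, a hypothetical sampling frequency) from a log-normal distribution and compares the nine types. The log-normal parameters are an assumed stand-in for positively skewed water quality data, not a real data set.

```r
# Hypothetical data: 52 weekly results from a right-skewed distribution
set.seed(1969)
skewed.data <- rlnorm(n = 52, meanlog = -1, sdlog = 0.5)

# 95th percentile for each of the nine quantile types
p95 <- sapply(1:9, function(m) {
    quantile(skewed.data, 0.95, type = m, names = FALSE)
})
names(p95) <- paste0("R", 1:9)
print(round(p95, 3))

# Spread between the lowest and the highest estimate
print(diff(range(p95)))
```

With only 52 observations, each type picks or interpolates a slightly different position in the upper tail, so the estimates no longer coincide.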

## Percentile calculations in water quality

The Australian Drinking Water Quality Guidelines (November 2016) specify that: “based on aesthetic considerations, the turbidity should not exceed 5 NTU at the consumer’s tap”. The Victorian Safe Drinking Water Regulations (2015) relax this requirement and require that:

“The 95th percentile of results for samples in any 12 month period must be less than or equal to 5.0 NTU.”

The Victorian regulators also specify that the percentile should be calculated with the *Weibull Method*. This requirement raises two questions: What is the Weibull method? How do you implement this requirement in R?

The term Weibull Method is a bit confusing as this is not a name used by statisticians. In Hyndman & Fan (1996), this method has the less poetic name $\hat{Q}_6(p)$, which corresponds to type 6 in R. Waloddi Weibull, a Swedish engineer famous for his distribution, was one of the first to describe this method. Only the regulator in Victoria uses that name, which is based on McBride (2005). This theoretical interlude aside, how can we practically apply this to water quality data?

In case you are interested in how the Weibull method works, the *weibull.quantile* function shown below calculates a quantile *p* for a vector *x* using this method. This function gives the same result as *quantile(x, p, type=6)*.

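    weibull.quantile <- function(x, p) {
        # Order samples from small to large
        x <- x[order(x, decreasing = FALSE)]
        n <- length(x)
        # Determine ranking of percentile according to Weibull (1939)
        r <- p * (n + 1)
        # Keep the rank within the observed range, as quantile(type = 6) does
        r <- min(max(r, 1), n)
        # Linear interpolation between the two neighbouring ordered values
        rfrac <- r - floor(r)
        lower <- x[floor(r)]
        upper <- x[min(floor(r) + 1, n)]
        return((1 - rfrac) * lower + rfrac * upper)
    }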
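To make the arithmetic concrete: with 52 weekly samples, the 95th percentile has rank r = 0.95 × (52 + 1) = 50.35, so the result lies 35% of the way from the 50th to the 51st ordered value. The sketch below works through this by hand and compares the outcome with type 6. The log-normal values are simulated for illustration and are not real measurements.

```r
# Hypothetical data: 52 simulated weekly turbidity results (NTU), sorted
set.seed(42)
x <- sort(rlnorm(n = 52, meanlog = -1.5, sdlog = 0.4))

p <- 0.95
r <- p * (length(x) + 1)   # Weibull rank: 0.95 * 53 = 50.35
j <- floor(r)              # lower neighbour: the 50th ordered value
g <- r - j                 # interpolation weight: 0.35

# Interpolate between the 50th and 51st ordered values
by.hand <- (1 - g) * x[j] + g * x[j + 1]

# Compare with the built-in implementation
built.in <- quantile(x, p, type = 6, names = FALSE)
print(c(by.hand = by.hand, built.in = built.in))
print(all.equal(by.hand, built.in))
```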

## Turbidity Data Example

Turbidity data is not normally distributed as it is always larger than zero. In this example, the turbidity results for the year 2016 for the water system in Tarnagulla are used to illustrate the percentile calculations. The weekly turbidity measurements range between 0.05 NTU and 0.8 NTU, well below the aesthetic limit.

When we calculate the percentiles for all nine methods available in the base-R function, we see that the so-called Weibull method (R6 in the table) generally provides the most conservative result, that is, the highest estimate.

Zone | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 |
---|---|---|---|---|---|---|---|---|---|
Bealiba | 0.300 | 0.300 | 0.200 | 0.240 | 0.290 | 0.300 | 0.245 | 0.300 | 0.300 |
Dunolly | 0.40000 | 0.40000 | 0.30000 | 0.34000 | 0.39000 | 0.43500 | 0.34500 | 0.40500 | 0.40125 |
Laanecoorie | 0.50000 | 0.50000 | 0.40000 | 0.44000 | 0.49000 | 0.53500 | 0.44500 | 0.50500 | 0.50125 |
Tarnagulla | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |

The graph and the table were created with the following code snippet:

    ggplot(turbidity, aes(Result)) +
        geom_histogram(binwidth = .05, fill = "dodgerblue", aes(y = ..density..)) +
        facet_wrap(~Zone) +
        theme(text = element_text(size = 16))
    tapply(turbidity$Result, turbidity$Zone,
           function(x) sapply(1:9, function(m) quantile(x, 0.95, type = m)))
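To test against the Victorian requirement, the type 6 percentile for each zone can be compared with the 5.0 NTU limit. A sketch, assuming a data frame with the same `Zone` and `Result` columns as above. The data frame `turbidity.sim` is simulated and the zone names `A` and `B` are placeholders; none of it is the Tarnagulla data.

```r
# Hypothetical stand-in for the turbidity data frame (simulated, not real results)
set.seed(2016)
turbidity.sim <- data.frame(
    Zone = rep(c("A", "B"), each = 52),
    Result = round(rlnorm(104, meanlog = -1.5, sdlog = 0.4), 2)
)

limit <- 5.0  # NTU, the 95th percentile limit quoted from the regulations

# 95th percentile per zone with the Weibull method (type 6)
p95 <- tapply(turbidity.sim$Result, turbidity.sim$Zone,
              quantile, probs = 0.95, type = 6, names = FALSE)

# One row per zone with the percentile and a pass/fail flag
compliance <- data.frame(
    Zone = names(p95),
    P95 = as.numeric(p95),
    Compliant = as.numeric(p95) <= limit
)
print(compliance)
```

Passing `type = 6` explicitly matters here, because the default in R is type 7, which gives a lower estimate in the upper tail.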

You can view the code on GitHub.

Great article.

Just wanted to mention my blog on plotting percentiles for a time series of values, which could be adapted to use the ‘Weibull’ method.

https://tonyladson.wordpress.com/2016/02/22/plotting-percentiles/

Hi Tony,

Your method is interesting because it fits percentiles rather than calculating the actual values. Thanks for bringing this to my attention, this method could be useful in my work one day.

I have taken the liberty to simplify your code a bit using only two packages.

Wouldn’t the next step be to use confidence intervals for the percentile, e.g., ensure that the upper limit of a CI for the 95% quantile is below 5 NTU?

See http://staff.math.su.se/hoehle/blog/2016/10/23/quantileCI.html

That is an interesting question.

The previous version of the regulations asked for this statistic. The 2005 regulations used to specify that: “95% upper confidence limit of the mean of samples of drinking water collected in any 12 month period must be less than or equal to 5.0 NTU”.

However, the vast majority of water quality standards specify a maximum limit for any parameter. I wrote this post because the current regulations are very specific about the method of calculating a percentile.
