Wastewater-Based Epidemiology of Stimulant Drugs: Functional Data Analysis Compared to Traditional Statistical Methods

Stefania Salvatore; Jørgen Gustav Bramness; Malcolm J. Reid; Kevin Victor Thomas; Christopher Harman; Jo Røislien

doi:10.1371/journal.pone.0138669

Abstract

Background

Wastewater-based epidemiology (WBE) is a new methodology for estimating the drug load in a population. Simple summary statistics and specification tests have typically been used to analyze WBE data, comparing differences between weekday and weekend loads. Such standard statistical methods may, however, overlook important nuanced information in the data. In this study, we apply functional data analysis (FDA) to WBE data and compare the results to those obtained from more traditional summary measures.

Methods

We analysed temporal WBE data from 42 European cities, using sewage samples collected daily for one week in March 2013. For each city, the main temporal features of two selected drugs were extracted using functional principal component (FPC) analysis, along with simpler measures such as the area under the curve (AUC). The individual cities’ scores on each of the temporal FPCs were then used as outcome variables in multiple linear regression analysis with various city and country characteristics as predictors. The results were compared to those of functional analysis of variance (FANOVA).

Results

The first three FPCs explained more than 99% of the temporal variation. The first component (FPC1) represented the level of the drug load, while the second and third temporal components represented the level and the timing of a weekend peak. AUC was highly correlated with FPC1, but other temporal characteristics were not captured by the simple summary measures. FANOVA was less flexible than the FPCA-based regression, although the two approaches gave concordant results. Geographical location was the main predictor for the general level of the drug load.

Conclusion

FDA of WBE data extracts more detailed information about drug load patterns during the week which are not identified by more traditional statistical methods. Results also suggest that regression based on FPC results is a valuable addition to FANOVA for estimating associations between temporal patterns and covariate information.

Citation: Salvatore S, Bramness JG, Reid MJ, Thomas KV, Harman C, Røislien J (2015) Wastewater-Based Epidemiology of Stimulant Drugs: Functional Data Analysis Compared to Traditional Statistical Methods. PLoS ONE 10(9): e0138669. https://doi.org/10.1371/journal.pone.0138669

Editor: David O. Carpenter, Institute for Health & the Environment, UNITED STATES

Received: May 27, 2015; Accepted: September 2, 2015; Published: September 22, 2015

Copyright: © 2015 Salvatore et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data sets supporting the study are freely available (DOI: 10.1111/add.12570).

Funding: This study is funded by the EU-International Training Network SEWPROF (Marie Curie-FP7-PEOPLE Grant #317205) and the Norwegian Centre for Addiction Research (SERAF).

Competing interests: The authors have declared that no competing interests exist.

Introduction

Illicit drug use is a growing global health concern, and it is estimated that around a quarter of the European adult population has used illicit drugs at some point in their lives [1]. In Europe, central nervous system stimulants such as amphetamine and ecstasy (MDMA) are among the most commonly used illicit drugs [1]. These drugs may cause appetite suppression and euphoria with feelings of increased confidence, sociability and energy, making them popular drugs of abuse, particularly among the young [2]. Stimulant use has, however, numerous negative effects, such as insomnia, anxiety, mood disturbance, violent behaviour, dependence and psychosis, making it a public health concern [3].

Because of this considerable health risk, reliable estimates of the extent of drug use in a population are important for health professionals and policy makers. Traditionally, estimates of the consumption of stimulants are calculated from data collected from sources such as treatment programmes [4], hospital emergency departments [5, 6], drivers apprehended by the police [7, 8], prisoners [9] and from population surveys (e.g., internet, population, school) [10]. These types of data, however, have their limitations, mostly related to difficulties in capturing representative survey populations. General population surveys may have poor response rates and there is often unwillingness to supply information about an activity that may have a social stigma or legal implications [10]. Further, while data from drug treatment programmes may underestimate prevalence because of limited places in treatment, data gathered from the police may overestimate prevalence as investigations are targeted towards selected populations [5–9].

Wastewater-based epidemiology (WBE) is an alternative and complementary approach for estimating the collective illicit drug use in a community [11]. The concentration of various illicit drugs in the wastewater can be measured directly, overcoming the problems related to surveys and sampling bias. WBE has shown promising results at local, national and international levels [11–13], and analyses of wastewater data have indicated differences in drug loads detected in wastewater on weekdays and at weekends [14–16]. However, as WBE is a relatively new research field, data are often analysed using simple statistical methods which do not take the temporal nature of the data fully into account, potentially overlooking important information. The aim of this study was to move beyond the simple statistical analyses often applied to wastewater-based data, in order to explore whether more advanced statistical methods can extract more information about the patterns of stimulant use.

We reanalysed a WBE dataset on 42 European cities [17] using the framework of functional data analysis (FDA), a statistical method specifically developed for analyzing temporal data [18], and we compared these results with more traditional statistical analyses. For the purpose of the study, we selected two drugs with different patterns throughout the week; ecstasy (MDMA) which is mostly a “party drug” with high expected weekend loads, and amphetamine which is expected to be used more regularly throughout the week [13]. The main temporal features for the illicit drugs throughout the course of a week were estimated using functional principal component analysis (FPCA). FPCA has recently been applied for improved statistical analysis of glucose regulation [19] and monitoring of fetal movement [20] among other things. In order to explore whether differences in temporal drug loads in the wastewater between cities could be related to various geographic or other urban characteristics, we performed both functional analysis of variance (FANOVA) as well as multiple regression analyses on the FPCA results.

Data Material

No specific permissions were required for the present study. The use of wastewater data to study trends in illicit drug use in large catchment areas does not raise any major ethical issues as individuals cannot be identified, and thus cannot be harmed by such a study. The ethics of this approach has been thoroughly discussed in a previous paper [21].

Raw sewage was collected from the inlet of 47 sewage treatment plants in 42 cities from 21 European countries, servicing a combined population of approximately 24.7 million inhabitants. Samples were collected from each location over seven consecutive days, starting for 36 of the 42 cities on Wednesday 6th March 2013 and ending on Tuesday 12th March 2013. For the remaining six cities sampling during this week was not possible, and a different week in the same month was chosen. At all locations, automated sampling devices were used to collect subsamples over 24 hours. These subsamples were then pooled into a 24-hour composite sample. For cities with more than one sewage treatment plant, results were combined into a city average using a weighted mean. More background information, including details of wastewater treatment plant (WWTP) characteristics, can be found in an earlier publication based on the same material [17]. The data sets supporting the results of this article are also freely available [17]. For this study, we selected two drugs with very different consumption patterns: ecstasy (MDMA), which is mostly a party drug, and amphetamine, which is mostly used in more regular amounts throughout the week [13].

A specifically tailored questionnaire was developed in cooperation with local sewage and treatment plant operators in order to evaluate information about the structure and state of the sewer and the variability of the population size [17]. For the purpose of this analysis, daily mass loads were normalized by the population size of the catchment (mg/10000 people/day). Moreover, concentrations for each drug below the limit of quantification (LOQ) were replaced by LOQ/2 [22] if at least one day in the week had a concentration value above the LOQ. Cities with no measurements above LOQ were excluded. Four cities (9.5%) were excluded for ecstasy (MDMA) and nine cities (21.4%) were excluded for amphetamine. Information on the characteristics of each city and WWTP is provided as supplementary material (S1 Table).
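The censoring rule and the population normalization described above can be sketched in Python as follows. This is an illustration only and not the authors' code (the study itself was carried out in R); the function names, the use of `None` to flag an excluded city, and the example numbers are assumptions introduced here.

```python
from typing import List, Optional


def substitute_below_loq(concentrations: List[float], loq: float) -> Optional[List[float]]:
    """Replace values below the LOQ by LOQ/2.

    Returns None when no day of the week is above the LOQ, which is
    the case in which the city is excluded from the analysis.
    """
    if not any(c > loq for c in concentrations):
        return None
    return [c if c >= loq else loq / 2 for c in concentrations]


def normalise_load(load_mg_per_day: float, population: int) -> float:
    """Express a daily mass load as mg per 10 000 people per day."""
    return load_mg_per_day * 10_000 / population


# Hypothetical week of concentrations with an LOQ of 5.0 (same unit).
week = [12.0, 3.0, 15.0, 20.0, 25.0, 4.0, 11.0]
cleaned = substitute_below_loq(week, loq=5.0)
```

In this hypothetical week the two values below 5.0 become 2.5, while a week in which every value is at or below the LOQ returns `None`.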

Statistical Analysis

Data description

The unit of observation in the analyses is a seven day week starting Wednesday and ending Tuesday. For six (14.3%) cities, the data sampling started later in the week. Missing data for the two drugs across all the 42 cities ranged from 1.7% to 2.2%. This is low [23], but since functional data analysis (FDA) requires complete datasets we performed single imputation [24] before proceeding with the analysis on the imputed dataset.

The drug loads for weekdays (Mon-Fri) and weekends (Sat-Sun) were described using median and quartiles (Q1, Q3). Wastewater drug load data was heavily skewed, and the data was log-transformed prior to further analysis.
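A minimal Python sketch of this descriptive step is given below. It assumes that each city's seven loads are stored in sampling order (Wednesday to Tuesday) and that the natural logarithm is used; neither detail is stated in the text, and the function names are hypothetical. Quartile conventions also differ between software packages, so the values may not match those of any particular R function.

```python
import math
import statistics

DAYS = ["Wed", "Thu", "Fri", "Sat", "Sun", "Mon", "Tue"]  # order of the sampling week
WEEKEND = {"Sat", "Sun"}


def split_week(loads):
    """Split seven daily loads (ordered as DAYS) into weekday and weekend values."""
    weekdays = [v for day, v in zip(DAYS, loads) if day not in WEEKEND]
    weekend = [v for day, v in zip(DAYS, loads) if day in WEEKEND]
    return weekdays, weekend


def quartiles(values):
    """Return (Q1, median, Q3) using the 'inclusive' quantile convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


def log_transform(loads):
    """Natural log of strictly positive loads (the base is an assumption)."""
    return [math.log(v) for v in loads]
```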

Traditional data analysis

For the log-transformed data for each city we calculated the overall mean throughout the week, the area under the curve (AUC) and the difference d between weekdays and weekends. The significance of the latter was assessed using the Wilcoxon test.
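The three summary measures can be sketched for a single city as below. The text does not say how the AUC was computed or which sign convention was used for d, so the trapezoidal rule with unit day spacing and the definition d = weekend mean minus weekday mean are assumptions, as is the function name. The Wilcoxon test itself is not reproduced here.

```python
def weekly_summaries(log_loads):
    """Return (mean, auc, d) for seven log-loads ordered Wed, Thu, ..., Tue."""
    if len(log_loads) != 7:
        raise ValueError("expected exactly seven daily values")
    mean = sum(log_loads) / 7
    # Trapezoidal rule with one day between consecutive observations.
    auc = sum((a + b) / 2 for a, b in zip(log_loads, log_loads[1:]))
    weekend = log_loads[3:5]                  # Sat, Sun
    weekdays = log_loads[:3] + log_loads[5:]  # Wed, Thu, Fri, Mon, Tue
    d = sum(weekend) / 2 - sum(weekdays) / 5  # assumed sign: weekend minus weekday
    return mean, auc, d


# Hypothetical city with a clear weekend peak on the log scale.
mean, auc, d = weekly_summaries([1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 1.0])
```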

Functional data analysis

The temporal pattern of wastewater drug loads was analyzed using FDA [18]. In FDA, mathematical functions are first fitted to the observations. Statistical analysis is then performed on the fitted functions rather than the original data. The seven consecutive observations for each of the 42 European cities were treated as discrete samples of an underlying continuous process, and were converted into 42 continuous smooth curves using B-splines with seven basis functions [18, 25]. The optimal smoothing of the functions was estimated using the generalized cross validation (GCV) criterion [26] with a single choice of smoothing parameter for all cities [27]. The smoothing parameter was defined as a proportion of the integrated squared second derivative of the fitted curves [25]. This smoothing removes the random day-to-day variation, e.g. non-systematic error, measurement error and normal fluctuations in the load of drugs, and so extracts the underlying temporal behaviour.
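In the FDA literature [18, 25], roughness-penalized smoothing of this kind is commonly written as the minimization below. The symbols (y_{ij} for the log-load of city i on day t_j, x_i for the fitted curve, \phi_k for the B-spline basis functions, \lambda for the smoothing parameter, S_\lambda for the smoothing matrix and n for the number of observations per curve) are introduced here for illustration; the exact criterion used by the authors may differ in detail.

```latex
\mathrm{PENSSE}_\lambda(x_i)
  = \sum_{j=1}^{7} \bigl[\, y_{ij} - x_i(t_j) \,\bigr]^2
  + \lambda \int \bigl[\, D^2 x_i(t) \,\bigr]^2 \, dt ,
\qquad
x_i(t) = \sum_{k=1}^{7} c_{ik}\, \phi_k(t)

\mathrm{GCV}(\lambda)
  = \frac{ n^{-1}\, \mathrm{SSE}(\lambda) }
         { \bigl[\, n^{-1} \operatorname{trace}( I - S_\lambda ) \,\bigr]^2 }
```

A large \lambda forces the second derivative towards zero and so pulls each curve towards a straight line, while \lambda close to zero lets the curve follow the seven observations closely. GCV picks a value between these extremes.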

Functional principal component analysis

Principal component analysis (PCA) is a statistical methodology which is used to reveal the internal structure of the data in order to explain variability [28]. Functional principal component analysis (FPCA) is a generalization of traditional PCA to functional data [18]. A common practice in FPCA is to first normalize the data, that is, to first subtract the mean, as the mean curve is a mode of variation that tends to be shared by most curves [25]. However, as we were interested in the mean temporal differences in drug loads between cities, and also wanted to compare the results with traditional statistical methods, we did not normalize the data. The percentage of explained variation for each FPC thus cannot be interpreted in the same way as for PCA on normalized data.

We used FPCA to identify the main temporal patterns across the 42 fitted smooth curves. The result of an FPCA is a set of mutually uncorrelated functional principal component (FPC) curves, which explain the main modes of temporal variability across the fitted curves for all cities. The analysis further provides each city with a score for each of the extracted FPC curves, representing the intensity with which that particular temporal pattern is present in the fitted function for that particular city. Cities with close to zero scores on all FPCs have temporal drug loads similar to the overall mean curve, while cities with a high score on a particular FPC have a temporal drug load closer to that specific FPC pattern. Each estimated FPC was interpreted and labelled according to the temporal information it exhibited.
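The idea of scores on an uncentred principal component can be illustrated with a discretized stand-in for FPCA. The sketch below is not the authors' procedure, which operated on B-spline functions in the R fda package; it treats each curve as a vector of values on a common grid and finds the leading component by power iteration. The function name and the toy data are assumptions made here.

```python
import math


def leading_component(curves, iterations=200):
    """First principal component of uncentred curves sampled on a common grid.

    Returns (component, scores): a unit-length vector over the grid and
    one score per curve. The mean curve is deliberately not subtracted.
    """
    n_points = len(curves[0])
    # Uncentred cross-product matrix between grid points.
    cross = [[sum(c[a] * c[b] for c in curves) for b in range(n_points)]
             for a in range(n_points)]
    # Power iteration; the all-ones start works for positive-valued curves.
    vec = [1.0] * n_points
    for _ in range(iterations):
        new = [sum(cross[a][b] * vec[b] for b in range(n_points))
               for a in range(n_points)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    scores = [sum(x * v for x, v in zip(c, vec)) for c in curves]
    return vec, scores


# Three hypothetical flat curves that differ only in level.
component, scores = leading_component([[1.0] * 7, [2.0] * 7, [3.0] * 7])
```

With curves that differ only in level, the leading component is flat and the scores are ordered by level, which is the behaviour one would expect of a first component when the mean is not removed.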

The association between the more traditional statistical measures of wastewater drug loads and the FPCA was assessed by calculating the Pearson correlation coefficient (r) between the FPC scores, the overall mean of the log-transformed data, AUC and the difference d between weekdays and weekend means.
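For reference, the Pearson correlation coefficient between two lists of city-level values (for example FPC1 scores and AUC) can be computed as in the short sketch below. The function name is hypothetical, and the sketch assumes both inputs have the same length and non-zero variance.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```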

Functional analysis of variance

To move beyond mere exploration of patterns, we wanted to see whether the various temporal patterns of wastewater drug loads throughout a week were associated with basic characteristics of the city: latitude; longitude; gross domestic product (GDP) of the country; relative size of the city, i.e., number of inhabitants in the city divided by the number of inhabitants in the country; and density of the city, i.e., number of inhabitants in the city divided by the urban area of the city.

Traditional analysis of variance (ANOVA) explores the mean difference in a continuous response between the various categories in a categorical explanatory variable [29]. Functional analysis of variance (FANOVA) is the generalization of ANOVA to functional outcomes [18], and is often the suggested approach when exploring covariates in FDA [18]. We used FANOVA to analyze the effect of the five possible predictors, listed above, on the shape of the wastewater drug load curves [25]. As FANOVA must have categorical covariates we had to dichotomize each of the continuous explanatory variables and compared the mean curves in the two resulting groups. We explored the impact of cut-off point by selecting cut-off points across the whole observed range of the covariates, and considered p-values <0.05 to be statistically significant. Functional confidence intervals (95%) and p-value curves, as well as an overall p-value, were calculated for each covariate using a functional permutation F-test [25].
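The logic of dichotomizing a covariate and running a permutation F-test can be sketched as follows. This is a simplified illustration on curves sampled at discrete days, not the fda package routine used in the study: it uses the classical two-group F statistic at each time point and takes the maximum over the week as the overall statistic, and the statistic in the fda package may be defined differently. All names, the number of permutations and the toy data are assumptions, and the sketch assumes non-zero within-group variance at every time point.

```python
import random


def pointwise_f(group_a, group_b):
    """One-way ANOVA F statistic at each time point for two groups of curves."""
    n_a, n_b = len(group_a), len(group_b)
    f_values = []
    for t in range(len(group_a[0])):
        a = [curve[t] for curve in group_a]
        b = [curve[t] for curve in group_b]
        mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
        grand = (sum(a) + sum(b)) / (n_a + n_b)
        between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
        within = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
        f_values.append(between / (within / (n_a + n_b - 2)))
    return f_values


def permutation_p_value(curves, covariate, cutoff, n_perm=999, seed=0):
    """Overall p-value based on the maximum pointwise F over the week."""
    labels = [value > cutoff for value in covariate]  # dichotomize at the cut-off

    def max_f(current_labels):
        high = [c for c, flag in zip(curves, current_labels) if flag]
        low = [c for c, flag in zip(curves, current_labels) if not flag]
        return max(pointwise_f(high, low))

    observed = max_f(labels)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # break any link between curves and covariate
        if max_f(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical data: six low-latitude cities with low loads, six high-latitude with high loads.
curves = [[base + 0.1 * i + 0.05 * t for t in range(7)]
          for base in (1.0, 5.0) for i in range(6)]
latitude = [40, 41, 42, 43, 44, 45, 55, 56, 57, 58, 59, 60]
p_value = permutation_p_value(curves, latitude, cutoff=50)
```

Repeating the call over a range of `cutoff` values mirrors the exploration of cut-off points described above.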

Multiple regression

FANOVA can be seen as a univariate ANOVA problem for each specific point in time [18], and thus cannot control for covariates. In order to explore multiple predictors simultaneously, without the need for dichotomization, we used the cities' scores for the estimated FPCs as outcome variables in multiple linear regression models. From the full multiple model, including all covariates, an optimal sub-model was chosen using Akaike's Information Criterion (AIC) [30]. AIC balances model parsimony against fit to the data; it is a measure of the "goodness" of a model [31] and can be used to compare statistical models.
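For reference, the standard definition of AIC and its form for a linear regression with Gaussian errors are given below. The symbols (k for the number of estimated parameters, \hat{L} for the maximized likelihood, n for the number of cities in the model, RSS for the residual sum of squares and C for a constant that does not depend on the model) are introduced here and are not taken from the original text.

```latex
\mathrm{AIC} = 2k - 2 \ln \hat{L}

\mathrm{AIC}_{\text{linear model}} = n \ln\!\left( \frac{\mathrm{RSS}}{n} \right) + 2k + C
```

Among candidate sub-models fitted to the same outcome, the one with the lowest AIC is preferred: adding a covariate is worthwhile only if the drop in n ln(RSS/n) exceeds the penalty of 2 for the extra parameter.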

Robustness analysis

To explore the robustness of the FDA results, i.e. whether temporal patterns would emerge purely by chance due to the nature of the curve fitting process, we also performed all of the above FDA on a dataset obtained by random sorting of the original data.
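The text does not specify how the random sorting was done. One possible reading, sketched below purely as an assumption, is to pool all city-by-day values and redistribute them at random, which destroys both the temporal order and the city structure while keeping the overall distribution of values. The function name and the seed are hypothetical.

```python
import random


def shuffle_table(loads, seed=0):
    """Randomly redistribute all values in a city-by-day table.

    `loads` is a list of rows, one per city, each with the same number of days.
    """
    rng = random.Random(seed)
    flat = [value for row in loads for value in row]
    rng.shuffle(flat)
    n_days = len(loads[0])
    return [flat[i:i + n_days] for i in range(0, len(flat), n_days)]


# Hypothetical table with two cities and seven days.
original = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
shuffled = shuffle_table(original)
```

The same smoothing and FPCA steps would then be applied to the shuffled table to check whether similar components appear by chance.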

Software

All analyses were performed in R 3.1.0 [32]. The imputation was performed using Amelia II and the amelia package [33], and FDA, FPCA and FANOVA using the fda package [25].

Results

Data summary

Wastewater drug loads for the 42 cities throughout the week are shown in Fig 1.1 and 1.2, and summarized in Table 1. The median load of ecstasy (MDMA) increased significantly at the weekend (p<0.001), whereas that of amphetamine did not (p = 0.369). The overall mean of the log-transformed data throughout the seven day week was highly correlated (r = 0.999) with the area under the curve (AUC) (Tables 2 and 3).
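The near-perfect correlation between the overall mean and the AUC is to be expected. Assuming, for illustration, that the AUC is computed with the trapezoidal rule on seven equally spaced daily values y_1, ..., y_7 (Wednesday to Tuesday), which the text does not state, the two measures are linked by a simple identity:

```latex
\mathrm{AUC}
  = \sum_{j=1}^{6} \frac{y_j + y_{j+1}}{2}
  = \sum_{j=1}^{7} y_j - \frac{y_1 + y_7}{2}
  = 7\,\bar{y} - \frac{y_1 + y_7}{2}
```

Under this assumption the AUC is seven times the weekly mean minus half the sum of the first and last observations, so the two summaries carry almost the same information.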


Fig 1. Raw data, individual curves and results from the FPCA for each drug.

Panels 1.1–1.2 show the raw data for each drug; 1.3–1.4 show the raw data (light grey) with the individually fitted curves (dark grey) and the mean of these curves (black); 1.5–1.10 show the mean of the fitted curves (solid line) and how the shape of an individual curve differs from the mean curve if a multiple of the principal component curve is added to (+ +) or subtracted from (- -) the mean curve. The multiples correspond to one SD of the FPC1, FPC2 and FPC3 scores, respectively.

https://doi.org/10.1371/journal.pone.0138669.g001


Table 1. Wastewater drug loads for 42 European cities throughout the week.

https://doi.org/10.1371/journal.pone.0138669.t001


Table 2. Pearson correlation coefficients between FPC scores for the ecstasy (MDMA) loads and simple summary measures.

https://doi.org/10.1371/journal.pone.0138669.t002


Table 3. Pearson correlation coefficients between FPC scores for amphetamine loads and simple summary measures.

https://doi.org/10.1371/journal.pone.0138669.t003