Abstract

To improve the detection rate and reduce the correction error for abnormal water quality data, an outlier detection and correction method is proposed based on improved Variational Mode Decomposition (improved VMD) and the Least Square Support Vector Machine (LSSVM). The correlation coefficient is introduced to solve for the optimal parameter k of the VMD algorithm, which yields the improved VMD algorithm. Combined with the LSSVM algorithm, water quality outliers can then be detected and repaired. The method is applied to the detection and correction of outliers in dissolved oxygen monitoring data retrieved from a water quality monitoring station in Hangzhou, Zhejiang Province, China. The results show that the improved VMD algorithm achieves a higher detection rate and a lower error rate than Empirical Mode Decomposition (EMD) and Ensemble Empirical Mode Decomposition (EEMD). The LSSVM algorithm increases the fitting accuracy and decreases the correction error in comparison with SVM and BP neural network, which provides an important reference for the implementation of environmental protection measures.

1. Introduction

Water is an important national strategic resource, and it has a major influence on economic and social development. In recent years, public awareness of environmental protection has gradually strengthened, and state supervision of water pollution has steadily increased. Water quality monitoring has become an important link in the process of water pollution control [1, 2]. Whether the water quality monitoring data are normal has a significant impact on the implementation of water environmental protection measures. Therefore, it is of great significance to detect and correct outliers in water quality monitoring data [3, 4].

Carlson and Byer applied the Pauta criterion to outlier detection of water quality for the first time, assuming that data deviating from the sample mean by more than three sigma are outliers [5]. This method is simple and intuitive, and it can detect some outliers. However, water quality monitoring data are characterized by high monitoring frequency and large fluctuations; hence, this method has poor accuracy in water quality outlier detection [6]. Building on this research, many scholars have conducted in-depth research on outlier detection according to the characteristics of water quality data. Park S and his team used principal component analysis (PCA) to build a model [7], in which outliers in water quality monitoring data are detected by calculating the residual error of the model. The PCA model reduces the input dimension; however, the analysis results are poor when the correlation between indicators is low. Hou D B's team took the monitoring value of ammonia nitrogen as an example [8] and determined whether the water quality data were normal based on the wavelet transform and a wavelet neural network (RBF); the model improves the detection rate and reduces the error rate to some extent. These methods share a common limitation: the detection effect is poor when the data fluctuate greatly. New methods [9-11] have therefore been applied to abnormal data detection. Using the EMD algorithm to detect water quality outliers, Yang Z L proposed an anomaly detection method based on scale adaptive matching [12]. Water quality anomaly detection is thereby transferred to the time and frequency domain, which provides a new idea for water quality outlier detection. However, the EMD algorithm [13, 14] suffers from mode mixing in the process of signal decomposition, which affects the overall detection effect. Zhang F and his team optimized the outlier detection method for water quality monitoring based on the EEMD algorithm [15]; ensemble empirical mode decomposition is applied to analyze abnormal monitoring data and reduces the mode mixing effect, but the detection rate of abnormal data needs further improvement.

The analysis of previous studies shows that outlier detection for water quality has made great progress, but the detection rate still needs improvement and few scholars have addressed the correction of detected outliers. On the basis of these studies, this paper proposes an outlier detection and correction method for water quality monitoring data based on improved VMD and LSSVM. This method has a higher detection rate and a lower error rate than the EMD and EEMD algorithms. In addition, the paper adds outlier correction by the LSSVM algorithm. Compared with SVM and BP neural network, the LSSVM algorithm improves the fitting accuracy, and the error of the reconstructed data is smaller. Finally, the algorithm package of this paper has been put to engineering use through our independently developed software platform.

2. Experimental Methods

2.1. The VMD Algorithm

The Variational Mode Decomposition (VMD) algorithm [16-18] is a variational problem-solving process based on classical Wiener filtering [19, 20], the Hilbert transform [21, 22], and frequency mixing. It consists of two parts: the construction and the solution of a variational problem. The goal of VMD is to decompose an input signal into a discrete number of subsignals (modes). Suppose that the signal f is decomposed into k modal functions whose bandwidths have the minimum sum, where each mode has limited bandwidth around a central frequency; the constraint is that the sum of the modes equals the input signal f. The constrained variational model is established as follows [23]:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k} u_k(t) = f(t)$$

where $\delta(t)$ is the Dirac distribution, $*$ represents convolution, $u_k$ are the modal components, and $\omega_k$ are the central frequencies of the modal components.

To find the optimal solution of the constrained variational model, it is converted into an unconstrained variational problem by introducing the quadratic penalty factor α and the Lagrange multiplier λ.
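For reference, the unconstrained problem is usually written as the augmented Lagrangian below. This is the standard form from the VMD literature, not something preserved in this text. The symbols are notational assumptions here: $u_k$ for the modes, $\omega_k$ for their central frequencies, $f$ for the input signal, $\alpha$ for the quadratic penalty factor, and $\lambda(t)$ for the Lagrange multiplier.

```latex
\mathcal{L}\left(\{u_k\},\{\omega_k\},\lambda\right)
  = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2
  + \left\| f(t) - \sum_{k} u_k(t) \right\|_2^2
  + \left\langle \lambda(t),\; f(t) - \sum_{k} u_k(t) \right\rangle
```

The first term penalizes the bandwidth of each mode, the second enforces reconstruction fidelity, and the third is the multiplier term for the equality constraint.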

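The VMD algorithm adopts the alternating direction method of multipliers (ADMM) to solve the variational problem by alternately updating the parameters $u_k$, $\omega_k$, and $\lambda$ [24]. The solution procedure is as follows:
(1) Initialize $\{u_k^1\}$, $\{\omega_k^1\}$, $\lambda^1$, and $n$.
(2) Alternately update the parameters $u_k^{n+1}$ and $\omega_k^{n+1}$.
(3) Repeat step (2) until

$$\sum_{k} \frac{\left\| u_k^{n+1} - u_k^{n} \right\|_2^2}{\left\| u_k^{n} \right\|_2^2} < \varepsilon$$

where $\varepsilon$ is the convergence tolerance. Then stop updating and obtain k intrinsic mode functions (IMFs) [17].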
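For reference, the alternating updates are commonly written in the frequency domain as shown below. This is the standard form from the VMD literature, not a reconstruction of equations from this text. The hat notation for Fourier transforms and the dual step size $\tau$ are assumptions introduced here.

```latex
\hat{u}_k^{\,n+1}(\omega) =
  \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}
       {1 + 2\alpha\,(\omega - \omega_k^{n})^2}

\omega_k^{\,n+1} =
  \frac{\int_0^{\infty} \omega \,\bigl|\hat{u}_k^{\,n+1}(\omega)\bigr|^2 \, d\omega}
       {\int_0^{\infty} \bigl|\hat{u}_k^{\,n+1}(\omega)\bigr|^2 \, d\omega}

\hat{\lambda}^{n+1}(\omega) =
  \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k} \hat{u}_k^{\,n+1}(\omega) \right)
```

The mode update acts as a Wiener filter centered at $\omega_k$, and the central frequency update is the center of gravity of the mode's power spectrum. A larger $\alpha$ therefore gives narrower modes.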

2.2. The LSSVM Algorithm

The LSSVM (Least Square Support Vector Machine) algorithm is an improved version of SVM (Support Vector Machine). LSSVM is a statistical learning method that adopts a least squares loss function [25]. It transforms the inequality constraints into equality constraints, takes the sum of squared errors as the empirical loss on the training set, and thereby turns the empirical risk from the first power to the second power. As a result, the quadratic programming problem is transformed into a set of linear equations that is solved by the least square method. Both the speed of solution and the accuracy of convergence are improved [26].

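Suppose there is a nonlinear sample set $S = \{(x_i, y_i)\}_{i=1}^{l}$, where the training sample $x_i \in \mathbb{R}^n$ is an n-dimensional vector and $y_i \in \mathbb{R}$ is the corresponding output. The samples are mapped from the original space $\mathbb{R}^n$ to the feature space by a nonlinear mapping $\varphi(\cdot)$. The optimal decision function is constructed in this high-dimensional feature space:

$$f(x) = w^{T}\varphi(x) + b$$

The structural risk minimization principle is used to solve for the parameters $w$ and $b$. The function fitting problem becomes the following optimization problem:

$$\min_{w,b,e} J(w,e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{l} e_i^2 \quad \text{s.t.} \quad y_i = w^{T}\varphi(x_i) + b + e_i, \quad i = 1, \ldots, l$$

In this formula, $w$ is the weight vector, $\gamma$ is the regularization parameter, $e_i$ is the error variable, and $b$ is the deviation. The Lagrangian method is used to solve this optimization problem:

$$L(w,b,e,a) = J(w,e) - \sum_{i=1}^{l} a_i \left[ w^{T}\varphi(x_i) + b + e_i - y_i \right]$$

In this formula, $a_i$ is the Lagrange multiplier.

According to the Kuhn-Tucker conditions [27]:

$$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0, \quad \frac{\partial L}{\partial e_i} = 0, \quad \frac{\partial L}{\partial a_i} = 0$$

The following formulae can be obtained:

$$w = \sum_{i=1}^{l} a_i \varphi(x_i), \quad \sum_{i=1}^{l} a_i = 0, \quad a_i = \gamma e_i, \quad w^{T}\varphi(x_i) + b + e_i - y_i = 0$$

Selecting the radial basis function as the kernel:

$$K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j) = \exp\left( -\frac{\left\| x_i - x_j \right\|^2}{2\sigma^2} \right)$$

In this formula, $\sigma$ is the kernel parameter and $K(x_i, x_j)$ is the kernel function. The optimization problem is transformed into solving a linear algebraic system of equations:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \tag{15}$$

In (15), $\mathbf{1} = [1, \ldots, 1]^{T}$ is a column vector. Using the least square method to solve for $a$ and $b$, the final regression expression is obtained:

$$f(x) = \sum_{i=1}^{l} a_i K(x, x_i) + b$$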
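The linear system above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the symbols `gamma` (regularization parameter) and `sigma` (RBF kernel width), and the sine test curve are all chosen here for illustration.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Kernel matrix with K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma^2))."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y] for a and b."""
    n = X.shape[0]
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(M, rhs)
    return solution[1:], solution[0]  # multipliers a, bias b

def lssvm_predict(X_train, a, b, sigma, X_new):
    """Evaluate f(x) = sum_i a_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ a + b

# Illustrative data (an assumption): one period of a sine curve.
X = np.linspace(0.0, 1.0, 50)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
a, b = lssvm_fit(X, y, gamma=1000.0, sigma=0.2)
y_hat = lssvm_predict(X, a, b, 0.2, X)
```

Because training reduces to one dense linear solve, the cost grows roughly cubically with the number of samples. That is acceptable for monitoring series of a few hundred points.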

3. Improved Experimental Method

3.1. The Improved VMD Algorithm

The VMD algorithm searches for the optimal solution of the variational model by nonrecursive iteration in the frequency domain and determines the center frequency and bandwidth of each amplitude-modulated and frequency-modulated component, so that adaptive frequency division and separation of the components are realized [28]. This method has high resolution and can effectively avoid mode mixing, but obtaining the optimal decomposition requires the number k of modes to be determined in advance [29]. Traditionally, k is chosen empirically or by referring to the EMD decomposition. To overcome the difficulty of finding the optimal value of k in the VMD algorithm, an improved VMD algorithm based on the Newton method is proposed. The specific steps are as follows:

Construct the objective function: the correlation coefficient (COR) is introduced, which represents the degree of correlation between a modal component and the original signal [30]. When the COR value is below the threshold, the decomposition remainder is considered to no longer contain information related to the original signal, and the original signal is regarded as completely decomposed. The correlation coefficient of the two time series $u_k$ and $f$ is defined as follows:

$$\rho_k = \frac{\sum_{t} \left( u_k(t) - \bar{u}_k \right)\left( f(t) - \bar{f} \right)}{\sqrt{\sum_{t} \left( u_k(t) - \bar{u}_k \right)^2}\,\sqrt{\sum_{t} \left( f(t) - \bar{f} \right)^2}}$$

Here, $u_k$ is the k-th modal component, $f$ is the original signal, and $\rho_k$ is the correlation coefficient between the k-th modal component and the original signal. The objective function is then built on $\rho_k$ and compared against the threshold in the iterative search.

Iteratively search for the optimal solution [31]: set the initial iteration parameter k=1, execute the iteration loop according to k=k+1, and stop iterating once the COR value falls below the threshold (set to 0.2 in this paper). The algorithm block diagram for solving the optimal parameter k by Newton's method is shown in Figure 1.

The resulting k value is then substituted into the VMD algorithm to obtain the optimal mode decomposition.
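The search for k described above can be sketched as a simple loop. This is an illustration under several assumptions. `decompose` is a hypothetical callable standing in for any VMD routine that returns k modes as rows. The last row is assumed to be the newest mode. The absolute value of the correlation is compared with the threshold. The `fake_vmd` helper exists only to make the example runnable.

```python
import numpy as np

def correlation(u, f):
    """Pearson correlation coefficient between a mode u and the signal f."""
    return float(np.corrcoef(np.asarray(u, float), np.asarray(f, float))[0, 1])

def select_k(signal, decompose, threshold=0.2, k_max=12):
    """Increase k from 1 until the k-th mode is only weakly correlated
    with the original signal, then return that k."""
    for k in range(1, k_max + 1):
        modes = decompose(signal, k)  # assumed shape: (k, len(signal))
        if abs(correlation(modes[-1], signal)) < threshold:
            return k
    return k_max  # fallback when the threshold is never reached

# Stand-in decomposer (hypothetical): hands back known components one by one.
t = np.arange(1024) / 5120.0
parts = [np.cos(2 * np.pi * 55 * t),
         np.cos(2 * np.pi * 580 * t),
         0.05 * np.cos(2 * np.pi * 1800 * t)]  # weak, noise-like component
signal = parts[0] + parts[1] + parts[2]

def fake_vmd(sig, k):
    return np.array(parts[:k])

best_k = select_k(signal, fake_vmd)
```

In this toy case the third component carries almost none of the signal energy, so its correlation with the signal falls below 0.2 and the loop stops at k=3.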

3.2. The Optimization of LSSVM Algorithm

In the LSSVM algorithm, the regularization parameter γ and the kernel width σ are two very important parameters. If γ is too small, it leads to under-learning; if it is too large, it can lead to over-learning. The value of σ directly affects the accuracy of model fitting [32]. In this paper, the particle swarm optimization (PSO) algorithm is used to optimize these two parameters to improve the performance of LSSVM in curve fitting. The flow chart of the LSSVM algorithm optimization is shown in Figure 2.

The specific steps are as follows [33]:
(1) Parameter initialization: initialize the PSO parameters, including population size, learning factors, and inertia weight.
(2) Calculating individual fitness: the fitness value of each particle is calculated by the LSSVM model, and the current fitness value is compared with the best fitness value of the particle itself to get the optimal position of the particle.
(3) Population regeneration: the optimal position fitness of each particle is compared with the optimal position fitness of the population, the better one is selected as the optimal position of the population, and the position and velocity of each particle in the population are updated.
(4) Termination: when the number of iterations reaches the maximum, the optimization is finished and the current optimal particle is selected as the parameter pair of the LSSVM algorithm. Otherwise, jump to step (2) and continue the iterative optimization.
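The four steps above can be sketched as a generic PSO minimizer. This is a minimal illustration, not the authors' code. The inertia weight, learning factors, bounds, and seed are assumed values. The quadratic `toy_fitness` stands in for the LSSVM fitting error that the paper uses as the fitness function.

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=30, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(p) over a box using a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    # Step 1: initialize positions, velocities, and personal bests.
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    pbest_pos = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_val))
    gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
    for _ in range(n_iter):  # Step 4: stop at the maximum iteration count.
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Step 3: update velocity and position of every particle.
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, lower, upper)
        # Step 2: evaluate fitness and refresh personal and global bests.
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest_pos[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
    return gbest_pos, gbest_val

# Stand-in fitness (an assumption): minimum at (3.0, 0.5).
def toy_fitness(p):
    return (p[0] - 3.0) ** 2 + (p[1] - 0.5) ** 2

best_params, best_val = pso_minimize(toy_fitness, lower=[0.0, 0.0], upper=[10.0, 2.0])
```

To tune an LSSVM model, `toy_fitness` would be replaced by a function that fits the model for a given parameter pair and returns its validation error.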

3.3. The Model of Outlier Detection and Correction for Water Quality

In order to improve the detection rate and reduce the correction error for abnormal water quality data, an outlier detection and correction method for water quality monitoring data is proposed based on improved VMD and LSSVM. First, the data from the water quality monitoring station are preprocessed with the Pauta criterion to eliminate the obvious outliers. Then, the improved VMD algorithm is used for mode decomposition of the remaining monitoring data sequence, and the outliers in the monitoring data are detected by superimposing the low-frequency modal components. Finally, the LSSVM algorithm is used to correct the outliers. The detailed block diagram of the water quality outlier detection and correction model is shown in Figure 3.

4. Results and Discussion

4.1. Performance Comparison of Signals Decomposition

The EMD and improved VMD algorithms are used to decompose a simulation signal in order to compare their performance in signal decomposition. The simulation signal is composed of three cosine signals at 55 Hz, 266 Hz, and 580 Hz and a noise sequence A, where A is initialized as A = zeros(1,1024).

In this experiment, the sampling frequency is 5120 Hz and the number of samples is 1024. The time domain graph and the corresponding spectrum graph of the simulation signal are shown in Figures 4 and 5, respectively.
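A signal of this kind can be generated and inspected as in the sketch below. The sampling frequency, sample count, and the three frequencies come from the text. The unit amplitudes are an assumption, and the noise sequence A is left out because its nonzero entries are not given here.

```python
import numpy as np

fs = 5120  # sampling frequency in Hz (from the text)
n = 1024   # number of samples (from the text)
t = np.arange(n) / fs

# Unit amplitudes are an assumption; the noise sequence A is omitted.
x = (np.cos(2 * np.pi * 55 * t)
     + np.cos(2 * np.pi * 266 * t)
     + np.cos(2 * np.pi * 580 * t))

spectrum = np.abs(np.fft.rfft(x))
freqs = np.arange(n // 2 + 1) * fs / n  # bin spacing is fs / n = 5 Hz

# Frequencies of the three largest spectral peaks, in ascending order.
top3 = np.sort(freqs[np.argsort(spectrum)[-3:]])
```

With 1024 samples at 5120 Hz the FFT bins are 5 Hz apart, so the 266 Hz component shows up in the 265 Hz bin. The other two components fall exactly on bins.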

As can be seen from Figure 5, the simulation signal mainly contains the frequency components of 55 Hz, 266 Hz, and 580 Hz. To decompose the original signal with the improved VMD algorithm, the quadratic penalty factor α and the number of modes k need to be determined first.

The value of α affects the decomposition effect of the improved VMD algorithm. The smaller α is, the greater the bandwidth of the intrinsic mode function (IMF) components; conversely, the larger α is, the smaller the bandwidth of the IMF components. According to experience, the value of α is usually between 5000 and 10000. Furthermore, changing α within this range does not have a large impact on the decomposition effect. In this experiment, the value of α is 8000.

For the selection of k, if the k value is too large, the decomposed IMF components become intermittent. The optimal solution k=4 is obtained by iteration with the Newton method. The decomposition results and corresponding spectra obtained by improved VMD are shown in Figures 6 and 7, respectively.

The EMD algorithm is used to decompose the simulation signals, and the decomposition results and corresponding spectra are shown in Figures 8 and 9, respectively.

From Figures 6 and 7 we can see that the improved VMD algorithm decomposes the signal into four IMF components. The frequencies of the first three IMF components are 55.1076 Hz, 265.5186 Hz, and 581.135 Hz, respectively, which are basically consistent with the frequency components contained in the original signal. The fourth IMF component is a set of noise signals with very low intensity, mainly distributed in the frequency range from 1500 Hz to 2100 Hz. As can be seen from Figures 8 and 9, the decomposition result of the EMD algorithm is not ideal: modal overlap occurs in the third to fifth IMF components, that is, the EMD algorithm suffers from mode mixing. The improved VMD algorithm overcomes the mode mixing problem and achieves a good decomposition effect. To sum up, the improved VMD algorithm is better than the EMD algorithm for signal decomposition.

4.2. Outlier Detection and Correction

The fourth IMF component in Figure 6 is a series of extremely low-intensity noise signals, which needs to be removed. In order to detect the abnormal points in the original signal, the remaining three IMF components in Figure 6 are added together to obtain a new time series signal. The time domain diagrams of the original signal and the superimposed signal are shown in Figure 10.

Taking ±50% as the threshold, the relative error between the original data sequence and the superimposed data is calculated. A data point is treated as an outlier when its relative error exceeds the threshold. In Figure 11, the points marked with red dots are the outliers detected in the original simulation signal.
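The thresholding rule can be sketched as follows. The text does not state which sequence is used as the denominator of the relative error, so this sketch assumes the superimposed (reconstructed) signal. The small `eps` guard against division by zero and the sample values are also illustrative assumptions.

```python
import numpy as np

def detect_outliers(original, reconstructed, threshold=0.5, eps=1e-12):
    """Return indices where the relative error exceeds the threshold.

    The relative error is taken with respect to the reconstructed signal
    (an assumption); eps avoids division by zero.
    """
    original = np.asarray(original, float)
    reconstructed = np.asarray(reconstructed, float)
    rel_err = np.abs(original - reconstructed) / np.maximum(np.abs(reconstructed), eps)
    return np.flatnonzero(rel_err > threshold)

# Illustrative values: one sample deviates strongly from the smooth signal.
original = np.array([1.0, 1.1, 3.0, 0.9, 1.0])
reconstructed = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
outlier_idx = detect_outliers(original, reconstructed)
```

A relative threshold becomes unstable where the reference signal crosses zero, which is less of a concern for strictly positive quantities such as DO concentration.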

In order to verify the superiority of improved VMD algorithm in outlier detection, two comparative experiments were designed: using the EMD algorithm and EEMD algorithm to detect outliers. The detection results of the two algorithms are shown in Figures 12 and 13, respectively.

The detection results of EMD, EEMD, and improved VMD are shown in Table 1.

In order to calculate the accuracy of the three algorithms for outlier detection, four counts are defined: TP is the number of normal data classified as normal, FP is the number of abnormal data classified as normal, FN is the number of normal data classified as abnormal, and TN is the number of abnormal data classified as abnormal. The detection rate Acc (Accuracy) and the error rate Fal (False) of the abnormal data are calculated as follows:

$$\mathrm{Acc} = \frac{TN}{TN + FP} \times 100\% \tag{21}$$

$$\mathrm{Fal} = \frac{FN}{TP + FN} \times 100\% \tag{22}$$
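The two rates can be computed as in the sketch below. It assumes that the detection rate is the share of abnormal points that are flagged and that the error rate is the share of normal points that are wrongly flagged. The label convention follows this section, where abnormal points count as the negative class. The counts in the example are hypothetical and are not taken from Table 1.

```python
def detection_rate(tn, fp):
    """Percentage of abnormal points correctly flagged as abnormal.

    tn: abnormal points classified as abnormal
    fp: abnormal points classified as normal
    """
    return 100.0 * tn / (tn + fp)

def error_rate(fn, tp):
    """Percentage of normal points wrongly flagged as abnormal.

    fn: normal points classified as abnormal
    tp: normal points classified as normal
    """
    return 100.0 * fn / (fn + tp)

# Hypothetical counts for illustration only.
acc = detection_rate(tn=9, fp=1)
fal = error_rate(fn=2, tp=98)
```

Reporting both rates matters because abnormal points are rare. A detector that flags nothing would score well on overall accuracy yet have a detection rate of zero.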

According to formula (21) and formula (22), the detection rate and error rate of EMD, EEMD, and improved VMD algorithms are shown in Table 2.

As seen in Table 2, the EMD algorithm has the lowest detection rate of the three algorithms, at 86.84%. Mode mixing in the decomposition process of the EMD algorithm is an important reason for this low rate. The EMD algorithm also has the highest error rate, at 0.71%. The EEMD algorithm improves on the EMD algorithm: its detection rate rises to 94.74%, and its error rate falls to 0.41%. The improved VMD algorithm proposed in this paper has clear advantages in signal decomposition precision and noise robustness; its detection rate for abnormal data increases further to 97.37%, and its error rate of 0.20% is the lowest of the three algorithms. Of the three algorithms, the improved VMD algorithm therefore performs best in outlier detection.

Remove the abnormal data detected in Figure 10, and the remaining sampling points constitute a new sequence. The dispersion (min-max) normalization method is used to process the sequence:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

The normalized fitting result is $y^{*}$, and the actual fitting result is obtained by inverting the normalization:

$$y = y^{*}\left( x_{\max} - x_{\min} \right) + x_{\min}$$

The parameters γ and σ of the LSSVM model are determined by the PSO method, and the final result is taken as the optimal parameter pair. The curve is fitted with the LSSVM model, and the result is shown in Figure 14.
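Dispersion normalization is commonly implemented as min-max scaling. The sketch below assumes that form, together with its inverse for mapping fitted values back to the original units. The function names and sample values are illustrative.

```python
import numpy as np

def minmax_normalize(x):
    """Scale x to [0, 1]; also return the minimum and maximum used."""
    x = np.asarray(x, float)
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo), lo, hi  # undefined for a constant series

def minmax_restore(x_norm, lo, hi):
    """Map normalized values back to the original scale."""
    return np.asarray(x_norm, float) * (hi - lo) + lo

# Illustrative values only.
series = np.array([6.0, 8.0, 10.0])
normalized, lo, hi = minmax_normalize(series)
restored = minmax_restore(normalized, lo, hi)
```

The minimum and maximum must come from the sequence used for fitting and be reused unchanged when restoring the fitted values.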

The correction result of outlier is shown in Figure 15.

For comparison, the SVM and BP neural network algorithms are also used to correct the detected outliers. The values corrected by the LSSVM, SVM, and BP neural network algorithms are shown in Figure 16.

MSE, MAE, and MAPE indicators are used to evaluate the performance of the algorithm for outlier correction, and the results are shown in Table 3.
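The three indicators can be computed as in the sketch below. It assumes the standard definitions of mean squared error, mean absolute error, and mean absolute percentage error, since the paper's own formulas are not reproduced here. The sample values are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; y_true must not contain zeros."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Illustrative values only.
y_true = [2.0, 4.0, 5.0]
y_pred = [2.5, 3.5, 5.0]
scores = (mse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

MAPE divides by the true value, so it is reported in percent and grows when the true values are small. MSE and MAE remain in the units of the data.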

According to the experimental results in Table 3, the SVM algorithm has the worst fitting effect of the three algorithms, because the selection of the kernel function in the SVM algorithm is difficult. Compared with the SVM algorithm, the BP neural network algorithm improves the correction of outliers, but the improvement is not obvious because the algorithm depends on the selection of training set samples. The LSSVM algorithm adopted in this paper is an improvement of the SVM algorithm, and its data fitting effect is clearly better. In addition, the MAPE values of the three algorithms are clearly larger than the MSE and MAE values; this is because the values in the data set selected for this experiment are small, and it does not affect the performance evaluation of the algorithms. Taken together, the LSSVM algorithm has the best fitting effect; that is, LSSVM has the highest accuracy in outlier correction.

4.3. Outlier Detection and Correction of Actual Monitoring Data

Take the monitoring data of DO for a period of time in a water quality monitoring station in Hangzhou (Wan Cun station from Jan 1, 2018 to Feb 2, 2018) as an example, and record it as x(t). The time sequence for the DO concentration of this monitoring site is shown in Figure 17.

In order to simplify the operation of improved VMD algorithm, the raw data of DO is preprocessed first. According to the Pauta criterion in classical statistics, the preliminary outlier detection results are shown in Figure 18.
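The Pauta (three-sigma) criterion used for this preprocessing step can be sketched as follows. The use of the sample standard deviation (`ddof=1`) and the synthetic DO-like series are assumptions made for illustration.

```python
import numpy as np

def pauta_outliers(x):
    """Return indices of points more than three standard deviations
    away from the sample mean (Pauta criterion)."""
    x = np.asarray(x, float)
    mean = x.mean()
    std = x.std(ddof=1)  # sample standard deviation (an assumption)
    return np.flatnonzero(np.abs(x - mean) > 3.0 * std)

# Synthetic series (illustrative): small fluctuation around 8 with one spike.
values = 8.0 + 0.1 * ((np.arange(31) % 5) - 2)
values[15] = 20.0
flagged = pauta_outliers(values)
```

Because the mean and standard deviation are themselves inflated by outliers, the criterion only catches the most obvious ones. This is why it serves here as a coarse first pass before the improved VMD step.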

After preliminary detection, four outliers are detected. In Figure 18, the sample points marked with red dots are the outliers. After the four obvious outliers are removed, a new sampling sequence is obtained. The improved VMD algorithm is used to decompose the sampling sequence, and the optimal number of modes k=3 is obtained by the Newton method. The parameters of the improved VMD algorithm are set as follows: the value of α is 8000, and the value of k is 3. The time domain graph and the corresponding spectrum graph of the mode decomposition are shown in Figure 19.

Remove the third IMF component and superpose the remaining two IMF components. A superimposed sequence is obtained, as shown in Figure 20.

Selecting the threshold according to the method in the simulation experiment, the outlier is detected and shown in Figure 21.

As shown in Figure 21, 9 outliers are detected through the improved VMD algorithm. Together with the 4 outliers detected during preprocessing, a total of 13 outliers in the DO monitoring values are detected in this experiment. The LSSVM algorithm is used to correct these 13 abnormal data points, and the correction results are shown in Figure 22.

In order to verify the effectiveness of this method in practical engineering applications, an additional set of comparative experiments was carried out. A set of standard monitoring data was obtained from the Zhejiang Provincial Environmental Protection Department. The data consisted of 200 samples of DO content at the Jiu Xi monitoring station from April 1, 2018 to May 3, 2018. Then 20 samples were artificially modified to represent abnormal samples. The EMD, EEMD, and improved VMD algorithms were each used to detect the outliers, and the LSSVM algorithm was used to repair them. The results are shown in Figures 23(a)-23(c) and Table 4, respectively.

The results show that the improved VMD algorithm is of great effect on outlier detection and it has high accuracy and low error rate. From the two indicators of detection rate and error rate, we can see that the performance of the improved VMD algorithm is better than that of EMD and EEMD algorithms, which is consistent with the experimental results in simulated data scenarios.

For the outlier correction, the comparison of algorithm performance among SVM, BP neural network, and LSSVM is shown in Table 5.

The result in Table 5 is also consistent with the experimental results in simulated data scenarios. For the MSE, MAE, and MAPE three indicators, the performance of the LSSVM algorithm is obviously better than that of SVM and BP neural network.

The method of this paper has already been deployed as an algorithm package in the water quality parameter monitoring and trend forecast system developed by ourselves, and it has been applied to water quality monitoring stations in Zhejiang Province, China. This method avoids the data errors caused by equipment failure, external interference, and other factors. It also replaces traditional manual statistics and correction and improves the efficiency and service quality of environmental protection. The location of the water quality stations in the software platform is shown in Figure 24.

Using the method of this paper, abnormal values of DO concentration in different monitoring stations are detected and corrected. Taking the Wan Cun monitoring site of Hangzhou as an example, the engineering implementation effect of the algorithm package is shown in Figure 25. The black graph shows the historical curve of the DO concentration value of the monitoring station processed through the algorithm package.

In the future, the algorithm will be applied to more water quality parameters, such as COD, pH, NH3-N, and TP, which will make it more practical.

5. Conclusions

To improve the detection rate and reduce the error rate for outliers in water quality data, an outlier detection and correction method based on improved VMD and LSSVM is proposed, and the method is applied to Wan Cun, a water quality monitoring station in Hangzhou. The method avoids the shortcoming of the EMD algorithm in the process of signal decomposition. In terms of detection rate and error rate, the method of this paper is superior to the EMD and EEMD algorithms. On the basis of the outlier detection, the outliers in the DO monitoring values are corrected. In terms of MSE, MAE, and MAPE, LSSVM is better than SVM and BP neural network. The method proposed in this paper can be applied to water quality monitoring and its related fields, and it provides an important reference for enforcement by environmental departments and the implementation of environmental protection measures.

Data Availability

The real-time monitoring data used in the manuscript were obtained from the Drinking Water Quality Automatic Monitoring Station of Zhejiang Environmental Protection Department collected from Jan/01/2018 to May/31/2018. Any researcher can see http://yys.zjemc.org.cn/Home/Map?moudleName=realtime# for more information.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study is supported by International Science and Technology Cooperation Program of Zhejiang Province for Joint Research in High-Tech Industry (No. 2016C54007), National Natural Science Foundation of China and Zhejiang Joint Fund for Integrating of Informatization and Industrialization (No. U1509217), Provincial Key R&D Program of Zhejiang Province (No. 2017C03019), and the National Key R&D Program of China (No. 2016YFC0201400).

Supplementary Materials

A graphical summary of the manuscript that lets readers quickly capture the core content of the paper.