Abstract

Because common spatial prediction algorithms do not predict the heavy metal content of soil accurately enough, a prediction model based on an improved deep Q network is proposed. State value reuse is used to speed up the agent's learning from training samples in the deep Q network, which improves the convergence speed of the model. At the same time, an adaptive fuzzy membership factor is introduced to change the agent's sensitivity to the environmental feedback value in different training periods, which improves the stability of the model after convergence. Finally, an adaptive inverse distance interpolation method is adopted to predict the observed values of the interpolation points, which improves the prediction accuracy of the model. The simulation results show that, compared with the random forest regression model (RFR) and the inverse distance weighted prediction model (IDW), the proposed model improves the prediction accuracy of soil heavy metal content by 13.03% and 7.47%, respectively.

1. Introduction

Soil is an important resource for human survival and development, as well as the lifeline of the whole ecosystem. However, with rising production levels and rapid economic development, soil pollution has become increasingly serious. Heavy metals are among the most intractable soil pollutants because they are difficult for microorganisms to degrade. Heavy metal pollution not only affects crop growth and reduces crop yield, but may also enter the human body through food and other routes, threatening human life and health. Liu Xingmei et al. reported that eating vegetables contaminated by heavy metals poses both noncarcinogenic and carcinogenic risks to human health. Therefore, it is necessary to study heavy metals in soil. In recent years, as attention to soil heavy metal pollution has grown, more and more in-depth studies have been carried out [1]. Lan et al. used pinealone-biochar to stably passivate Pb, Cu, Zn, Cr, and As in soil and showed that the addition of pinealone-biochar together with indigenous microorganisms can effectively reduce the biological activity of heavy metals and accelerate their passivation [2]. Khan Imran and Yang Dong et al. found that silicon in soil has a detoxification mechanism for heavy metals, which provides a theoretical basis for reducing the toxicity of heavy metals in soil [3, 4]. Guo Xujing and Liu Xingmei et al. used spectroscopy combined with parallel factor analysis and two-dimensional correlation spectra to study the complexing characteristics of the heavy metals Cr(III) and Cu(II) in soil with biochar-derived water-extractable organic matter (WEOM) [1, 5]. With corn cobs as the raw material, biochar-derived WEOM can be obtained by low-temperature (300 °C) pyrolysis and used for soil heavy metal remediation. Rana Anuj and Zhang Jiachun et al. studied the biological activity of heavy metals (Cd and Cr) in crops and concluded that heavy metal pollution in soil could be dealt with by reducing the biological activity of heavy metals [6, 7]. These results show that current research on heavy metals in soil focuses mainly on treatment, and there are few studies on content prediction, even though predicting the content of soil heavy metals is a prerequisite for treating them [8]. Thus, a prediction model based on deep reinforcement learning is proposed.

2. Basic Methods

2.1. Deep Q Networks

The deep Q network is a representative algorithm of deep reinforcement learning. By combining the perception ability of deep learning with the decision-making ability of reinforcement learning, it solves the problem that a Q table cannot cover a large state-action space [9]. The target value of the deep Q network is calculated from the state value function, as shown in formula (1) [10], where θ denotes the online value network; θ⁻ denotes the target value network; s is the current state; a is the action taken in state s; r is the reward the agent receives for a; and s′ is the next state reached by the agent when action a is taken in state s.
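For reference, the target value of a deep Q network is conventionally written in the form below. The discount factor γ, the primed symbols s′ and a′, and the notation θ⁻ for the target value network are standard notation assumed here; the exact form printed as formula (1) may differ in detail.

```latex
y = r + \gamma \max_{a'} Q\left(s', a'; \theta^{-}\right)
```

Here the maximum is taken over the actions a′ available in the next state s′, and the target network θ⁻ is held fixed between periodic updates so that the regression target does not move at every step.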

Figure 1 shows the operation process of deep Q network.

Combining a neural network with reinforcement learning in the deep Q network introduces an overestimation problem: the predicted output value of the model contains an estimation error and does not truly reflect the actual value [11]. In addition, the deep Q network converges slowly, and its stability after convergence is poor [12]. To solve these problems, state value reuse and a fuzzy membership factor are used to improve the deep Q network. At the same time, an adaptive inverse distance weighted method is used to adjust the hyperparameters and improve the prediction accuracy.
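The overestimation mentioned above is commonly attributed to taking a maximum over noisy value estimates. The toy experiment below is a minimal sketch of that effect, not the paper's experiment: it assumes five actions whose true values are all zero and adds unit Gaussian noise to each estimate. All names and settings are illustrative assumptions.

```python
import numpy as np

# Toy setting (assumed): every action has the same true value of zero.
rng = np.random.default_rng(0)
n_actions = 5
n_trials = 10_000
true_q = np.zeros(n_actions)

# Each trial draws one noisy estimate per action.
noisy_q = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Average of the per-trial maxima versus the maximum of the true values.
mean_max_estimate = noisy_q.max(axis=1).mean()
overestimation = mean_max_estimate - true_q.max()

print(f"mean of max over noisy estimates: {mean_max_estimate:.3f}")
print(f"max of true values:               {true_q.max():.3f}")
```

Although every individual estimate is unbiased, the maximum over them is biased upward, which is the kind of systematic error that motivates modifying the basic deep Q network.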

2.2. Improvement of Deep Q Network
2.2.1. State Value Reuse

State value reuse combines part of the output of the value function with the obtained reward value to form a total reward value. The total reward value replaces the original environment reward value in training the agent and participates in the weight update of the Q network. After each round of training, the network error is calculated and the weights are updated. The reward value in the deep Q network model with state value reuse is calculated as shown in formula (2) [13], where s is the current state; a is the action performed in state s; p is the state probability after a is executed; r(s, a, p) is the reward value that the environment returns for the action; the remaining term is the partial output value of the value function; and λ is the regulating factor, which determines how dominant the reward value returned by the environment is in the total reward value, so that the magnitude of the environment reward does not affect model convergence.

2.2.2. Dynamic Fuzzy Membership Factor

State value reuse optimizes the deep Q network by combining the environmental feedback reward value with the state value of the Q network in a certain proportion, but that proportion remains unchanged throughout training. In practice, the Q network is not sensitive to the environmental feedback reward at the initial stage of training, so it cannot accurately judge how good or bad the current environment is. Therefore, the proportion of the state value needs to be reduced to improve the agent's ability to sense the environment. In the middle of training, the parameters of the Q network move toward the optimal solution and network performance keeps improving, so the regulatory factor should be increased appropriately to strengthen the reward or punishment that the environment gives the agent for its actions. At the later stage of training, the parameters of the Q network are basically stable, and the regulatory factor should be kept near its maximum value to improve the convergence rate of the model [14, 15]. Thus, during the training of the Q network, the proportions of the state value and the environmental feedback reward value in the total reward value should change dynamically. For this purpose, a dynamic fuzzy membership factor δ is introduced in this paper, as shown in formula (3) [16], where n is the current number of training steps and n_total is the expected total number of training steps. δ changes with n: when n is small, δ tends to 0; as n grows, δ increases gradually; and when n = n_total, δ approaches 1.
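The text characterizes δ only through its behavior: close to 0 for small n, increasing with n, and close to 1 at n = n_total. One function with these properties is a linear ramp. The sketch below uses it purely as a stand-in; the function name and the linear form are assumptions and should not be read as formula (3).

```python
def fuzzy_membership_factor(n: int, n_total: int) -> float:
    """Hypothetical stand-in for delta: a linear ramp from 0 to 1.

    Assumption: only the qualitative behavior described in the text is
    reproduced (0 at the start, increasing, 1 at n == n_total).
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Clamp n so the factor always stays inside [0, 1].
    n_clamped = min(max(n, 0), n_total)
    return n_clamped / n_total


# Values of the factor at the start, middle, and end of a 5000-step run.
schedule = [fuzzy_membership_factor(n, 5000) for n in (0, 2500, 5000)]
print(schedule)
```

Any other monotone function from 0 to 1 could be substituted without changing how the factor is used in the total reward.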

To sum up, the total reward value of the improved deep Q network model is calculated as shown in formula (4) [17], where s is the current state; a is the action performed in state s; p is the probability of the environment transferring to the next state; r(s, a, p) is the reward value that the environment returns for the action; the remaining term is the partial output value of the value function in the Q network; and δ is the regulatory factor.

2.3. Adaptive Inverse Distance Weighted Method

The deep Q network model determines the reward value of the training environment from the observed values of the interpolation points, as shown in formula (5). However, the observed value of an interpolation point is unknown and is usually predicted by the inverse distance weighted method. According to [18], the inverse distance weighted method interpolates poorly because it cannot adapt to complex terrain structure. To solve this problem, an adaptive inverse distance weighted method is proposed. In formula (5), s is the current state of the agent; a is the action performed in state s; s′ is the next state entered when the agent performs action a in state s; r is the reward value that the environment gives the agent for carrying out action a in state s; and the remaining terms are the fitted value of hi on the variogram (mutation function) curve corresponding to s, the fitted value of hi corresponding to s′, and the discrete variogram points of hi.
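As background for the inverse distance weighted method mentioned above, the sketch below implements plain inverse distance weighting in Python with NumPy. It assumes that the hyperparameter being tuned is the distance-decay exponent, called `power` here; the function name, the argument names, and that interpretation of the hyperparameter are illustrative assumptions rather than details given in the text.

```python
import numpy as np


def idw_predict(known_xy, known_values, query_xy, power=2.0, eps=1e-12):
    """Plain inverse distance weighted estimate at one query point.

    `power` is the distance-decay exponent (assumed to be the tunable
    hyperparameter); larger values give nearby samples more influence.
    """
    known_xy = np.asarray(known_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Euclidean distance from the query point to every known sample.
    dists = np.linalg.norm(known_xy - query_xy, axis=1)

    # If the query coincides with a sample, return that sample's value
    # to avoid dividing by zero.
    nearest = int(np.argmin(dists))
    if dists[nearest] < eps:
        return float(known_values[nearest])

    weights = 1.0 / dists ** power
    return float(np.sum(weights * known_values) / np.sum(weights))


# Two samples at equal distance from the query contribute equally.
samples_xy = [(0.0, 0.0), (2.0, 0.0)]
samples_z = [0.0, 10.0]
midpoint_estimate = idw_predict(samples_xy, samples_z, (1.0, 0.0), power=2.0)
print(midpoint_estimate)
```

With a single global exponent, every query point is treated the same way regardless of how densely it is surrounded by samples, which is the limitation an adaptive, per-point hyperparameter is meant to address.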

In the adaptive inverse distance weighted method, the hyperparameter of each known point in the model is learned, and the nearest adjacent statistic of each point is calculated. These values form multidimensional spatial discrete points, which are modeled spatially by the Kriging interpolation method [19, 20]. The coordinates of the interpolation points to be predicted are then input into the spatial model to obtain the corresponding hyperparameters of the interpolation points. Finally, the predicted values are obtained by using these hyperparameters in the inverse distance weighting of the interpolation points. The adjacent distance is calculated as shown in formula (6) [21], where N is the total number of sample points in the study area and A is the area of the study area.

The nearest adjacent statistic can be calculated by formula (7) [22], where dn is the nearest expected distance of the prediction point and davg is the expected nearest distance of the study area.
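The two quantities above can be sketched as follows. The sketch assumes the usual nearest-neighbor analysis form, in which the expected nearest distance for N randomly placed points in an area A is 1 / (2 * sqrt(N / A)) and the statistic is the ratio of the distance found at the prediction point to that expected distance. Whether formulas (6) and (7) take exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def expected_nearest_distance(n_points: int, area: float) -> float:
    """Expected nearest-neighbor distance for a random point pattern.

    Assumed form: 1 / (2 * sqrt(N / A)), with N sample points in area A.
    """
    if n_points <= 0 or area <= 0:
        raise ValueError("n_points and area must be positive")
    return 1.0 / (2.0 * math.sqrt(n_points / area))


def nearest_adjacent_statistic(d_n: float, n_points: int, area: float) -> float:
    """Ratio of the distance at the prediction point to the expected distance."""
    return d_n / expected_nearest_distance(n_points, area)


# 100 points in an area of 100 square units: density is 1 point per unit.
d_avg = expected_nearest_distance(100, 100.0)
ratio = nearest_adjacent_statistic(0.25, 100, 100.0)
print(d_avg, ratio)
```

Under this reading, a ratio below 1 indicates that samples around the prediction point are denser than a random pattern would suggest, and a ratio above 1 indicates that they are sparser.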

3. Prediction of Soil Heavy Metal Content Based on Improved Deep Q Network

The prediction process of soil heavy metal content is designed as follows:
(1) Preprocess the collected original soil heavy metal content data and divide it into a sample point data set and an interpolation point data set. The sample point data set contains samples with known observation values, and the interpolation point data set contains samples with unknown observation values.
(2) Use the adaptive deep Q network to train on the sample point data set and record the optimal inverse distance weighted hyperparameter of each point.
(3) Compose all optimal hyperparameters into a new data set, and calculate the spatial discrete points of the variogram (mutation function) of the data set to obtain the variogram model, which is shown in formula (8) [23], where s is a point in support set V of random field z, and h is the vector between any two points in V. When the variogram is in a second-order stationary process, formula (8) can be rewritten as formula (9), where E and var are the mathematical expectation and variance operations, and u is the mathematical expectation at a specific point in the random field. Since the covariance depends on h, formula (9) can be expressed as formula (10). Considering that the covariance depends only on the Euclidean distance between the two spatial points and not on the direction, formula (10) can be simplified further.
(4) Use the weight coefficient, as shown in formula (13), establish the fitting standard, as shown in formula (14), and estimate the parameters of the variogram model to obtain the optimal variogram model and its optimal parameters. The terms in these formulas are the fitted value of point hi on the variogram curve, the discrete variogram point of hi, and the objective function of model parameter optimization.
(5) Adopt the Kriging method to model the new data set and obtain the hyperparameter distribution model.
(6) Input the data of the interpolation points into the hyperparameter distribution model, and pass the obtained hyperparameters and the corresponding interpolation point data to the inverse distance weighted algorithm to obtain the final predicted value of soil heavy metal content.
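Step (3) relies on discrete points of the variogram computed from the data. The sketch below shows one common way to obtain such points: the classical empirical semivariogram, which averages half the squared difference of all pairs whose separation falls into a distance bin. The estimator, the binning scheme, and every name are assumptions made for illustration; formulas (8) to (14) may differ in detail.

```python
import numpy as np


def empirical_semivariogram(coords, values, bin_edges):
    """Classical empirical semivariogram over distance bins.

    Returns the mean lag, the semivariance, and the pair count for each
    non-empty bin [bin_edges[k], bin_edges[k + 1]).
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)

    # All unordered pairs of points (i < j).
    i, j = np.triu_indices(len(values), k=1)
    dists = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diffs = (values[i] - values[j]) ** 2

    lags, gammas, counts = [], [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (dists >= lo) & (dists < hi)
        if mask.any():
            lags.append(dists[mask].mean())
            gammas.append(0.5 * sq_diffs[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(lags), np.array(gammas), np.array(counts)


# Three collinear points with a linear trend in the observed value.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
obs = [0.0, 1.0, 2.0]
lags, gammas, counts = empirical_semivariogram(pts, obs, [0.5, 1.5, 2.5])
print(lags, gammas, counts)
```

The pair counts are returned alongside the semivariances because weighted fitting of a variogram model often uses them as weights; whether the weight coefficient in step (4) is defined that way is not assumed here.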

The above process can be illustrated in Figure 2.

4. Simulation Experiment

4.1. Data Sources and Preprocessing

The soil heavy metal content data set of suburban farmland in Changsha is selected as the experimental data set. The data set contains geographic data in the form of latitude and longitude, which are not suitable for direct input into the model. Therefore, the latitude and longitude are converted into Cartesian coordinates before the experiment. To reduce the magnitude of the geographic data, the converted coordinates are shifted to the origin as a whole. In addition, because the data set contains missing values and different features have different value ranges, missing values are handled by mean interpolation or deletion, and the data are then standardized with the z-score.
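The mean interpolation and z-score steps described above can be sketched as follows. The sketch assumes that missing values are encoded as NaN, that each column is one feature, and that the population standard deviation is used; the function name is hypothetical.

```python
import numpy as np


def impute_mean_and_standardize(features):
    """Fill NaNs with the column mean, then apply z-score standardization."""
    x = np.asarray(features, dtype=float)

    # Column means computed while ignoring missing entries.
    col_mean = np.nanmean(x, axis=0)
    filled = np.where(np.isnan(x), col_mean, x)

    mu = filled.mean(axis=0)
    sigma = filled.std(axis=0)  # population standard deviation (assumed)
    # Guard against constant columns, which would give a zero divisor.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (filled - mu) / sigma


raw = [[1.0, np.nan],
       [3.0, 4.0],
       [5.0, 8.0]]
standardized = impute_mean_and_standardize(raw)
print(standardized)
```

After this step every feature has zero mean and unit variance, so features measured on different scales contribute comparably to the model input.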

4.2. Evaluation Indicators

The mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) are selected as the evaluation indicators in this experiment [24].
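The four indicators can be computed with their standard definitions, as in the sketch below. It assumes that MAPE is reported as a percentage and that no actual value is zero; the function name and the dictionary keys are illustrative.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Standard MSE, RMSE, MAPE (in percent), and MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    # MAPE divides by the actual values, so they must be non-zero.
    mape = float(np.mean(np.abs(err / y_true)) * 100.0)
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "MAE": mae}


metrics = regression_metrics([1.0, 2.0, 4.0], [2.0, 2.0, 2.0])
print(metrics)
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE expresses the error relative to the magnitude of the actual value.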

4.3. Parameter Setting

The initial parameters of all models are set to the same values: the maximum number of training rounds is 5000; the number of training samples in each round is 64; the random seed is 1; the maximum memory size is 500; the initial state of the agent is [2, 14]; the weights of the Q network are updated every 200 training rounds; the learning rate of the convolutional neural network is 0.001; and the probability factor of the ε-greedy algorithm is 0.9.
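For illustration, these settings can be collected in a small configuration object together with an ε-greedy action selector. All field and function names are hypothetical. The sketch also assumes that the probability factor 0.9 is the probability of taking the greedy action; the text does not say which convention is used, and the opposite convention (0.9 as the exploration probability) is equally possible.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Settings listed in the text; field names are hypothetical."""
    max_rounds: int = 5000
    batch_size: int = 64
    random_seed: int = 1
    memory_size: int = 500
    initial_state: tuple = (2, 14)
    weight_update_interval: int = 200
    learning_rate: float = 0.001
    greedy_probability: float = 0.9  # assumed meaning of the factor


def select_action(q_values, greedy_probability, rng):
    """Take the best-valued action with the given probability, else a random one."""
    if rng.random() < greedy_probability:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))


config = TrainingConfig()
rng = random.Random(config.random_seed)
action = select_action([0.1, 0.9, 0.3], config.greedy_probability, rng)
print(config, action)
```

Keeping the settings in one immutable object makes it straightforward to apply identical initial parameters to every model being compared.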

4.4. Experiment Results
4.4.1. Estimation of Model Parameters

To verify the effectiveness of the proposed model, the proposed model, the deep Q network model, the double deep Q network model, and the competing (dueling) deep Q network model are each trained to estimate the parameters of the inverse distance weighted algorithm, and the training process of each model on the data set is recorded. The results are shown in the following figures.

In these figures, the abscissa is the number of training rounds, and the ordinate is the error between the predicted value and the observed value (mg/kg). To show the trend of the prediction error more clearly, the original curves are smoothed. The initial points of the models differ because the agent makes several random decisions before training begins, which leaves the agents in different states. However, the different initial states of the agents do not affect model performance. For example, on the Ni data set, the initial states of the deep Q network and the competing Q network model are slightly lower than that of the proposed model, yet the proposed model still converges faster than both.

Figure 3 and Figure 4 show that the prediction error does not decrease monotonically during training: in some cases a model reaches a minimum error and then moves back toward a larger error, because it has reached a local optimum at that stage of learning. However, the proposed model can escape the local optimum through the adaptive dynamic fuzzy membership factor and converge to the global optimum, which gives it a certain superiority. Figures 3-5 show that after the agent reaches the optimal state, it can return to a poor state. The reason is that when the agent uses the ε-greedy strategy to choose actions, it performs some unnecessary actions, which reduces the convergence rate of the model and returns the model to a poor state after it has reached the optimal state. Figures 3-6 show that, compared with the contrast models, the proposed model converges faster, performs more stably, and avoids falling into a local optimum.

The convergence time of the models when performing parameter estimation on the different soil heavy metal data sets is shown in Table 1. In the table, min is the minimum convergence time over 10 repeated experiments; mean is the average convergence time over 10 repeated experiments; and >> indicates that the model still has not converged after reaching the maximum number of training rounds. According to the table, the convergence time of the same model differs across data sets, and the convergence time of different models on the same data set also differs. The difference between the minimum and average convergence times of every model is small, which indicates that each model is stable and the experimental results are reliable. The competing deep Q network model does not converge within the maximum number of training rounds on the Cr and Ni data sets. In addition, the minimum and average convergence times of the proposed adaptive deep Q network model on each data set are smaller than those of the contrast models. In summary, the proposed model converges faster and performs better than the contrast models, and it achieves the expected effect.

4.4.2. Prediction Results of the Model

To verify the prediction effect of the proposed model on soil heavy metal content, it is compared with the random forest regression model (RFR) and the inverse distance weighted model (IDW). The results are shown in Figures 7-10, and the comparison between predicted and actual values on the test set is shown in Figure 11. In Figures 7-10, the abscissa is the sampling point and the ordinate is the predicted value (mg/kg). In Figure 11, a is the verification result on the Cd data set; b is the verification result on the Cr data set; c is the verification result on the Ni data set; and d is the verification result on the Pb data set.

As can be seen from Figures 7(a) and 7(c), the IDW model shows an obvious error between predicted and actual values. The proposed model does not show a large error, and its predicted values at most sampling points basically coincide with the actual values, which indicates that adaptively adjusting the hyperparameters improves the prediction accuracy of the model on spatial data. Moreover, Figures 7(b), 8(b), 9(b), and 10(b) show that the RFR model has the smallest error between predicted and actual values on the training set and therefore the best prediction effect there. The reason is that the RFR model has a strong feature extraction ability for nonlinear data and can adapt to high-dimensional data. Overall, Figures 7-10 show that the RFR model has the best spatial prediction performance on the training data set. The proposed model and the IDW model predict poorly at a few sampling points, but the error is within the acceptable range.

Figure 11 shows the prediction effect of the models on the test data set; the overall trends are basically the same. For the RFR model, the error between predicted and actual values on the test set is large and significantly higher than that of the IDW model and the proposed model. The reason may be that the RFR model overfits during training and learns meaningless features, which leads to poor prediction performance. Compared with the IDW model, the predicted values of the proposed model are closer to the actual values, and there are no abnormal predicted values, which indicates that the proposed model has more stable spatial prediction performance. In conclusion, the prediction performance of the proposed model is superior to that of the RFR model and the IDW model.

To analyze the performance of each model quantitatively, the evaluation indicators of the models are compared, and the results are shown in Table 2. On the training data set, the indicator values of the RFR model are lower than those of the proposed model and the IDW model. On the test set, however, the indicator values of the RFR model are greater than those of the proposed model, which indicates that RFR overfits the training set. Avoiding this would require tuning the parameters of the RFR model, which is time-consuming. Compared with the IDW and RFR models, the proposed model has lower MSE, RMSE, MAPE, and MAE values on the test set. The reason is that the adaptive deep Q network adaptively allocates a hyperparameter to each prediction point, which makes the model better suited to the spatial characteristics of the prediction points. These results are consistent with the conclusion above and show that the proposed model performs best on all indicators. Compared with the RFR model and the IDW model, the prediction accuracy of the proposed model is higher by 13.03% and 7.47%, respectively, and its prediction of soil heavy metal content is the best.

5. Conclusion

In conclusion, the proposed prediction method for soil heavy metal content based on deep reinforcement learning uses the deep Q network as its basic model and applies state value reuse so that the agent learns from the training samples quickly, which improves the convergence rate of the model. At the same time, an adaptive fuzzy membership factor is introduced to change the agent's sensitivity to the environmental feedback value in different training periods, which improves the stability of the model after convergence. Moreover, an adaptive inverse distance interpolation method is adopted to predict the observed values of the interpolation points, which improves the prediction accuracy of the model. Compared with the RFR model and the IDW model, the proposed model performs better in MSE, RMSE, MAPE, and MAE, and its prediction accuracy for soil heavy metal content is higher by 13.03% and 7.47%, respectively. Although certain research results have been achieved, there are still some shortcomings. Because of its high interpolation accuracy, the proposed prediction model takes a long time to complete one training run, which is a disadvantage for the practical prediction of soil heavy metal content. Therefore, a more suitable training method needs to be found to shorten the training time.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This work was sponsored in part by science and education joint project of Hunan Natural Science Foundation (2020JJ7058).