1. Introduction
There are several books in the literature that describe state space models in detail [1,2,3,4,5]. A major advantage of these models is the possibility of explicitly integrating the unobservable components of a time series by relating them to each other stochastically.
State space models have in their structure a latent process, the state, which is not observed. The Kalman filter is typically used to estimate it: it is a recursive algorithm that, at each time point, computes the estimator of the state that is optimal in the minimum mean squared error sense when the model is fully specified, as well as one-step-ahead predictions, updating and improving the predictions of the state vector in real time as new observations become available. The Kalman filter was originally developed in control engineering in the 1960s, in one of Kalman's papers [6] describing a recursive solution to the discrete-time linear filtering problem. Today, this algorithm is applied in various areas of study.
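As an illustration of the prediction and update steps described above, the Kalman filter recursions can be sketched for a univariate model with an AR(1) state observed with additive noise (the function name and this simple parameterization are assumptions of the sketch, not the paper's code):

```python
import numpy as np

def kalman_filter(y, phi, sigma2_eps, sigma2_e, x0=0.0, p0=1.0):
    """Kalman filter for an illustrative AR(1)-plus-noise model:
    x_t = phi * x_{t-1} + eps_t (state), y_t = x_t + e_t (observation)."""
    n = len(y)
    x_pred = np.empty(n); p_pred = np.empty(n)   # one-step-ahead predictions
    x_filt = np.empty(n); p_filt = np.empty(n)   # filtered (updated) estimates
    x, p = x0, p0
    for t in range(n):
        # prediction step: propagate state mean and variance
        x, p = phi * x, phi**2 * p + sigma2_eps
        x_pred[t], p_pred[t] = x, p
        # update step: correct with the new observation (innovation form)
        k = p / (p + sigma2_e)          # Kalman gain
        x = x + k * (y[t] - x)
        p = (1.0 - k) * p
        x_filt[t], p_filt[t] = x, p
    return x_pred, p_pred, x_filt, p_filt
```

Each update strictly reduces the state variance relative to the prediction, which is the "improving the predictions in real time" property noted above.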
Usually, the unknown parameters of the model are estimated by the maximum likelihood method under the assumption of normality of the errors; however, this assumption cannot always be guaranteed. Non-parametric estimation methods can also be a valuable contribution when it comes to choosing the initial values for the iterative methods used to optimize the likelihood function, since these algorithms often fail to converge because of a poor initial choice of the parameters. For example, ref. [7] proposes distribution-free estimators based on the generalized method of moments, which do not depend on the distribution of the errors.
Nevertheless, even if the assumption of normality of errors is not verified, the Kalman filter still returns optimal predictions within the class of all linear estimators. However, the optimal properties of Kalman filter predictors can only be ensured when all state space models’ parameters are known. When the unknown parameter vector is replaced by its estimate, the mean squared error of the estimators is underestimated.
The analysis and modeling of dynamic systems through state space models have been quite useful given their flexibility. In their formulation, the state process is assumed to be a Markov process, allowing optimal predictions of the states and, consequently, of the observations to be obtained based only on the optimal estimator of the current state.
Despite these advantages, any prediction model depends on the quality of the data. In particular, meteorological time series are often subject to higher uncertainties, and Kalman filter solutions can be biased [8].
In particular, outliers are an important issue in time series modeling. Time series observations are typically dependent on each other, and the presence of outliers can affect parameter estimates, forecasting, and inference results [9]. For the case of incomplete data and outliers in the observed data, ref. [10] developed a modified robust Kalman filter. Ref. [11] showed that, in linear Gaussian state space models, outliers can affect the estimation of the unknown parameters and, consequently, the state predictions, especially when the measurement error is much larger than the stochasticity of the process. Ref. [12] proposed a non-parametric estimation method based on statistical data depth functions to obtain robust estimates of the mean and the covariance matrix of asset returns; this method is more robust in the presence of outliers and does not require parametric assumptions.
This work arose from the project “TO CHAIR—The Optimal Challenges in Irrigation”, in which short-term forecast models with a state space representation were developed to model the time series of maximum air temperature. For this project, we analyzed data provided by the University of Trás-os-Montes and Alto Douro, corresponding to the maximum air temperature observed on a farm located in the district of Bragança between 20 February and 11 October 2019, and data from the website weatherstack.com, corresponding to forecasts of the same meteorological variable for the same location with a time horizon of 1 to 6 days. The main goal was to improve the accuracy of the forecasts for the farm. However, there were some modeling problems, particularly regarding the convergence of the numerical method, which arose in the presence of outliers.
Therefore, to evaluate and compare the quality of the estimates of the unknown parameters of the linear time-invariant state space model in the presence of outliers, this paper presents four simulation studies: the first is based on the linear Gaussian state space model; the second on the linear Gaussian state space model with contaminated observations; the third on the linear non-Gaussian state space model with exponential errors; and the last on the linear non-Gaussian state space model with exponential errors and contaminated observations. For each of the four studies, several scenarios were tested, in which 2000 samples of size n with valid estimates were simulated. The results were evaluated in terms of the difference between the maximum likelihood estimates and the true parameter values, and in terms of the rate of valid estimates.
2. Simulation Design
In general, the linear univariate state space model is given as follows:

Y_t = h_t X_t + e_t,  (1)
X_t = μ + φ (X_{t−1} − μ) + ε_t,  (2)

where t = 1, 2, … is the discrete time and Y_t is the observed data; h_t is a factor, assumed to be known, that relates the observation to the state X_t at time t; the state X_t follows a stationary autoregressive process of order one, with mean μ and autoregressive parameter φ, |φ| < 1; e_t is the observation error, with zero mean, constant variance σ²_e, and no serial correlation; ε_t is the state disturbance, with zero mean, constant variance σ²_ε, and no serial correlation; the errors e_t and ε_t are uncorrelated with each other at all time points.
This paper aims to investigate under what conditions the presence of outliers affects the estimation of the parameters and states of the state space model. Thus, we simulate time series of size n (n = 50, 200, 500) using the model defined by Equations (1) and (2). For simplicity's sake, in all simulation studies we consider h_t = 1 and a zero-mean state, μ = 0, that is,

Y_t = X_t + e_t,  (3)
X_t = φ X_{t−1} + ε_t.  (4)
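A simulation of the simplified model can be sketched as follows (assuming, as in this sketch, an AR(1) state x_t = phi·x_{t−1} + eps_t observed as y_t = x_t + e_t; the function name and defaults are illustrative):

```python
import numpy as np

def simulate_ssm(n, phi, sigma2_eps, sigma2_e, x0=0.0, seed=None):
    """Simulate y_t = x_t + e_t with x_t = phi * x_{t-1} + eps_t.
    Returns (observations y, latent states x)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=n)  # state disturbances
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n)      # observation errors
    x = np.empty(n)
    x_prev = x0
    for t in range(n):
        x[t] = phi * x_prev + eps[t]
        x_prev = x[t]
    return x + e, x
```

Replacing the Gaussian draws with centered exponential draws gives the non-Gaussian scenarios considered later.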
To create the contamination scenario, we studied real time series of maximum air temperature. We used data from two different sources: the first corresponds to daily records of maximum air temperature between 20 February and 11 October 2019 (234 observations) from a portable weather station installed on a farm located in the Bragança district in northeastern Portugal; the second corresponds to forecasts from the weatherstack.com website. These forecasts have a time horizon of up to 6 days; this means that, for a certain time t, we have forecasts issued 1 to 6 days in advance.
So, first we took the difference between the recorded maximum air temperature and the website's forecasts, where t is the time, in days, and h is the time horizon of the forecasts, h = 1, …, 6 days. Next, we calculated the percentage of outliers in these difference series, which was on average 5%. The outliers were then removed and replaced by linear interpolation, in order to remove the contamination present in the data, and the mean of each series was subtracted so that it had zero mean. Then, for each time horizon h (h = 1, …, 6), the model with the state space representation presented by Equations (3) and (4) was fitted to the data.
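The cleaning step, replacing outliers by linear interpolation and centering the series, can be sketched as follows. The z-score rule used to flag outliers is an assumption of this sketch only, since the paper's exact detection rule is not restated here:

```python
import numpy as np

def clean_series(d, z=3.0):
    """Replace flagged outliers by linear interpolation and centre the series.
    Outliers are flagged with a simple z-score rule (an illustrative
    assumption; any detection rule can be plugged in instead)."""
    d = np.asarray(d, dtype=float)
    mu, sd = d.mean(), d.std()
    is_out = np.abs(d - mu) > z * sd
    idx = np.arange(len(d))
    cleaned = d.copy()
    # linear interpolation across the flagged positions
    cleaned[is_out] = np.interp(idx[is_out], idx[~is_out], d[~is_out])
    # subtract the mean so the cleaned series has zero mean
    return cleaned - cleaned.mean(), is_out
```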
In order to establish a relationship between the estimates of the parameters φ, σ²_e and σ²_ε, obtained from the “non-contaminated” data, and the magnitude of the outliers, a linear regression model was fitted relating the magnitude of the outliers to the total variance of the fitted state space model. In total, the six difference series (h = 1, …, 6) contained 59 outliers.
In this work, four simulation scenarios were tested: (i) the linear Gaussian state space model; (ii) the linear Gaussian state space model with contaminated observations; (iii) the linear non-Gaussian state space model with exponential errors; and (iv) the linear non-Gaussian state space model with exponential errors and contaminated observations.
For each of the four scenarios, samples of sizes n = 50, 200 and 500 were simulated. A range of parameter values was considered: φ ∈ {0.25, 0.75}, and the variances σ²_e and σ²_ε taking the values 0.10, 1.00, 5.00 and 0.10, 2.00, 0.05, respectively. For each parameter combination, 2000 replicates with valid estimates were considered, i.e., estimates within the parameter space: |φ| < 1, σ²_e > 0 and σ²_ε > 0. In all simulations, the same initial state was taken in the Kalman filter.
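The replication scheme, drawing samples until 2000 valid estimates are collected while tracking the total number of attempts for the convergence rate, can be sketched as follows, with `estimate` a hypothetical placeholder for the actual maximum likelihood routine:

```python
import numpy as np

def is_valid(phi_hat, s2e_hat, s2eps_hat):
    """Estimates inside the parameter space: |phi| < 1 and positive variances."""
    return abs(phi_hat) < 1.0 and s2e_hat > 0.0 and s2eps_hat > 0.0

def collect_valid(estimate, n_valid=2000):
    """Call `estimate` (hypothetical: simulates one sample and returns the
    triple (phi_hat, s2e_hat, s2eps_hat)) until n_valid valid estimates are
    collected. Also returns convergence rate = n_valid / total attempts."""
    kept, attempts = [], 0
    while len(kept) < n_valid:
        attempts += 1
        est = estimate()
        if is_valid(*est):
            kept.append(est)
    return np.array(kept), n_valid / attempts
```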
To evaluate the quality of the parameter estimates, we considered the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), and the convergence rate. The convergence rate is the percentage of valid estimates among all simulations, i.e., the number of valid simulated estimates (in this case, 2000) divided by the total number of simulations performed.
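These three error measures, computed over the replicates against the true parameter value, can be written directly:

```python
import numpy as np

def rmse(est, true):
    """Root Mean Square Error of the estimates around the true value."""
    est = np.asarray(est, dtype=float)
    return float(np.sqrt(np.mean((est - true) ** 2)))

def mae(est, true):
    """Mean Absolute Error."""
    est = np.asarray(est, dtype=float)
    return float(np.mean(np.abs(est - true)))

def mape(est, true):
    """Mean Absolute Percentage Error, in %; requires true != 0."""
    est = np.asarray(est, dtype=float)
    return float(100.0 * np.mean(np.abs((est - true) / true)))
```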
To estimate the unknown parameters of the state space model (3) and (4) in each simulation, the maximum likelihood method was used, assuming normality of the disturbances, for all four scenarios. Log-likelihood maximization was performed with the Newton–Raphson numerical method. In this study, the R package “astsa” was used [3,13,14].
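The paper performs this maximization in R with the “astsa” package. Purely as an illustration of the idea, the Gaussian log-likelihood can be evaluated through the prediction error (innovations) decomposition of the Kalman filter and then maximized numerically; here a coarse grid search stands in for Newton–Raphson, and all names are illustrative:

```python
import numpy as np

def neg_loglik(params, y):
    """Gaussian -log L via the innovations decomposition for the
    illustrative model x_t = phi*x_{t-1} + eps_t, y_t = x_t + e_t."""
    phi, s2eps, s2e = params
    if not (abs(phi) < 1.0 and s2eps > 0.0 and s2e > 0.0):
        return np.inf                      # outside the parameter space
    x, p = 0.0, s2eps / (1.0 - phi**2)     # stationary initialisation
    nll = 0.0
    for yt in y:
        x, p = phi * x, phi**2 * p + s2eps  # predict
        f = p + s2e                          # innovation variance
        v = yt - x                           # innovation
        nll += 0.5 * (np.log(2.0 * np.pi * f) + v**2 / f)
        k = p / f                            # update
        x, p = x + k * v, (1.0 - k) * p
    return nll

def fit_mle(y, phis, s2eps_grid, s2e_grid):
    """Coarse grid search as a simple stand-in for the Newton-Raphson
    optimisation used in the paper (illustration only)."""
    best, best_nll = None, np.inf
    for phi in phis:
        for s2eps in s2eps_grid:
            for s2e in s2e_grid:
                nll = neg_loglik((phi, s2eps, s2e), y)
                if nll < best_nll:
                    best, best_nll = (phi, s2eps, s2e), nll
    return best, best_nll
```

Returning +infinity outside the parameter space is one simple way of encoding the validity constraints |φ| < 1, σ²_e > 0 and σ²_ε > 0 during optimization.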
3. Results
In this section, the simulation results are presented.
Table 1, Table 2 and Table 3 present the results of the simulations in terms of the RMSE, MAE, MAPE (%) and the convergence rate (%) for sample sizes n = 50, n = 200 and n = 500, respectively, considering both non-contaminated (NC) and contaminated Gaussian errors.
Table 4, Table 5 and Table 6 show the simulation results considering contaminated and non-contaminated exponential errors.
As expected, contamination had an impact on the performance of the maximum likelihood estimators.
First, it is seen that for small sample sizes, the convergence rate tends to decrease even with non-contaminated errors. For example, for n = 50 in the case of non-contaminated Gaussian errors, the convergence rate was over 72% for one of the values of the autoregressive parameter and over 57% for the other. For contaminated Gaussian and exponential errors, the convergence rate decreased further compared with non-contaminated errors.
Overall, an improvement in the rate of valid estimates (convergence rate) is noticeable for the larger sample sizes when compared with n = 50 in the case of non-contaminated Gaussian and exponential errors. In the case of contaminated Gaussian and exponential errors, this behavior occurred only for part of the parameter configurations.
When the errors are not contaminated, the RMSE, MAE and MAPE tend to decrease with increasing sample size. However, this does not hold when the errors are contaminated. In fact, it was found that for both Gaussian and exponential errors, outliers had more impact in two situations: when the sample size is small (n = 50) and the autoregressive parameter is close to 1 (Table 1 and Table 4); and when the sample size is large (n = 500) and the autoregressive parameter is 0.25 (Table 3 and Table 6). This impact is reflected in the RMSE, MAE and MAPE, which reached very high values.
Furthermore, there are many cases where the RMSE of the estimators under contaminated errors is several times higher than under non-contaminated errors. For example, in one of the Gaussian-error configurations with n = 500, the RMSE of the estimators of the autoregressive parameter and of the two error variances under contaminated Gaussian errors were about 3, 6 and 11 times higher, respectively, than under non-contaminated Gaussian errors (Table 3).
On the other hand, comparing the Gaussian and exponential error cases, we find no significant differences in the convergence rate or in the efficiency of the autoregressive estimator. However, the RMSE, MAE and MAPE of the two variance estimators are, in general, higher in the case of exponential errors.
4. Discussion
In this work, outliers were found to impact the performance of the maximum likelihood estimators. In particular, the simulation study showed that outliers have a very significant impact in two situations: when the sample size is small and the autoregressive parameter is close to 1, and when the sample size is large and the autoregressive parameter is close to 0.25. This impact was reflected in the RMSE, MAE and MAPE values, which in many cases were much higher than in the case of non-contaminated errors.
Moreover, we notice that the rate of valid estimates (convergence rate) is higher for large sample sizes, and this is more evident for non-contaminated Gaussian and exponential errors. It is also important to have large sample sizes to avoid problems related to parameter estimation [11]. In general, the convergence rate is lower when the Gaussian and exponential errors are contaminated.
Therefore, our next step is to develop methods to detect outliers in time series and/or to establish estimation methods that are more robust, in the sense that they do not assume a distribution for the data and are less sensitive to outliers.
In this work, the outliers were generated from a regression model that established a linear relationship between the magnitude of the outliers and the total variance of the model with the state space representation of maximum air temperature real data. The rate of outliers from the real data was 5%; thus, this was the percentage used in this work.
In the literature, we did not find a unanimous approach for doing this. For example, ref. [15] contaminated the error of the zero-mean Gaussian observation equation by replacing the standard deviation of the observation error with a 10-times-higher standard deviation with a probability of 10% (symmetric outliers). They also considered the case of asymmetric outliers, where the zero mean of the observation error was replaced with a value 10 times the standard deviation, with a probability of 10%. Ref. [16] followed the same line as [15], but called symmetric outliers “zero-mean” and asymmetric outliers “non-zero-mean”, considering a contamination probability of 5%. Ref. [9] contaminated both the observation and state equation errors, taking the magnitude of the outliers equal to 2.5 times the standard deviation obtained from the diagonal elements of the observation and state covariance matrices, respectively.
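The two contamination schemes just described, symmetric (variance-inflation) outliers and asymmetric (mean-shift) outliers, can be sketched as follows; the function name and its defaults are illustrative assumptions:

```python
import numpy as np

def contaminate(e, sd, prob=0.10, factor=10.0, mode="symmetric", seed=None):
    """Contaminate observation errors e (with standard deviation sd).
    'symmetric': redraw selected errors with a factor-times-larger standard
    deviation; 'asymmetric': redraw them with mean factor*sd (non-zero-mean
    outliers). Returns the contaminated series and the outlier mask."""
    rng = np.random.default_rng(seed)
    e = np.asarray(e, dtype=float).copy()
    hit = rng.random(e.size) < prob        # positions selected for contamination
    if mode == "symmetric":
        e[hit] = rng.normal(0.0, factor * sd, size=hit.sum())
    else:
        e[hit] = rng.normal(factor * sd, sd, size=hit.sum())
    return e, hit
```

Setting `prob=0.05` reproduces the 5% contamination rate used in this work.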