1. Introduction
There are several books in the literature that describe state space models in detail [1,2,3,4,5]. A major advantage of these models is the possibility of explicitly integrating the unobservable components of a time series by relating them to each other stochastically.
State space models have in their structure a latent process, the state, which is not observed. The Kalman filter is typically used to estimate it: it is a recursive algorithm that, at each time point, computes the estimator of the state that is optimal in the minimum mean squared error sense when the model is fully specified, as well as one-step-ahead predictions, updating and improving the predictions of the state vector in real time as new observations become available. The Kalman filter was originally developed in control engineering in the 1960s, in one of Kalman's papers [6] describing a recursive solution to the discrete-time linear filtering problem. Today, this algorithm is applied in various areas of study.
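As an illustration of the prediction and update steps described above, the Kalman filter recursions can be sketched for a univariate model with an AR(1) state observed with additive noise (the function name and this simple parameterization are assumptions of the sketch, not the paper's code):

```python
import numpy as np

def kalman_filter(y, phi, sigma2_eps, sigma2_e, x0=0.0, p0=1.0):
    """Kalman filter for an illustrative AR(1)-plus-noise model:
    x_t = phi * x_{t-1} + eps_t (state), y_t = x_t + e_t (observation)."""
    n = len(y)
    x_pred = np.empty(n); p_pred = np.empty(n)   # one-step-ahead predictions
    x_filt = np.empty(n); p_filt = np.empty(n)   # filtered (updated) estimates
    x, p = x0, p0
    for t in range(n):
        # prediction step: propagate state mean and variance
        x, p = phi * x, phi**2 * p + sigma2_eps
        x_pred[t], p_pred[t] = x, p
        # update step: correct with the new observation (innovation form)
        k = p / (p + sigma2_e)          # Kalman gain
        x = x + k * (y[t] - x)
        p = (1.0 - k) * p
        x_filt[t], p_filt[t] = x, p
    return x_pred, p_pred, x_filt, p_filt
```

Each update strictly reduces the state variance relative to the prediction, which is the "improving the predictions in real time" property noted above.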
Usually, the unknown parameters of the model are estimated by the maximum likelihood method under the assumption of normality of the errors; however, this assumption cannot always be guaranteed. Non-parametric estimation methods can also be a valuable contribution when it comes to choosing the initial values for the iterative methods used to optimize the likelihood function, since these algorithms often fail to converge because of a poor initial choice of the parameters. For example, ref. [7] proposes distribution-free estimators based on the generalized method of moments, which do not depend on the distribution of the errors.
Nevertheless, even if the assumption of normality of errors is not verified, the Kalman filter still returns optimal predictions within the class of all linear estimators. However, the optimal properties of Kalman filter predictors can only be ensured when all state space models’ parameters are known. When the unknown parameter vector is replaced by its estimate, the mean squared error of the estimators is underestimated.
The analysis and modeling of dynamic systems through state space models have been quite useful given their flexibility. In their formulation, the state process is assumed to be a Markov process, allowing optimal predictions of the states and, consequently, of the observations to be obtained based only on the optimal estimator of the current state.
Despite these advantages, any prediction model depends on the quality of the data. In particular, meteorological time series are often subject to higher uncertainties, and Kalman filter solutions can be biased [8].
In particular, outliers are an important issue in time series modeling. Time series observations are typically dependent on each other, and the presence of outliers can affect parameter estimates, forecasting, and inference results [9]. For the case of incomplete data and outliers in the observed data, ref. [10] developed a modified robust Kalman filter. Ref. [11] showed that, in linear Gaussian state space models, outliers can affect the estimation of the unknown parameters and, consequently, the state predictions, especially when the measurement error is much larger than the stochasticity of the process. Ref. [12] proposed a non-parametric estimation method based on statistical data depth functions to obtain robust estimates of the mean and the covariance matrix of asset returns; this method is more robust in the presence of outliers and does not require parametric assumptions.
This work arose from the project “TO CHAIR—The Optimal Challenges in Irrigation”, in which short-term forecast models with a state space representation were developed to model the time series of maximum air temperature. For this project, we analyzed data provided by the University of Trás-os-Montes and Alto Douro, corresponding to the maximum air temperature observed on a farm located in the district of Bragança between 20 February and 11 October 2019, and data from the website weatherstack.com, corresponding to forecasts of the same meteorological variable for the same location with a time horizon of 1 to 6 days. The main goal was to improve the accuracy of the forecasts for the farm. However, there were some modeling problems, particularly regarding the convergence of the numerical method, which arose in the presence of outliers.
Therefore, to evaluate and compare the quality of the estimates of the unknown parameters of the linear time-invariant state space model in the presence of outliers, this paper presents four simulation studies: the first is based on the linear Gaussian state space model; the second on the linear Gaussian state space model with contaminated observations; the third on the linear non-Gaussian state space model with exponential errors; and the last on the linear non-Gaussian state space model with exponential errors and contaminated observations. For each of the four studies, several scenarios were tested, in which 2000 samples of size n with valid estimates were simulated. The results were evaluated in terms of the difference between the maximum likelihood estimates and the true parameter values, and in terms of the rate of valid estimates.
2. Simulation Design
In general, the linear univariate state space model is given as follows:

Y_t = h_t X_t + e_t,  (1)
X_t = μ + φ (X_{t−1} − μ) + ε_t,  (2)

where t = 1, 2, … is the discrete time and Y_t is the observed data; h_t is a factor, assumed to be known, that relates the observation to the state X_t at time t; the state X_t follows a stationary autoregressive process of order one, with mean μ and autoregressive parameter φ, |φ| < 1; e_t is the observation error, with zero mean, constant variance σ²_e, and no serial correlation; ε_t is the state disturbance, with zero mean, constant variance σ²_ε, and no serial correlation; the errors e_t and ε_t are uncorrelated with each other at all time points.
This paper aims to investigate under what conditions the presence of outliers affects the estimation of the parameters and states of the state space model. Thus, we simulate time series of size n (n = 50, 200, 500) using the model defined by Equations (1) and (2). For simplicity's sake, in all simulation studies we consider h_t = 1 and a zero-mean state, μ = 0, that is,

Y_t = X_t + e_t,  (3)
X_t = φ X_{t−1} + ε_t.  (4)
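A simulation of the simplified model can be sketched as follows (assuming, as in this sketch, an AR(1) state x_t = phi·x_{t−1} + eps_t observed as y_t = x_t + e_t; the function name and defaults are illustrative):

```python
import numpy as np

def simulate_ssm(n, phi, sigma2_eps, sigma2_e, x0=0.0, seed=None):
    """Simulate y_t = x_t + e_t with x_t = phi * x_{t-1} + eps_t.
    Returns (observations y, latent states x)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=n)  # state disturbances
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n)      # observation errors
    x = np.empty(n)
    x_prev = x0
    for t in range(n):
        x[t] = phi * x_prev + eps[t]
        x_prev = x[t]
    return x + e, x
```

Replacing the Gaussian draws with centered exponential draws gives the non-Gaussian scenarios considered later.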
To create the contamination scenario, we studied real time series of maximum air temperature. We used data from two different sources: the first corresponds to daily records of maximum air temperature between 20 February and 11 October 2019 (234 observations) from a portable weather station installed on a farm located in the Bragança district in northeastern Portugal; the second corresponds to forecasts from the weatherstack.com website. These forecasts have a time horizon of up to 6 days; this means that, for a certain time t, we have forecasts issued 1 to 6 days in advance.
So, first we took the difference between the recorded maximum air temperature and the website's forecasts, where t is the time, in days, and h is the time horizon of the forecasts, h = 1, …, 6 days. Next, we calculated the percentage of outliers in these difference series, which was on average 5%. The outliers were then removed and replaced by linear interpolation, in order to remove the contamination present in the data, and the mean of each series was subtracted so that it had zero mean. Then, for each time horizon h (h = 1, …, 6), the model with the state space representation presented by Equations (3) and (4) was fitted to the data.
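The cleaning step, replacing outliers by linear interpolation and centering the series, can be sketched as follows. The z-score rule used to flag outliers is an assumption of this sketch only, since the paper's exact detection rule is not restated here:

```python
import numpy as np

def clean_series(d, z=3.0):
    """Replace flagged outliers by linear interpolation and centre the series.
    Outliers are flagged with a simple z-score rule (an illustrative
    assumption; any detection rule can be plugged in instead)."""
    d = np.asarray(d, dtype=float)
    mu, sd = d.mean(), d.std()
    is_out = np.abs(d - mu) > z * sd
    idx = np.arange(len(d))
    cleaned = d.copy()
    # linear interpolation across the flagged positions
    cleaned[is_out] = np.interp(idx[is_out], idx[~is_out], d[~is_out])
    # subtract the mean so the cleaned series has zero mean
    return cleaned - cleaned.mean(), is_out
```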
In order to establish a relationship between the estimates of the parameters φ, σ²_e and σ²_ε, obtained from the “non-contaminated” data, and the magnitude of the outliers, a linear regression model was fitted relating the magnitude of the outliers to the total variance of the fitted state space model. In total, the six difference series (h = 1, …, 6) contained 59 outliers.
In this work, four simulation scenarios were tested: (i) the linear Gaussian state space model; (ii) the linear Gaussian state space model with contaminated observations; (iii) the linear non-Gaussian state space model with exponential errors; and (iv) the linear non-Gaussian state space model with exponential errors and contaminated observations.
For each of the four scenarios, samples of sizes n = 50, 200 and 500 were simulated. A range of parameter values was considered: φ ∈ {0.25, 0.75}, and the variances σ²_e and σ²_ε taking the values 0.10, 1.00, 5.00 and 0.10, 2.00, 0.05, respectively. For each parameter combination, 2000 replicates with valid estimates were considered, i.e., estimates within the parameter space: |φ| < 1, σ²_e > 0 and σ²_ε > 0. In all simulations, the same initial state was taken in the Kalman filter.
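The replication scheme, drawing samples until 2000 valid estimates are collected while tracking the total number of attempts for the convergence rate, can be sketched as follows, with `estimate` a hypothetical placeholder for the actual maximum likelihood routine:

```python
import numpy as np

def is_valid(phi_hat, s2e_hat, s2eps_hat):
    """Estimates inside the parameter space: |phi| < 1 and positive variances."""
    return abs(phi_hat) < 1.0 and s2e_hat > 0.0 and s2eps_hat > 0.0

def collect_valid(estimate, n_valid=2000):
    """Call `estimate` (hypothetical: simulates one sample and returns the
    triple (phi_hat, s2e_hat, s2eps_hat)) until n_valid valid estimates are
    collected. Also returns convergence rate = n_valid / total attempts."""
    kept, attempts = [], 0
    while len(kept) < n_valid:
        attempts += 1
        est = estimate()
        if is_valid(*est):
            kept.append(est)
    return np.array(kept), n_valid / attempts
```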
To evaluate the quality of the parameter estimates, we considered the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE), and the convergence rate. The convergence rate is the percentage of valid estimates among all simulations, i.e., the number of valid simulated estimates (in this case, 2000) divided by the total number of simulations performed.
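These three error measures, computed over the replicates against the true parameter value, can be written directly:

```python
import numpy as np

def rmse(est, true):
    """Root Mean Square Error of the estimates around the true value."""
    est = np.asarray(est, dtype=float)
    return float(np.sqrt(np.mean((est - true) ** 2)))

def mae(est, true):
    """Mean Absolute Error."""
    est = np.asarray(est, dtype=float)
    return float(np.mean(np.abs(est - true)))

def mape(est, true):
    """Mean Absolute Percentage Error, in %; requires true != 0."""
    est = np.asarray(est, dtype=float)
    return float(100.0 * np.mean(np.abs((est - true) / true)))
```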
To estimate the unknown parameters of the state space model (3) and (4) in each simulation, the maximum likelihood method was used, assuming normality of the disturbances, for all four scenarios. Log-likelihood maximization was performed with the Newton–Raphson numerical method. In this study, the R package “astsa” was used [3,13,14].
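The paper performs this maximization in R with the “astsa” package. Purely as an illustration of the idea, the Gaussian log-likelihood can be evaluated through the prediction error (innovations) decomposition of the Kalman filter and then maximized numerically; here a coarse grid search stands in for Newton–Raphson, and all names are illustrative:

```python
import numpy as np

def neg_loglik(params, y):
    """Gaussian -log L via the innovations decomposition for the
    illustrative model x_t = phi*x_{t-1} + eps_t, y_t = x_t + e_t."""
    phi, s2eps, s2e = params
    if not (abs(phi) < 1.0 and s2eps > 0.0 and s2e > 0.0):
        return np.inf                      # outside the parameter space
    x, p = 0.0, s2eps / (1.0 - phi**2)     # stationary initialisation
    nll = 0.0
    for yt in y:
        x, p = phi * x, phi**2 * p + s2eps  # predict
        f = p + s2e                          # innovation variance
        v = yt - x                           # innovation
        nll += 0.5 * (np.log(2.0 * np.pi * f) + v**2 / f)
        k = p / f                            # update
        x, p = x + k * v, (1.0 - k) * p
    return nll

def fit_mle(y, phis, s2eps_grid, s2e_grid):
    """Coarse grid search as a simple stand-in for the Newton-Raphson
    optimisation used in the paper (illustration only)."""
    best, best_nll = None, np.inf
    for phi in phis:
        for s2eps in s2eps_grid:
            for s2e in s2e_grid:
                nll = neg_loglik((phi, s2eps, s2e), y)
                if nll < best_nll:
                    best, best_nll = (phi, s2eps, s2e), nll
    return best, best_nll
```

Returning +infinity outside the parameter space is one simple way of encoding the validity constraints |φ| < 1, σ²_e > 0 and σ²_ε > 0 during optimization.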
3. Results
In this section, the simulation results are presented.
Table 1, Table 2 and Table 3 present the results of the simulations in terms of the RMSE, MAE, MAPE (%) and the convergence rate (%) for sample sizes n = 50, n = 200 and n = 500, respectively, considering both non-contaminated (NC) and contaminated Gaussian errors.
Table 4, Table 5 and Table 6 show the simulation results considering contaminated and non-contaminated exponential errors.
As expected, contamination had an impact on the performance of the maximum likelihood estimators.
First, it is seen that for small sample sizes, the convergence rate tends to decrease even with non-contaminated errors. For example, for n = 50 in the case of non-contaminated Gaussian errors, the convergence rate was over 72% for one of the values of the autoregressive parameter and over 57% for the other. For contaminated Gaussian and exponential errors, the convergence rate decreased further compared with non-contaminated errors.
Overall, an improvement in the rate of valid estimates (convergence rate) is noticeable for the larger sample sizes when compared with n = 50 in the case of non-contaminated Gaussian and exponential errors. In the case of contaminated Gaussian and exponential errors, this behavior occurred only for part of the parameter configurations.
When the errors are not contaminated, the RMSE, MAE and MAPE tend to decrease with increasing sample size. However, this does not hold when the errors are contaminated. In fact, it was found that for both Gaussian and exponential errors, outliers had more impact in two situations: when the sample size is small (n = 50) and the autoregressive parameter is close to 1 (Table 1 and Table 4); and when the sample size is large (n = 500) and the autoregressive parameter is 0.25 (Table 3 and Table 6). This impact is reflected in the RMSE, MAE and MAPE, which reached very high values.
Furthermore, there are many cases where the RMSE of the estimators under contaminated errors is several times higher than under non-contaminated errors. For example, in one of the Gaussian-error configurations with n = 500, the RMSE of the estimators of the autoregressive parameter and of the two error variances under contaminated Gaussian errors were about 3, 6 and 11 times higher, respectively, than under non-contaminated Gaussian errors (Table 3).
On the other hand, comparing the Gaussian and exponential error cases, we find no significant differences in the convergence rate or in the efficiency of the autoregressive estimator. However, the RMSE, MAE and MAPE of the two variance estimators are, in general, higher in the case of exponential errors.
4. Discussion
In this work, outliers were found to impact the performance of the maximum likelihood estimators. In particular, the simulation study showed that outliers have a very significant impact in two situations: when the sample size is small and the autoregressive parameter is close to 1, and when the sample size is large and the autoregressive parameter is close to 0.25. This impact was reflected in the RMSE, MAE and MAPE values, which in many cases were much higher than in the case of non-contaminated errors.
Moreover, we notice that the rate of valid estimates (convergence rate) is higher for large sample sizes, and this is more evident for non-contaminated Gaussian and exponential errors. It is also important to have large sample sizes to avoid problems related to parameter estimation [11]. In general, the convergence rate is lower when the Gaussian and exponential errors are contaminated.
Therefore, our next step is to develop methods to detect outliers in time series and/or to establish estimation methods that are more robust, in the sense that they do not assume a distribution for the data and are less sensitive to outliers.
In this work, the outliers were generated from a regression model that established a linear relationship between the magnitude of the outliers and the total variance of the model with the state space representation of maximum air temperature real data. The rate of outliers from the real data was 5%; thus, this was the percentage used in this work.
In the literature, we did not find a unanimous approach for doing this. For example, ref. [15] contaminated the error of the zero-mean Gaussian observation equation by replacing the standard deviation of the observation error with a 10-times-higher standard deviation with a probability of 10% (symmetric outliers). They also considered the case of asymmetric outliers, where the zero mean of the observation error was replaced with a value 10 times the standard deviation, with a probability of 10%. Ref. [16] followed the same line as [15], but called symmetric outliers “zero-mean” and asymmetric outliers “non-zero-mean”, considering a contamination probability of 5%. Ref. [9] contaminated both the observation and state equation errors, taking the magnitude of the outliers equal to 2.5 times the standard deviation obtained from the diagonal elements of the observation and state covariance matrices, respectively.
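The two contamination schemes just described, symmetric (variance-inflation) outliers and asymmetric (mean-shift) outliers, can be sketched as follows; the function name and its defaults are illustrative assumptions:

```python
import numpy as np

def contaminate(e, sd, prob=0.10, factor=10.0, mode="symmetric", seed=None):
    """Contaminate observation errors e (with standard deviation sd).
    'symmetric': redraw selected errors with a factor-times-larger standard
    deviation; 'asymmetric': redraw them with mean factor*sd (non-zero-mean
    outliers). Returns the contaminated series and the outlier mask."""
    rng = np.random.default_rng(seed)
    e = np.asarray(e, dtype=float).copy()
    hit = rng.random(e.size) < prob        # positions selected for contamination
    if mode == "symmetric":
        e[hit] = rng.normal(0.0, factor * sd, size=hit.sum())
    else:
        e[hit] = rng.normal(factor * sd, sd, size=hit.sum())
    return e, hit
```

Setting `prob=0.05` reproduces the 5% contamination rate used in this work.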