Abstract

This paper provides a model-based method for the forecast of the total number of currently COVID-19 positive individuals and of the occupancy of the available intensive care units in Italy. The predictions obtained—for a time horizon of 10 days starting from March 29th—will be provided at a national as well as at a more disaggregated level, following a criterion based on the magnitude of the phenomenon. While those regions hit the most by the pandemic have been kept separated, those less affected regions have been aggregated into homogeneous macroareas. Results show that—within the forecast period considered (March 29th–April 7th)—all of the Italian regions will show a decreasing number of COVID-19 positive people. The same will be observed for the number of people who will need to be hospitalized in an intensive care unit. These estimates are valid under constancy of the government’s current containment policies. In this scenario, northern regions will remain the most affected ones, whereas no significant outbreaks are foreseen in the southern regions.

1. Introduction

On March 19th, the death toll paid by Italy for the spread of the virus COVID-19 amounted to 3405 deaths, the highest paid by a single country in the world. Despite hard and relatively timely lockdown policies implemented by the government, on March 26th this figure has risen to 8165 deaths.

In such an emergency situation, a reliable forecast method for the infection development is essential for policy and decision makers to design evidence-based policies and to implement fast actions to curb the spread of the infection. In particular, predicting the number of people currently tested positive for COVID-19 (thereafter “positive cases”) could be useful to draw the epidemiological curve of the infection and therefore to predict its peak. Other than this variable, the forecasting procedure presented in this paper is used to predict the future values of another crucial variable, i.e., the number of people needing hospitalization in an intensive care unit (ICU). The Italian ICU system is at the moment severely stressed due to the spread of the disease; therefore, predictions of future ICU demand could be fruitfully considered in the design and the implementation of operational schemes. The forecast horizon for both the variables is of 10 days starting from March 29th.

Since the Italian regions are affected in different extents by the COVID-19, it has been decided to perform the forecasting exercise for the following geographical areas: Lombardia, Piedmont, Valle d’Aosta, Veneto, Friuli Venezia Giulia, Trentino Alto Adige, Lazio, and Campania. The remaining regions have been grouped in the following macroareas: “Center” (Marche, Umbria, and Toscana) and “South” (Abruzzo, Molise, Puglia Basilicata, Calabria, Sicilia, and Sardegna). At least two other reasons justify such a break down:(1)The different starting times recorded for the lockdowns(2)The southern regions have been hit less severely and therefore, especially at the beginning of the observation period, show several zeroes or low numbers across the considered time span

In essence, in this study, the available official data, detailed in Section 2, have been employed in a three-step procedure, i.e.:(1)Data preprocessing, in which data anomalies are identified and corrected according to an approach of the type a Kalman filter(2)Univariate forecasting, based on an autoregressive moving average (ARMA) model for number of positive cases and ICU(3)Bootstrap-based generation of predicted values and confidence intervals

2. The Data

This paper employs the data related to COVID-19, collected and regularly updated by the Italian National Institute of Health (an agency of the Italian Ministry of Health) and by the Italian Civil Protection Department. The whole data set is freely and publicly available in a comprehensive database, accessible on the Internet at the web address https://github.com/pcm-dpc/COVID-19/tree/master/dati-regioni (the file name is dpc-covid19-ita-regioni-20200323.csv). It collects crucial data related to all the persons tested for COVID-19—from the outbreak of the pandemics (February 24th)—and, in particular,(1)is a collection of 21 data points—representing 19 Italian regions plus the two autonomous provinces of Trento and Bolzano—one for each day starting from the disease’s outbreak,(2)considers crucial variables, such as positive cases, recovered cases, deaths, number of people hospitalized, and number of people admitted to intensive care units (ICUs).

As already pointed out, in the present study, the variables of interest are the number of people who have been(1)tested positive for COVID-19 (in what follows denoted by the bold Latin letter V),(2)hospitalized in an ICU (which will be denoted by the bold Latin letter U).

It is worth outlining how, according to the regulations issued by the Italian government, only the people showing moderate to severe symptoms, generally associated with the infection, or who have been in close proximity with at least one positive person, are tested. Therefore, the predictions obtained are to be referred to the sample, as no attempt have been made to carry out inference procedures for the estimation of the variables at the population level.

In order to correctly process the data, all the regions showing no positive cases at the beginning of the recording period and/or low values along the whole time span have been aggregated into macroareas. This has been done to (i) give more meaningful results and (ii) save degrees of freedom (which are always precious in short time series).

In details, the prediction exercise will be performed on the following regions/macroareas: (A)Nothtern regions(1)Lombardia(2)Piedmont(3)Valle d’ Aosta(4)Veneto(5)Friuli Venezia Giulia(6)Emilia-Romagna(7)Liguria(8)Macroarea “Trentino Alto Adige” (Trento and Bolzano)(B)Center regions(1)Lazio(2)Macroarea “Center”: Marche, Umbria, and Toscana(C)Southern regions(1)Campania(2)Macroarea “South”: Abruzzo, Molise, Puglia, Basilicata, Calabria, Sicilia, and Sardegna

The north Italy regions—at the moment the more severely affected by the pandemic—have been treated separately along with two other regions, i.e., Lazio and Campania, since their major cities—Rome and Naples—deserve special attention for the institutional role played and their population density. On the other hand, the regions showing less worrying figures have been aggregated into macroareas according to their geolocation. The only exception is Valle d’Aosta, which has been left separated as no aggregation options could be found.

To simplify notation, for both the variables of interest V and U, the following convention is introduced:

Here, the upper left superscript (denoted by the upper case Latin letter K) refers to the geographical areas (i.e., North, Center, and South), whereas the subscript is associated with the number the different regions or macroareas are codified with, as above detailed. For example, by the symbols and the number of positive cases for the Emilia-Romagna region and the Center macroarea “Marche-Umbria-Toscana” are respectively identified.

3. Data Preprocessing

Missing data and other anomalies become the first challenge when designing predictive models, as statistical methods, in general, are designed and tested under the assumption of no missing observations [1]. Before delving into the details of the proposed procedure, a word of caution is needed since, unfortunately, a visual inspection of the data suggests the presence of a number of anomalous data, both at a regional and a country level. The detected anomalies might be associated with the biological sample collecting process and the related testing procedures. In fact, the typical lab workflow is governed by a set of rigid protocols which might be critically affected by factors such as the availability of manpower, swabs, reagents, and other laboratory materials. In emergency situations, such a workflow can be disrupted, and temporal inconsistency might appear as a result. For example, a set of samples might be delivered to a laboratory with longer than usual delays with respect to the time of collection, or a given lab can only complete the screening process for a certain number of samples. In both the cases, a shift of one day (or more) in the release of the lab results can be reasonably expected. A further source of anomalies is represented by the data entry and data editing processes, carried out in working environment likely affected by the risk of contagious and under rigid deadlines.

An example of such anomalous data is given in Figure 1, where the series (Lombardia) is depicted. Here, some data points showing values inconsistent with the overall pattern are clearly noticeable. Given the (very small) available sample size, the relative weight of such data is almost surely not negligible and can introduce severe distortions in the model parameter inference procedures and thus in the predicted values.

In order to correct those data, a Kalman smoother state-space model [2] has been applied. In particular, the Kalman smoother adopted is of the type fixed-point smoothing. This algorithm is designed to obtain the estimate of a realization (the time is fixed ) of a given random variable , given a set of observations . A thorough explanation of this method goes beyond the scope of the paper; therefore, the interested reader is referred to the excellent paper by Sage and Melsa [3].

In Figure 2, the corrected version of the series —resulting by applying the Kalman smoother—is depicted. Not only this series lends itself to a better visual inspection but, more importantly, is more suitable to be processed by the adopted prediction model.

4. Theoretical Framework

The approach used in this paper relies on (i) the theory of stochastic process and (ii) a resampling method. While the former is necessary to generate the input (predicted values) of the bootstrap algorithm, as well as to justify the employment of the outlier correction method, the latter serves the purpose of(1)generating the final predictions, which are affected by a reduced amount of uncertainty (with respect to those generated by the stochastic model),(2)yielding the related confidence intervals.

4.1. The Stochastic Processes Paradigm

The approach proposed in the present paper relies on the assumption that the (transformed) time series and are approximately a realization of a process of the type ARMA (autoregressive moving average) [4].

Let be a real order stationary process, and it is said to admit a ARMA(p, q) representation if, for some constant , , will beunder the following conditions:

Here, denotes the sigma algebra induced by , and and are assumed not to have common zero.

The above conditions assure that can be represented aswith decreasing to 0 at geometric rate.

The dynamics of the series under investigations are not suitable for this theoretical framework as it requires 2nd-order stationarity and homoscedasticity; those conditions are simultaneously achieved by preprocessing the series according to the following filter: , being the symbol the difference operator and the exponent indicating the order of the difference. To fully understand the role played by , the backward operator is now introduced. In essence, moves the time index of an observation back by time intervals, i.e., , and thus we have

4.2. The Resampling Method

In order to extract valuable information from our data and, at the same time, decrease the total amount of uncertainty associated with the outcomes of the ARMA model, a resampling procedure has been employed. Among the several resampling methods for dependent data available—many of which are freely and publicly available in the form of powerful routines working under software packages such as Python® or R®—the adopted resampling method is of the type maximum entropy bootstrap (MEB). Proposed by Vinod [5] and subsequently improved (see, e.g., Vinod [6]), it is based on basic assumptions which are different from those usually followed by standard schemes. In more details, while in the classic bootstrap an ensemble represents the population of reference the observed time series is drawn from, in MEB a large number of ensembles (subsets), say becomes the elements belonging to , each of them containing a large number of replicates .

Unlike standard bootstrap schemes, in the MEB case the resample set mimics the observed realization of the underlying stochastic process, in MEB a large number of subsets, say becomes the elements belonging to , each of them containing a large number of replicates . Among the important features of the MEB scheme, it is worth mentioning the consistency of its bootstrap samples with the ergodic theorem (see, e.g., Birkhoff [7]) and with the probabilistic structure of the observed time series. In Figure 3, an example of the application of MEB for the variable is given.

5. The Forecasting Method

In what follows, the proposed procedure is presented in a step-by-step fashion:(1)Equation (2) is estimated for both and so that the model orders (equation (2)) and become available.(2)For each time series and , the MEB procedure is applied so that the sets and —each containing “bonafide” replications—are available, i.e., and (in Figure 4, the set for the variable is given).(3)For each of the replications stored in , equation (2) is estimated according to the model order selected, i.e., , and the 1- to 10-step-ahead predictions—as well as the 5% and 95% bootstrap confidence interval—are generated.(4)The B predictions and the confidence intervals obtained in the previous step are stored in the matrix , whose columns are lower bootstrap confidence interval, bootstrap prediction, and upper bootstrap confidence interval, respectively, denoted by the symbols , , and .(5)The median value is then extracted along with the 95% confidence intervals and , computed according to the t-percentile method. The explanation of this procedure goes beyond the scope of this paper; therefore, the interested reader is referred to the excellent paper by Berkowitz and Kilian [8].(6) and (the subscript is omitted for brevity) are computed conditional to a subset of , say , made up of the bootstrap replications whose range falls between the minimum and maximum values of the values of the confidence intervals computed for . In symbols,(7)Steps 1–6 are repeated for , so that a new matrix of prediction of dimension is built, i.e., , whose columns are as in and denoted by the symbols , , and .

Unfortunately, the whole procedure cannot be considered fully automatic since the estimation of equation (2) (step 1) is required.

5.1. The Adopted Models

The stochastic model structures identified for both and are almost always of the type ARMA (1, 0), with the exception of Campania (ARMA(0, 1), for both the variables and ) and Emilia-Romagna, for which the best model for the variable is of the type ARMA(1, 1). The most suitable prefilter (equation (5)) has been always of the type d = 3 difference of the natural log of the variables of interest.

6. Empirical Evidences

At the national level (data have been plotted in Figure 4), the peak in the number of COVID-19 positives will be reached on April 2nd, with a number of predicted positive close to 77,000. The maximum forecasted value for the occupied ICU—expected for April 4th—will be 4280. These values have been calculated using an indirect methodology, i.e., by summing up the estimates obtained at a disaggregated level. The results related to COVID-19 positives and the ICU occupancy are reported, respectively, in Tables 1 and 2, where the bootstrap standard deviations of the quantities and —respectively denoted with the symbols and —are reported along with their confidence intervals, i.e., (lower) and (upper). In what follows, the main results reported in these tables are commented:(i)Lombardia—the most affected region—will reach the peak of positive cases (25963) and of the demand of ICUs (1425), respectively, on April 2nd and 4th.(ii)Emilia-Romagna is the second most affected region by COVID-19 but still shows a very high number of victims. The trend of infected people will reach its peak on April 5th, whereas the number of cases in intensive care will continue to grow at a progressively slower rate over the forecasting period.(iii)Veneto is the third region for number of deaths. Here, the number of positive cases, as well as the number of cases in ICU, will reach the peak on April 3rd.(iv)For Piedmont—the fourth region for number of victims—the predicted positive cases will reach the peak on March 29th (6635), whereas the persons in ICU will be 431 on March 31st, when the peak is predicted.(v)Liguria will begin a process of relative reduction of positive cases as early as March 29th. The number of cases in intensive care, after a period of stability (lasting until March 31st), will start a slow decreasing path.(vi)Positive cases in Trentino Alto Adige—which incorporates the cities of Trento and Bolzano—are projected to be 2158 on March 30th and then a decreasing trend is expected. The ICU beds occupied in this region will reach its peak on around April 3rd.(vii)The positive cases in Friuli Venezia Giulia show a relatively stable trend in the first half of the prediction interval with a peak around April 4th, after that the absolute number of cases will start decreasing. The number of cases in ICU will reach the peak between March 30th and April 1st.(viii)Valle d’Aosta is a small region which has been relatively less impacted by the virus. Here, a downward trend is expected to start on March 31st (for the positive cases) and around March 31st (cases in ICUs).(ix)The upward trend in the number of positive cases of Lazio is estimated to stop on March 31st and to reach the minimum at the end of the forecasting people (1821 cases). The number of ICU cases is estimated to reach its peak on the period 1–3 April.(x)The macroarea Center will reach its peak at the very beginning of the month of April (for the variable ), whereas for the variable the estimated peak day is around 31st March.(xi)Campania will reach the peak of contagions on April 5th, whereas ICU cases will do on the previous day.(xii)The remaining southern regions (Abruzzo, Molise, Puglia, Basilicata, Calabria, Sicily, and Sardinia) will show an upward trend in the number of future positive cases lasting until April 6th, where 6355 cases are predicted. The number of persons requiring an ICU will reach the peak on April 4th (348 is the estimated number of cases).

7. Conclusion

A forecasting method for two variables typically crucial during a pandemic—i.e., the number of positives and the future occupancy of ICU beds—has been proposed. The whole procedure has been designed to fulfill such a need-to-know goal using a minimal set of data; that is, the time series related to the positives and the ICU occupancy. This is a point of strength, as, especially in the initial stages of a pandemic, the available time series are limited to only basic variables (such as the ones considered in this paper) and are necessarily short, a fact that in general rules out multivariate approaches. In addition to that, this procedure uses two powerful tools, i.e., ARIMA models and the MEB resampling scheme, to generate estimates and confidence intervals of those quantities which are less affected by uncertainty components than it would be without using the bootstrap step. Finally, the procedure includes a filter of the type Kalman, which proved to be effective in correcting irregularities and anomalies (e.g., outliers) typically found in this type of data. At least two are the points of weakness of the proposed method: firstly, the assumption that both the time series of interest are both realizations of (unknown) data generating processes of the type ARIMA is arbitrary and entails the introduction of not negligible amount of uncertainty into the analysis (order selection uncertainty). Secondly, once the “best” model order is found, the inference process inevitably leads to the loss of precious degrees of freedom. Future research directions include exploring different prediction models (e.g., of the type exponential smoothing) and combining the predictions generated by them.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of the Italian National Institute of Statistics.

Conflicts of Interest

The author has no conflicts of interest to declare. In particular, he has no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.