This page briefly describes the Box-Jenkins time series approach and provides an annotated resource list.
Introduction to Time Series Data
A great deal of information relevant to public health professionals takes the form of time series. A time series is simply a sequence of observations measured at regular time intervals. For example, daily blood pressure measurements taken on a single individual are a time series, as are daily counts of emergency room visits for asthma.
Researchers might be interested in asking several different questions about time series data. These questions include:
Can patterns identified in past observations of a single time series be used to predict its future values?
How do the values of a single time series compare before and after an intervention?
Are the values of one time series associated with the values of another time series (e.g., daily particulate matter measurements and daily mortality)?
To answer these questions, we must first note that the structure of time series data presents a particular challenge: traditional regression approaches may not yield valid results. Uncorrelated residuals are a key assumption of many regression methods. However, in a single time series, observations that are close together in time tend to be more similar to each other than those that are farther apart in time, leading to correlated residuals. This phenomenon is called “autocorrelation.” Models that fail to account for autocorrelation will generally have unbiased parameter estimates but incorrect standard errors, so their confidence intervals and significance tests cannot be trusted.
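To make autocorrelation concrete, the following is a minimal sketch in plain Python (chosen here only for illustration; it is not part of the original SAS workflow) of the sample autocorrelation at a given lag. The function name and both example series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (lag >= 0)."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: products of deviations for observations `lag` steps apart.
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Hypothetical daily counts that drift upward: neighbouring values are similar.
trending = [10, 11, 12, 12, 13, 15, 16, 16, 18, 19, 21, 22]
# Hypothetical values that alternate high and low: neighbouring values are dissimilar.
alternating = [10, 20, 10, 20, 10, 20, 10, 20, 10, 20, 10, 20]

r1_trending = autocorrelation(trending, 1)        # positive lag-1 autocorrelation
r1_alternating = autocorrelation(alternating, 1)  # negative lag-1 autocorrelation
```

A lag-1 value well away from zero in either direction is the kind of dependence that ordinary regression ignores.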
Introduction to ARIMA Models
One type of model that does account for autocorrelation is the Autoregressive Integrated Moving Average (ARIMA) model, which is fit using a methodology developed by George Box and Gwilym Jenkins (1970). ARIMA models have varied applications in the health sector, but they have been used most extensively for (i) outbreak detection in infectious disease surveillance and (ii) the evaluation of population-level health interventions through interrupted time series analysis. Both applications require formally characterizing the inherent pattern in a time series and using this pattern to forecast the future behavior of the series. For outbreak detection, we forecast a 95% confidence interval (CI) for the time series, and actual values that fall outside the bounds of the 95% CI constitute a signal. In an interrupted time series analysis, the time series is forecast into the future, and deviations of the actual values from the forecast values are interpreted as the causal effect of the public health intervention.
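The signal rule for outbreak detection can be sketched in a few lines of Python. This is only an illustration: it assumes the forecasts and their standard errors have already been produced by a fitted model, and the function name and all numbers are hypothetical.

```python
Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def flag_signals(observed, forecast, std_error):
    """Return the indices where the observed value falls outside forecast +/- 1.96 * SE."""
    signals = []
    for i, (obs, fc, se) in enumerate(zip(observed, forecast, std_error)):
        lower, upper = fc - Z_95 * se, fc + Z_95 * se
        if obs < lower or obs > upper:
            signals.append(i)
    return signals

# Hypothetical weekly visit counts, model forecasts, and forecast standard errors.
observed = [52, 48, 55, 71, 50]
forecast = [50.0, 50.0, 51.0, 51.0, 52.0]
std_error = [5.0, 5.0, 5.0, 5.0, 5.0]

signal_weeks = flag_signals(observed, forecast, std_error)
```

Here only the fourth week (index 3) lies outside its interval and would be reviewed as a possible signal.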
ARIMA models do NOT predict rare “black swan” events, as there is no pattern in the time series to suggest a future event of this type.
The causal framework for ARIMA models differs slightly from the epidemiologic framework and is more consistent with the Granger definition of a cause from economics (Granger).
The data requirements to fit an ARIMA model are:
- A univariate time series (count or continuous) with at least 50-100 observations
- If the time series consists of count data, the interval over which the count is taken must remain the same over time
- If the time series consists of continuous data, the interval between measurements must remain the same over time
- Data must be presented in a vertical vector (column of data)
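These requirements can be checked before any modeling. The sketch below, in Python, tests series length and spacing; the threshold of 50 is the lower end of the guideline above, and the function name and example data are hypothetical.

```python
from datetime import date, timedelta

MIN_OBSERVATIONS = 50  # lower end of the 50-100 guideline

def check_series(dates, values):
    """Return a list of problems that would make a series unsuitable for ARIMA fitting."""
    problems = []
    if len(dates) != len(values):
        problems.append("dates and values differ in length")
    if len(values) < MIN_OBSERVATIONS:
        problems.append("fewer than %d observations" % MIN_OBSERVATIONS)
    # Every gap between consecutive time stamps must be identical.
    gaps = {later - earlier for earlier, later in zip(dates, dates[1:])}
    if len(gaps) > 1:
        problems.append("uneven spacing between observations")
    return problems

# Hypothetical weekly series of 60 observations.
start = date(2020, 1, 6)
weekly_dates = [start + timedelta(weeks=i) for i in range(60)]
weekly_values = [20 + (i % 4) for i in range(60)]
ok_problems = check_series(weekly_dates, weekly_values)

# Dropping one week leaves a two-week gap in the middle of the series.
gappy_problems = check_series(weekly_dates[:10] + weekly_dates[11:],
                              weekly_values[:10] + weekly_values[11:])
```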
Components and Fitting of ARIMA Models
The ARIMA model divides the pattern of a time series into three components: the autoregressive component, p, which describes how observations are related to each other as the result of being close together in time; the differencing component, d, which is used to make a time series stationary (see below); and the moving average component, q, which describes outside “shocks” to the system.
A key requirement of ARIMA models is that the data set of interest is stationary, meaning that it has a constant mean and variance over time. If a data set is not stationary to begin with, stationarity can be achieved by a process called “differencing,” which is represented by the “d” component of the model.
Model identification involves choosing the order of the autoregressive component (“p”), the order of the moving average component (“q”), and any differencing required to make the time series stationary or to remove seasonal effects (“d”). Together, these user-specified parameters are called the order of the ARIMA model. When the model is reported, its formal specification is written as ARIMA(p,d,q).
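For readers who prefer notation, one common way to write an ARIMA(p,d,q) model uses the backshift operator B, where B y_t = y_{t-1} and ε_t is a white noise error. The symbols below are standard textbook notation rather than anything defined on this page, and the sign convention for the moving average terms differs between textbooks and software packages.

```latex
\phi(B)\,(1 - B)^{d}\, y_t = c + \theta(B)\,\varepsilon_t,
\qquad
\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p},
\qquad
\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}
```

Here the φ terms are the p autoregressive coefficients, (1 - B)^d applies d differences, and the θ terms are the q moving average coefficients.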
The first step in model identification is to ensure the process is stationary. Stationarity can be checked with a Dickey-Fuller test. The null hypothesis of this test is that the process is non-stationary, so a non-significant result suggests non-stationarity. A non-stationary process must be converted to a stationary one before proceeding. This is accomplished by differencing the time series at one or more lags, including seasonal lags to remove any seasonality. The lags used to difference the time series constitute the “d” order.
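The logic of the test can be sketched in plain Python. This is a simplified illustration, assuming the basic (non-augmented) Dickey-Fuller regression with a constant and no trend; the critical value of -2.86 is the approximate large-sample 5% value for that case, and in practice statistical software reports the test and its p-value directly.

```python
import math
import random

def dickey_fuller_stat(series):
    """t-statistic for rho in the regression  diff(y)_t = alpha + rho * y_(t-1) + e_t."""
    x = series[:-1]                                   # lagged level y_(t-1)
    y = [b - a for a, b in zip(series, series[1:])]   # first difference of the series
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    rho = sxy / sxx
    alpha = y_mean - rho * x_mean
    rss = sum((yi - alpha - rho * xi) ** 2 for xi, yi in zip(x, y))
    se_rho = math.sqrt(rss / (n - 2) / sxx)
    return rho / se_rho

# Approximate large-sample 5% critical value (constant, no trend); an assumption of this sketch.
CRITICAL_5PCT = -2.86

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(200)]   # simulated stationary white noise
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)                          # simulated random walk (non-stationary)

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
# A statistic below the critical value rejects the null hypothesis of non-stationarity.
noise_looks_stationary = stat_noise < CRITICAL_5PCT
```

The test statistic must be compared with Dickey-Fuller critical values rather than the usual t distribution, which is why a dedicated constant is used.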
Ex. An additive difference of 1 and seasonal difference of 12 is reported as d=(1,12)
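A minimal Python sketch of what d=(1,12) does to a series follows. The helper name and the monthly series, built from a linear trend plus a fixed 12-month pattern, are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly series: a linear trend plus a repeating 12-month pattern.
seasonal_pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
monthly = [100 + 2 * t + seasonal_pattern[t % 12] for t in range(48)]

# d = (1, 12): a lag-1 difference followed by a lag-12 seasonal difference.
after_lag1 = difference(monthly, lag=1)             # removes the linear trend
after_lag1_and_12 = difference(after_lag1, lag=12)  # removes the 12-month pattern
```

Each difference shortens the series by the size of the lag (48 values become 47, then 35), which is one reason a reasonably long series is needed.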
Once the process is stationary, we fit the autoregressive and moving average components. To fit the model we use the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF) in addition to various model fitting tools provided by software. There are various sets of rules to guide the choice of p and q in lower order processes, but generally we let the statistical software fit up to 12-14 orders for AR and MA and suggest the combinations that minimize an information criterion such as the AIC or BIC. This part is as much an art form as it is a structured process.
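The sketch below illustrates the idea in plain Python for the autoregressive part only. It assumes Yule-Walker fits computed with the Durbin-Levinson recursion and an AIC of the form n·ln(variance) + 2p; moving average terms require likelihood-based fitting that is best left to statistical software. All function names and the simulated series are hypothetical.

```python
import math
import random

def autocovariances(series, max_lag):
    """Sample autocovariances at lags 0..max_lag (divisor n)."""
    n = len(series)
    mean = sum(series) / n
    return [sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(gamma):
    """Durbin-Levinson recursion.

    Returns (pacf, variances): pacf[k] is the partial autocorrelation at lag k and
    variances[k] is the innovation variance of the Yule-Walker AR(k) fit.
    """
    pacf, variances, phi = [1.0], [gamma[0]], []
    for k in range(1, len(gamma)):
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        reflection = acc / variances[-1]
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)] + [reflection]
        pacf.append(reflection)
        variances.append(variances[-1] * (1 - reflection ** 2))
    return pacf, variances

def aic_by_order(series, max_order):
    """AIC (up to an additive constant) for AR(0)..AR(max_order)."""
    n = len(series)
    _, variances = levinson_durbin(autocovariances(series, max_order))
    return [n * math.log(v) + 2 * p for p, v in enumerate(variances)]

# Simulated AR(1) series with coefficient 0.7 (hypothetical example data).
rng = random.Random(1)
series = [0.0]
for _ in range(499):
    series.append(0.7 * series[-1] + rng.gauss(0, 1))

pacf, _ = levinson_durbin(autocovariances(series, 12))
aic = aic_by_order(series, 12)
best_order = min(range(len(aic)), key=aic.__getitem__)
```

For an autoregressive process the PACF is expected to be large up to lag p and near zero afterwards, which is the pattern analysts look for when reading the plot.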
The estimation procedure involves using the model with the chosen p, d and q orders to fit the actual time series. We allow the software to fit the historical time series, while the user checks that no significant signal remains in the errors, using an ACF of the residuals, and that the estimated parameters for the autoregressive or moving average components are significant.
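One common numerical companion to the residual ACF is the Ljung-Box Q statistic, sketched below in plain Python. The sketch compares Q with the 5% critical value of a chi-square distribution with 10 degrees of freedom; for residuals from a fitted ARIMA model the degrees of freedom are usually reduced by p + q, which this illustration does not do. The residual series are simulated and hypothetical.

```python
import random

def ljung_box(residuals, max_lag):
    """Ljung-Box Q statistic computed over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum((residuals[t] - mean) * (residuals[t - k] - mean)
                  for t in range(k, n)) / denom
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

# 5% critical value of a chi-square distribution with 10 degrees of freedom.
CHI2_10_5PCT = 18.307

rng = random.Random(2)
clean = [rng.gauss(0, 1) for _ in range(300)]   # residuals with no structure left
leftover = [clean[0]]
for shock in clean[1:]:
    leftover.append(0.6 * leftover[-1] + shock)  # residuals that are still autocorrelated

q_clean = ljung_box(clean, 10)
q_leftover = ljung_box(leftover, 10)
needs_more_terms = q_leftover > CHI2_10_5PCT     # structure remains, so revisit p and q
```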
Once the series is stationary and the model is fitted such that no information remains in the residuals, we can proceed to forecasting. Forecasting assesses the performance of the model against real data. One option is to split the time series into two parts, using the first part to fit the model and the second part to check model performance. The utility of a specific model, or of several classes of models, for fitting actual data can usually be assessed by minimizing an error measure such as the root mean square error (RMSE).
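The split-and-score step can be sketched in Python as follows. The two "forecasts" are deliberately simple placeholders (the training mean and the last training value) standing in for forecasts from candidate models; the function names and data are hypothetical.

```python
import math

def split_series(series, n_test):
    """Hold out the last n_test observations for checking forecast performance."""
    return series[:-n_test], series[-n_test:]

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical series of ten observations; the last two are held out.
series = [12, 14, 13, 15, 16, 15, 17, 18, 17, 19]
train, test = split_series(series, n_test=2)

# Placeholder forecasts standing in for two candidate models.
mean_forecast = [sum(train) / len(train)] * len(test)
last_value_forecast = [train[-1]] * len(test)

rmse_mean = rmse(test, mean_forecast)
rmse_last = rmse(test, last_value_forecast)
preferred = "last value" if rmse_last < rmse_mean else "mean"
```

The candidate with the smaller hold-out RMSE is preferred; the same comparison applies when the candidates are ARIMA models of different orders.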
Sample Code for ARIMA Model Fitting in SAS
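/* Invoke PROC ARIMA on the input data set combined.arima */
proc arima data=combined.arima;
/* 'var=value(2)' differences the variable 'value' at lag 2 */
/* 'esacf' requests the extended sample autocorrelation function, one of the model identification tools SAS provides */
/* 'STATIONARITY=(ADF=(0))' tests the process for stationarity using the Dickey-Fuller test (0 augmenting lags) */
identify var=value(2) esacf STATIONARITY=(ADF=(0));
/* 'p=(8)' specifies an autoregressive term at lag 8 only; 'p=8' without parentheses would include lags 1 through 8 */
/* 'q=(13)' specifies a moving average term at lag 13 only */
/* 'method=ml' estimates the model parameters by maximum likelihood */
estimate p=(8) q=(13) method=ml;
/* 'back=1' starts the forecasts one observation before the end of the series */
/* 'id=' and 'interval=' identify the weekly time variable; 'out=b' writes the forecasts to data set b */
forecast back=1 id=wkdayf interval=week out=b;
run;
quit;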
Textbooks & Chapters
Box, George, Gwilym M. Jenkins, and Gregory Reinsel, eds. Time Series Analysis: Forecasting and Control. Hoboken: John Wiley & Sons, Inc., 2008.
The classic textbook on the Box-Jenkins methodology for fitting time series models.
Cryer, Jonathan D. and Chan, Kung-Sik. Time Series Analysis: with Applications in R (Springer Texts in Statistics). New York: Springer, 2008.
This textbook covers ARIMA model building in detail, and includes example applications in R. The 2nd edition is available as an e-book through the CUMC library.
Shumway, Robert H. and Stoffer, David S. Time Series Analysis and its Applications: With Examples in R (Springer Texts in Statistics). New York: Springer, 2011.
This text covers time series analysis from a variety of perspectives, including ARIMA models and spectral analysis. It is available as an e-book through the CUMC library.
Yaffee, Robert A. An Introduction to Time Series Analysis and Forecasting: with Applications of SAS and SPSS. Ch. 3: Introduction to Box-Jenkins Time Series Analysis.
This book chapter contains a review of how to check the stationarity assumption using SAS.
A good introduction to the unique challenges of working with time series data and the components of ARIMA models, targeted at the public health community. Also includes an example of an ARIMA model used to evaluate an interrupted time series.
Another good introduction to ARIMA models, also with a focus on interrupted time series.
This review article discusses time series models for public health data broadly, including both ARIMA models and distributed lag models.
This article is an example of the interrupted time series design. The effect of a November 1993 rise in the speed limit on road deaths is assessed by fitting an ARIMA model to a time series of road deaths.
This article describes how ARIMA models were fit to the Alnus pollen season and were used to predict future pollen concentrations.
This article describes how ARIMA models were fit to emergency department visit data and were used to forecast future values so that unusual events could be detected.
This article discusses ARIMA model development for two time series: bacterial resistance and antimicrobial use, and introduces the concept of transfer function models, which are an extension of ARIMA models used to relate one time series to another.
SAS support for the ARIMA procedure
Example SAS code for fitting a Seasonal ARIMA model
Documentation for the “arima” package in R
Guidance for building ARIMA models in Stata
Rob Hyndman’s Time Series Data Library
Freely available time series data sets
Duke Web page for good ARIMA modeling approaches
Quick-R: Time Series and Forecasting
This useful resource provides information on the most common functions for time series analysis in R, and contains links to several additional resources at the bottom of the page.
A little book of R for Time Series Analysis
Available on the web and as a PDF booklet, this resource by Avril Coghlan is both an introduction to the R language and an introduction to analyzing time series using R.