In this section we will look at how a financial time series is modeled statistically. Our focus will be on autoregressive integrated moving average models, or ARIMA models. These models describe how each successive observation in a series is related to previous observations, which is quite different from regression, where you relate a dependent variable to independent variables or factors. First we will look at how ARIMA models are used to forecast time series, and how this method compares to linear regression. In this section you will learn how ARIMA differs from linear regression, and how you can apply a variety of models to a single data series. You will also learn how to choose parameters and features for your ARIMA model, and what the consequences of those choices are.

Earlier we introduced stationarity. Let's review: a time series is a random process. It can be summarized by a measure of central tendency; typically we use the mean. It can also be summarized by a measure of how uncertain we are about the location of that center; typically we use the standard deviation. The mean and the standard deviation summarize data very nicely. For example, if data is normally distributed, then the mean and standard deviation summarize everything there is to know about the data. The problem we have with financial data is that the mean may change over time, or the standard deviation may change over time. Either of these violations means one thing: the series is not stationary. For example, this graph is upward trending and has an increasing mean, so it is not stationary. Previously you learned that you could take the first difference of a series, and this new differenced series may be stationary.

Let's begin by distinguishing ARIMA from linear regression. Before this course, you have likely done several types of linear regression. In linear regression you regress a response variable on one or more independent variables. For example, you might have done a regression in your statistics class where you looked at children's heights and weights. One observation contains one child's height and weight. Note that the order of the observations is not important. In the end you get a coefficient for each independent variable. In the height versus weight example, you get one slope coefficient, which is the rate of increase of height per unit weight, plus an intercept term.

Let's compare that to ARIMA. Recall that ARIMA is an acronym for autoregressive integrated moving average. Let's break down the first part. The autoregressive term, or AR, tells you that this is indeed a regression, but the A means auto: the series is regressed on itself. Is that cheating? No, because the observations occur over time. You would like to know if, by using past observations, you can somehow predict what the future values will be. We could not do this for a linear regression because the observations do not have any time sequence associated with them; you can think of them as all occurring at the same time. But you can do this for a time series model by allowing some number of lags of a variable to help you predict what the next value will be. For example, if you run an AR(1) model, then you will use the one-lagged, or immediately prior, value to predict the next. The MA part tells you there is a moving average component, as you saw in a previous section. These are the unobserved error terms.

Let's talk about what both models have in common. First, both models require the data to be stationary.
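To make this concrete, here is a minimal sketch in Python, using a hypothetical random-walk price series (not data from this course) and the statsmodels library, of how you might check a series for stationarity with an augmented Dickey-Fuller test and then take a first difference:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical upward-trending "price" series: a random walk with drift.
np.random.seed(0)
prices = pd.Series(100 + np.cumsum(0.1 + np.random.normal(size=500)))

# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary.
print("ADF p-value, levels:", adfuller(prices)[1])             # typically large -> non-stationary

# Take the first difference and test again.
changes = prices.diff().dropna()
print("ADF p-value, first difference:", adfuller(changes)[1])  # typically small -> stationary
```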
In linear regression you have to use stationary variables. If the variables are non-stationary, then you violate the assumptions of the linear regression model, and you would need to look for related variables that preserve the stationarity condition. In ARIMA modeling, you also have to use stationary variables. If the variables are non-stationary, then you have to take differences until the data is stationary. Now you see why stationarity is so important. Second, both models are linear, as opposed to having exponents or non-linear terms. This makes the results more intuitive, and the estimation easier computationally. Third, both models want a high correlation between the response variable and the predictor variables. Consider first the case of simple linear regression, where you have only one predictor variable; you can compute the correlation coefficient between the response and the predictor. In multiple linear regression, you want to compute the correlation of each pair, where a pair consists of the response variable and one predictor variable. In ARIMA models you use the autocorrelation graph to detect where there are high correlations. Fourth, both use similar methods of estimating the coefficients. The slope and the intercept constitute the coefficients. Finally, both use similar statistical tests to evaluate the quality of the fit.

Now let's move on to what the models do differently. First, there is no natural ordering to the observations in a linear regression. The data may be observational in nature, or experimental in nature, but it is not a time series. In ARIMA models the variables we use are time series; associated with each data point is an observation time. These times have a natural ordering. They may be evenly spaced, such as every day or every month, or they may be unevenly spaced, such as every time a trade occurs. At some parts of the trading day there is a lot more trading volume than at other parts, so the data frequency varies. Note that some people mistakenly put time series into linear regressions; they should really be running time series models instead. There are other time series models besides ARIMA. Second, linear regression uses two different variables, one for the response and one for the predictor. ARIMA models can use a single variable. Remember, this is a time series, so out of one variable come dozens of others, simply by lagging the series one time point, two time points, three time points, and so on. Third, linear regression prioritises the variables. One of the variables serves as the dependent variable; we call this the response variable, or simply y. It depends on an independent variable; we call this the predictor variable, or simply x. If we are running a controlled experiment, then the ordering is clear. However, if we have economic or financial data, we are not running a controlled experiment, and either variable could serve as the response. ARIMA models sidestep this issue entirely, because we simply model how the past values of the series help to predict its future values.
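As a rough sketch of that idea, and again assuming the hypothetical random-walk price series from the earlier example, you could fit an ARIMA model with one autoregressive lag on the first-differenced series using statsmodels, then produce a one-step-ahead forecast built only from the series' own past values:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Same hypothetical random-walk price series as in the earlier sketch.
np.random.seed(0)
prices = pd.Series(100 + np.cumsum(0.1 + np.random.normal(size=500)))

# order=(1, 1, 0): difference once, then regress each change on the prior change
# (an AR(1) on the differences), with no moving average terms.
result = ARIMA(prices, order=(1, 1, 0)).fit()
print(result.params)             # estimated AR coefficient and innovation variance

# One-step-ahead forecast: only past values of the series itself are used.
print(result.forecast(steps=1))
```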