Welcome to Simple Linear Regression. In this video, you'll learn how to apply the simple linear regression model to understand the relationship between variables. You'll then calculate a prediction based on the fitted model.

Simple linear regression uses one independent variable to make a prediction. Multiple linear regression uses multiple independent variables to make a prediction. In this module, we'll focus on simple linear regression.

Simple linear regression (or SLR) is a method for understanding the relationship between two variables: the predictor (or independent) variable x, and the target (or dependent) variable y. Assume you want to define a linear relationship between these variables: y equals b zero plus b one times x. The parameter b zero is the intercept and the parameter b one is the slope. When you fit or train the model, you estimate these parameters, b zero and b one. This step requires a lot of math, so we will not calculate it manually. The computer will do the job for us.

Let's clarify the prediction step. It's hard to determine the length of a flight's arrival delay, but you can use the departure delay information to get some idea. If you assume that there is a linear relationship between these variables, you can use this relationship to formulate a model to predict the arrival delay of a flight. If the departure delay of a flight is 20 minutes, you can input this value into the model to obtain a predicted arrival delay of approximately 32 minutes.

There are four assumptions associated with a linear regression model. Linearity: The relationship between X and the mean of Y is linear. Independence: Observations are independent of each other. Homoscedasticity: The variance of the residuals is the same for any value of X. And normality: For any fixed value of X, Y is normally distributed.

For example, assume you want to build a linear regression model of departure delay minutes and arrival delay minutes. Start by checking through the four criteria in the list.
First, as the departure delay gets longer, the average arrival delay gets longer. Second, the variance of the error term is the same across all values of departure delay minutes. (You can understand the error term to be the difference between the predicted value and the actual value of the arrival delay.) Third, the observations of flight delay are independent of each other. And fourth, for the same value of departure delay, arrival delay is normally distributed. In other words, variable X (departure delay minutes) and variable Y (arrival delay minutes) satisfy the four assumptions of linear regression. You will fit a linear regression between these two variables in the following slides.

To determine the line, you take data points from the data set, marked in blue. You then use these training points to fit the model; the result of training is the parameters. You usually store the data points in a data frame as numeric values. The value you would like to predict is called the target. You'll store this value in the column "ArrDelayMinutes" or array Y, and you'll store the predictor (independent) variable in the column "DepDelayMinutes" or array X. Each sample corresponds to a different row in the data frame or array.

In many cases, many factors influence the length of a flight delay, for example, the day of the week, or the general cause of the delay, such as weather, security, or carrier-related issues. In this model, this uncertainty is taken into account by assuming a small random value is added to the point on the line; this is called noise. The figure on the left shows the distribution of the noise. The vertical axis shows the value added and the horizontal axis illustrates the probability that the value will be added. Usually, a small positive or small negative value is added. Sometimes large values are added, but for the most part, the values added are near zero.

Let's summarize the process like this: You have a set of training points.
You use these training points to fit or train the model and get the parameters. You then use these parameters in the model. You now have a model; the hat character on the y denotes that the model is an estimate. You can use this model to predict values that you haven't seen. For example, even if you have no Alaska flights with a 60-minute departure delay, you can use your model to make a prediction for the arrival delay. But don't forget that your model is not always correct. You can see this by comparing a predicted value to the actual value. Suppose instead that you do have a sample for a 60-minute departure delay of an Alaska flight, and the predicted value does not match the actual value. If the linear assumption is correct, this error is due to the noise, but there can be other reasons.

To fit the model in R, you'll first define the predictor variable and target variable and then follow several steps. For this example, start by creating a subset of the original dataset that contains only Alaska flights, and name it "aa_delays". Then, fit the predictor and target values to a linear regression model using the lm() function. Finally, use the summary() function to view the result of fitting the linear model. The summary() function produces summaries of the results of various model-fitting functions. This example shows you information about residuals, coefficients, the statistical significance of the model, and more.

In reviewing the summary table, you get the values of b zero and b one. The relationship between departure delay and arrival delay is given by the equation in bold: "Arrival Delay Minutes equals 17.35 plus 0.7523 times Departure Delay Minutes", which has the same form as the equation you saw earlier in the video.

You can obtain a prediction using the predict() function. But first, you need to create new data for the prediction. For this example, the name of the unseen data is "new_depdelay".
It is a data frame containing only one column, "DepDelayMinutes", with three observations: 12, 19, and 24. These numbers were chosen arbitrarily and could be any set of numbers. You can now use the predict() function in R, passing to it the linear regression model that you fitted previously, the new data, and the optional argument interval. This example specifies a confidence interval. When you print the "pred" object, you can see that there are three columns: "fit", "lwr", and "upr". The "fit" column contains the predictions for the inputs. The "lwr" and "upr" columns are the lower and upper bounds of the 95% confidence intervals of the predictions. The confidence interval reflects the uncertainty around the mean predictions.

In this video, you learned how to fit a linear regression model based on a predictor (independent) variable and a target (dependent) variable, and then calculate a prediction using the predict() function.
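
The fit, summarize, and predict steps described in this video can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the course's own code: the course dataset is not reproduced here, so "aa_delays" is filled with synthetic data generated around the line from the video (intercept 17.35, slope 0.7523) plus random noise. The noise level, the sample size, and the name "linear_model" are hypothetical choices made only for illustration, so the fitted coefficients will differ somewhat from the video's.

```r
# Synthetic stand-in for the Alaska flights subset (an assumption, NOT the course data).
set.seed(42)
n <- 200
dep <- runif(n, min = 0, max = 120)  # hypothetical departure delays in minutes
aa_delays <- data.frame(
  DepDelayMinutes = dep,
  # Point on the line plus normally distributed noise (sd = 10 is an arbitrary choice).
  ArrDelayMinutes = 17.35 + 0.7523 * dep + rnorm(n, mean = 0, sd = 10)
)

# Fit the simple linear regression: target ~ predictor.
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = aa_delays)

# Residuals, coefficients (b zero = intercept, b one = slope), significance, and more.
print(summary(linear_model))

# New, unseen departure delays to predict for.
new_depdelay <- data.frame(DepDelayMinutes = c(12, 19, 24))

# Predictions with 95% confidence intervals (columns: fit, lwr, upr).
pred <- predict(linear_model, newdata = new_depdelay, interval = "confidence")
print(pred)
```

The column name in "new_depdelay" has to match the predictor name used in the lm() formula, because predict() looks up the predictor by name in the new data.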