Hello, everyone, and welcome to our lecture on solving a real-world problem using linear regression, Part 1. In this video, I'm going to use a car dataset to build a regression model that will allow me to predict the selling price of a car. I'll load my dataset into a pandas DataFrame and look at the first five rows, so we've got several columns: year, selling price, and so on.

The year column is the year each car was built, so I'm going to create a new column called age. The age of the car will be the current year minus the year the car was built. To get the current year, I'm going to import datetime and get the current date and time. x equals datetime.datetime.now(), so this will return today's year, month, day, hour, second, millisecond, and so on. I'm only interested in the year part, so I'll just say the year is equal to x.year. This returns 2022. Now I define my age column as today's year minus the year the car was built. If you look at it, there is a new column here called age.

Right now the year column is no longer useful, so I'm going to drop it. I'm also going to drop car name and this column, seller type. I drop those, look at the data again, and they are gone.

The next thing I want to look at is the info. When I look at the info of the data, I notice that I have a couple of categorical columns, fuel type and transmission. I'm going to change them into dummy variables using this code. If you look at the data again, it split the fuel type into two types of fuel, and the transmission is either manual or automatic. Now I want to check if there is any missing data. The summary shows that nothing is missing, so this data is perfectly clean.

The next thing I want to do is some visualization. Here I'm doing a scatter plot of the selling price versus the kilometers driven, which is the mileage.
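The preparation steps described so far (deriving the age, dropping unused columns, creating dummy variables, and checking for missing values) can be sketched roughly as follows. This is only an illustration under stated assumptions: the rows are made up, the column names (`Car_Name`, `Year`, `Selling_Price`, `Kms_Driven`, `Fuel_Type`, `Seller_Type`, `Transmission`) are hypothetical stand-ins for the real dataset, and `drop_first=True` is a guess at how the dummy variables were created in the video.

```python
import datetime

import pandas as pd

# Made-up rows standing in for the car dataset; column names are assumptions.
cars = pd.DataFrame({
    "Car_Name": ["alpha", "beta", "gamma", "delta", "epsilon"],
    "Year": [2014, 2017, 2011, 2019, 2015],
    "Selling_Price": [3.4, 7.3, 2.9, 9.3, 4.1],
    "Kms_Driven": [27000, 6900, 52000, 4100, 33000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "CNG"],
    "Seller_Type": ["Dealer", "Dealer", "Individual", "Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic", "Manual"],
})

# Take the current year from the system clock.
x = datetime.datetime.now()
current_year = x.year

# Age of the car = current year minus the year it was built.
cars["Age"] = current_year - cars["Year"]

# Drop the columns that are no longer needed.
cars = cars.drop(columns=["Year", "Car_Name", "Seller_Type"])

# Turn the categorical columns into dummy (indicator) variables.
# drop_first=True (an assumption) removes one level per category
# to avoid a redundant column.
cars = pd.get_dummies(cars, columns=["Fuel_Type", "Transmission"], drop_first=True)

# Count missing values per column; all zeros means the data is clean.
missing_per_column = cars.isnull().sum()

print(cars.head())
print(missing_per_column)
```

Because the age is computed from the system clock, the `Age` values change depending on the year in which the code is run.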
Think of kilometers driven as the mileage, so this is selling price versus mileage. When I run it, it gives us this plot. There is a correlation between the selling price and the mileage: the more mileage a car has, the cheaper it is. Now I'm doing a box plot of the age of the car versus the selling price, and the hue is the manual-transmission indicator. The yellow boxes are the manual transmissions. As you can see, the manual-transmission cars are cheaper, so manual transmission is negatively correlated with the selling price. The age of the car is also negatively correlated with the selling price: the older the car is, the cheaper it becomes.

Now we can choose our dependent variable and independent variables. The dependent variable is the selling price, and the independent variables are everything except the selling price. Now it's time to split our dependent and independent variables into a training set and a test set. Here we set the test set to be 20 percent of the data. Once we have split the data, we can import our linear regression model and train it on X train and y train.

Now we are going to use that model, lm, to predict on X test and then check the performance of the model by printing the R2 score. The R2 score comes out at 76 percent. Usually, you want this to be close to 100 percent. I multiplied by 100 here to show it as a percentage; as a fraction it is 0.76, and you want it to be close to one, something like 0.90. Then, if you look at the coefficients here, my fuel type is positively correlated with the selling price, but the mileage is negatively correlated with it.

We can print all the errors here. The R2 score is 76 percent, the mean absolute error is shown here, the mean squared error is six point something, and the root mean squared error is two point something. If you plot our predictions against the true values, you can see that we miss a few of them. Our score is not perfect. We can make this better.
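The modeling steps described above (choosing the variables, splitting 80/20, fitting the model, and printing the metrics) can be sketched as follows. This is a minimal illustration, not the code from the video: the data is randomly generated, the column names and `random_state` are assumptions, and the numbers it prints will not match the 76 percent reported in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned car data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({
    "Kms_Driven": rng.uniform(1_000, 100_000, n),
    "Age": rng.integers(1, 15, n),
    "Transmission_Manual": rng.integers(0, 2, n),
})
cars["Selling_Price"] = (
    12.0
    - 0.00004 * cars["Kms_Driven"]
    - 0.5 * cars["Age"]
    - 1.5 * cars["Transmission_Manual"]
    + rng.normal(0, 1.0, n)
)

# Dependent variable: selling price. Independent variables: everything else.
y = cars["Selling_Price"]
X = cars.drop(columns=["Selling_Price"])

# Hold out 20 percent of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the linear regression on the training set and predict on the test set.
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# Performance metrics on the test set.
r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)

print(f"R2 score: {r2 * 100:.1f}%")
print(f"Mean absolute error: {mae:.3f}")
print(f"Mean squared error: {mse:.3f}")
print(f"Root mean squared error: {rmse:.3f}")

# One coefficient per feature; the sign shows the direction of the relationship.
coefficients = pd.Series(lm.coef_, index=X.columns)
print(coefficients)
```

The root mean squared error is just the square root of the mean squared error, which is why a mean squared error of six point something corresponds to a root mean squared error of two point something.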
I will see you in the next video, where we're going to improve this model to get a better R2 score. In this video, we learned how to convert [inaudible] into a pandas DataFrame, look at some more information about the data, change some categorical data into dummy variables, and do some visualization. We split our data into training and test sets, ran the model, and investigated the R2 score. Thanks, everybody. I will see you in the next video.