Hello and welcome to this second video on primary quantitative data analysis. In this video, we will look at more in-depth data inspection methods that will guide the type of analysis we perform on our data. In the previous video, we learned about data inspection using the analogy of preparing a vehicle for a race. In this video, we will look into a more detailed inspection, which also relates to the way you will eventually analyze your data. Note that data preparation is also applicable to data collected from other sources. Now, having created a database and performed the initial data inspection steps, including combining the necessary indicators into variables, you can check the distribution of the data in terms of location, spread, and shape; in other words, descriptive analysis. Let me elaborate on this further. Measures of location include the mean, mode, and median. The mean, or average, is obtained by adding the data points together and dividing the sum by the number of data points in a variable. For example, we can obtain the average of the age variable in our data by summing up all the values of age and dividing by the number of respondents. The median is the number in the middle when all the data points are arranged in ascending order. And the mode is the most frequently occurring value in the list of data points in the variable. The mean and the median are usually used to show the central location of your data. This means that the value of the mean informs you where your data points tend to cluster. The mean takes every data point into account, but for that very reason it is sensitive to outliers; the median, by contrast, is not affected by outliers, so it can be the better measure of central location when the data contain extreme values. Note that the value of the mean only makes sense for continuous variables, not for categorical variables. The other type of data distribution is measures of spread, which include the standard deviation, variance, and range.
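As a quick illustration, here is a minimal Python sketch of the measures of location just described. The list of ages is hypothetical, made up purely for illustration:

```python
from statistics import mean, median, mode

# Hypothetical ages of ten respondents (illustrative data only)
ages = [21, 23, 23, 25, 26, 28, 30, 34, 41, 59]

print(mean(ages))    # sum of the values divided by the number of respondents
print(median(ages))  # middle value once the ages are sorted
print(mode(ages))    # most frequently occurring age
```

Notice how the single outlier (59) pulls the mean (31) above the median (27), while the median is unaffected by it.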
The standard deviation is the extent to which the data points tend to deviate from the mean, and it is calculated from the mean. The distance of each data point from the mean is squared, summed, and then averaged to find the variance; the standard deviation is then the square root of the variance. In other words, the variance is the average squared deviation of a variable from its mean, and the standard deviation expresses that same variation in the original units of the data. If the data points are farther from the mean, there is high deviation within the data set. Thirdly, the range is the difference between the minimum and the maximum values of the data points in the variable. To learn more about the measures of spread, you may refer to online material, where you will find many worked examples with calculations and detailed formulas. Lastly, the measure of shape commonly refers to skewness. Skewness is a measure of the asymmetry of the distribution of a variable around its mean; a variable may be positively skewed or negatively skewed. The measures described above are generally referred to as descriptive statistics. The tests performed on the data to find out about the relationships between variables are generally referred to as inferential statistics, and we will explore them further in the next video. But before we do that, let us look into other preparations of the data. Usually, before carrying out inferential statistics, there are a few assumptions that you test based on the distribution and the measurement level of your data. We will look into the most common assumption in this video and then discuss some more assumptions in the next video, including what to do when the data fulfill an assumption, or when an assumption fails. There are various assumptions made before data analysis, but the most common one is that your data are normally distributed.
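The measures of spread and shape can also be sketched in a few lines of Python. This sketch uses the same hypothetical ages as before and the population formulas (dividing by n), matching the "averaged" description above; note that statistical software often uses the sample formulas (dividing by n minus 1) instead:

```python
# Hypothetical ages of ten respondents (illustrative data only)
ages = [21, 23, 23, 25, 26, 28, 30, 34, 41, 59]

n = len(ages)
mu = sum(ages) / n

# Variance: the average of the squared deviations from the mean
variance = sum((x - mu) ** 2 for x in ages) / n

# Standard deviation: the square root of the variance
std_dev = variance ** 0.5

# Range: the difference between the maximum and the minimum value
data_range = max(ages) - min(ages)

# Skewness (shape): the average cubed deviation divided by the cubed
# standard deviation; it comes out positive here because of the long
# right tail created by the outlying value 59
skewness = sum((x - mu) ** 3 for x in ages) / n / std_dev ** 3
```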
Normal distribution means that the distribution of the data generally follows a bell-shaped curve and the data tend to cluster around the mean. To check for normality, we run a statistical test. In this case, we would have a null hypothesis and an alternative hypothesis. A null hypothesis is the statement being tested in a statistical test of significance. Let me elaborate further on what I mean by significance levels. Usually in statistics, we refer to confidence levels, or significance, to determine whether the null hypothesis is rejected or not. At a 95 percent confidence level, there is a 95 percent probability that the value of the parameter lies within the shaded range; anything beyond the shaded range is considered statistically significant. The significance level is five percent, or 0.05, and it is derived from summing up the two tails, as shown here; that is, the lower tail and the upper tail. Closely related is the p-value, which measures how compatible your data are with the null hypothesis. A p-value above 0.05 means that we fail to reject (informally, "accept") the null hypothesis, and a p-value below 0.05 means that we reject the null hypothesis in favor of its alternative. In other words, the significance level sets how much evidence we require before concluding that an observed relationship between variables is unlikely to be due to chance. In performing the test for normality, we use a statistical test called the Shapiro-Wilk test to determine whether we reject the null hypothesis or not. For the test of normality, the null hypothesis is that the data are normally distributed, and the alternative hypothesis is that the data are not normally distributed. Now bear in mind that we would run a test of normality on a dependent variable that consists of continuous data, meaning data measured at either the interval or the ratio level. Categorical data are non-normal by nature, and therefore a test of normality would not be applicable.
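As a rough sketch of how such a normality check might look in practice, here is the Shapiro-Wilk test run with scipy on a simulated sample. The sample is drawn from a normal distribution purely for illustration; with real data you would pass in your continuous dependent variable instead:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Hypothetical continuous dependent variable (e.g. age), simulated
# from a normal distribution purely for illustration
sample = rng.normal(loc=35, scale=10, size=200)

stat, p_value = shapiro(sample)

if p_value > 0.05:
    print("Fail to reject H0: the data look normally distributed")
else:
    print("Reject H0: the data do not look normally distributed")
```

The decision rule mirrors the transcript: a p-value above 0.05 means we fail to reject the null hypothesis of normality, and one below 0.05 means we reject it.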
In the example shown here, the normality test shows a significant outcome because the p-value is below 0.05. This means that we reject the null hypothesis that the data are normally distributed and accept its alternative: the data are not normally distributed. If the null hypothesis is rejected, meaning that the variable is not normally distributed, then one common remedy is to create a logarithm of the variable. A detailed explanation of how the log transformation works belongs to a more advanced course, but generally speaking, the log transformation is arguably the most popular of the transformations used to make skewed data approximately conform to normality. There are other assumptions that apply to various inferential tests, and we will look into them as we learn more about the various tests you can perform on the data. Remember that descriptive statistics give you an idea of the distribution of the data, and that significance levels determine whether the null hypothesis is rejected or not. Now, with reference to the inspection of the car, that is, your data, I believe you have an idea of what to inspect and how. You may now get ready for the race of data analysis, which we will look at in the next video. Thank you for watching and stay tuned.
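To give a feel for what the log transformation does, here is a small sketch on a simulated, positively skewed variable. The data are generated purely for illustration (a lognormal sample, whose logarithm is normal by construction), and the skewness helper uses the simple population formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positively skewed variable (e.g. income), simulated
# purely for illustration; values are strictly positive
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=500)

# Log transformation; valid only for strictly positive values
log_transformed = np.log(skewed)

def sample_skew(x):
    # Average cubed deviation divided by the cubed standard deviation
    mu, sd = x.mean(), x.std()
    return ((x - mu) ** 3).mean() / sd ** 3

print(sample_skew(skewed))           # strongly positive before transforming
print(sample_skew(log_transformed))  # close to zero afterwards
```

After the transformation, the distribution is far more symmetric, which is exactly why the log transformation is so often used on right-skewed variables before running tests that assume normality.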