Welcome! In this lecture, you will learn more about parameter estimation. We discuss estimators and their properties. In the building blocks of probability theory, we discussed distributions of random variables. We assumed that the parameters of these distributions, for example, the mean mu and the variance sigma-squared of the normal distribution, are known. In practice, that is usually not the case. For example, we do not know what the mean and standard deviation of economic growth or returns on the stock market are. What we can do is try to infer these parameters from observations. Suppose you have 26 observations of the yearly return on the stock market. We call a set of observations a sample. On the slides, you see a histogram of the sample. The returns in percentages are on the x-axis and the y-axis gives the frequency. The sample mean equals 9.6%. What can we learn from this sample mean about the mean of the return distribution over longer periods of time? Can we be sure that the true mean is larger than zero? What about the standard deviation? What happens when the sample becomes larger? Statistics formalizes this analysis. The starting point for the statistical theory that we need in this MOOC is the assumption that the observations come from a given distribution and are independently and identically distributed, so they are IID. We can, for example, assume that each return y_i comes from a normal distribution with unknown mean mu and variance sigma-squared. We use the vector y to refer to the whole sample. A statistic is a function of the set of observations, so a function g of the vector y. An estimator is a statistic from which we can learn about the parameters. The sample mean is an estimator for the mean mu. The sample mean is a function of the random variables y_i, which means that m is a random variable itself. In general, we use the Greek letter theta for the parameter of interest and theta-hat for an estimator of theta.
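The setup so far can be sketched in a few lines of Python. This is not part of the lecture: the seed and the "true" parameter values are illustrative assumptions, not the actual stock-return data.

```python
import numpy as np

# Sketch: draw an IID sample from a normal distribution with mean mu and
# standard deviation sigma, then compute the sample mean as an estimate
# of mu. In practice mu and sigma are unknown; here we assume values
# (in percent) purely for illustration.
rng = np.random.default_rng(seed=0)
mu, sigma, n = 9.6, 17.9, 26           # hypothetical "true" parameters

y = rng.normal(mu, sigma, size=n)      # the sample: one return per year
m = y.mean()                           # the sample mean, an estimate of mu

print(f"sample mean m = {m:.2f}%")
```

Running the sketch with a different seed gives a different estimate, which is exactly the point: the estimator is itself a random variable.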
An estimate is a numerical value for an estimator. A different set of observations will lead to another value for m, so every statistic is a random variable. To study the properties of the estimator m in relation to y_i, we again use vector notation with iota, a vector of ones. Now a question for you. The answers follow from building blocks P1 and P2, because each y_i is NID with mean mu and variance sigma-squared. The distribution of the vector y is multivariate normal, with mean mu times iota and variance sigma-squared times the identity matrix, the matrix with ones on the diagonal and zeros elsewhere. Because the sample mean is a weighted sum of normals, it follows a normal distribution. To derive the mean and variance, we use the rules for linear transformations of random variables. The inner product iota-prime times iota is equal to n, so m has expectation mu and variance sigma-squared over n. A larger sample size will lead to a lower estimator variance. An estimator theta-hat whose expectation is equal to the true value theta is called unbiased. An estimator that does not have this property is called biased. The bias is equal to the expectation of theta-hat minus the true value theta. Because the expectation of m equals mu, the sample mean is an unbiased estimator of the true mean mu. This means that, on average, we expect to find the true mean mu by calculating the estimate m based on a sample. The numerical value of the estimator will vary from sample to sample. The square root of the estimator variance is called the standard error. The estimator for a particular parameter that has the lowest variance is called efficient. We call it efficient, because that estimator yields the most precise estimate. If an estimator is unbiased and efficient, it gives us the most precise estimate for the true value of that parameter. Let's now turn to an estimator for the variance. The variance is defined as the expectation of the square of y_i minus mu.
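The two results for the sample mean, expectation mu and variance sigma-squared over n, can be checked by simulation. This Monte Carlo sketch is an addition to the lecture; the parameter values and replication count are chosen arbitrarily.

```python
import numpy as np

# Monte Carlo check: across many independent samples, the sample mean m
# is centred at mu (unbiasedness) and has variance sigma**2 / n.
# mu, sigma, n and reps are assumed values for illustration only.
rng = np.random.default_rng(seed=1)
mu, sigma, n, reps = 5.0, 2.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)           # one sample mean per replication

print(means.mean())                    # close to mu
print(means.var())                     # close to sigma**2 / n = 0.16
```

Increasing n shrinks the spread of the simulated means, matching the claim that a larger sample size lowers the estimator variance.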
But because we do not know mu, we cannot calculate y_i minus mu. Instead we use m and define a random variable z_i equal to y_i minus m. We have to be careful here, because m is a function of all observations y_i. We can compactly show this dependence using linear algebra. The vector z is a linear transformation of y, as you can see on the slide. Note that the outer product of iota with itself yields a matrix that has value one everywhere. The matrix M that transforms y to z has useful properties. I ask you to consider them in the next question. To find the properties of M, it is useful to write out M, as shown on the slide. You can see, and show algebraically, that M is symmetric and that the trace equals n minus 1. The property that M-squared equals M is a bit trickier. Let us start by multiplying out the parentheses. The square of the identity matrix is the identity matrix itself, and iota-prime times iota yields n. Finally, simplify the expression to get the result. We use these results to find the distribution of z. Because z is a linear transformation of normal random variables, it also follows a normal distribution. Its mean vector is zero and its covariance matrix is equal to M times the covariance matrix of y times M-prime. Because M is symmetric and M times M equals M, we get sigma-squared times M. We want to derive an unbiased estimator for the variance, so we need the expectation of the sum of squares. Because the mean of z is zero, this expectation is equal to the sum of the diagonal elements of the covariance matrix of z. So we use the trace of Sigma-z, and that equals sigma-squared times n minus 1. Because the trace is a summation, we can exchange trace and expectation. Please check the steps on the slide that use knowledge from the building blocks on matrices and probability. We use this result to construct the sample variance s-squared as the sum of the squares of y_i minus m, divided by n minus 1. This estimator is unbiased.
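The properties of the demeaning matrix can also be verified numerically. This sketch is not from the lecture; the sample size n and the test vector are arbitrary choices.

```python
import numpy as np

# Numerical check of the properties of M = I - (1/n) * iota * iota-prime:
# M is symmetric, idempotent (M @ M equals M), and has trace n - 1.
n = 6
iota = np.ones((n, 1))
M = np.eye(n) - (iota @ iota.T) / n    # iota @ iota.T: n x n matrix of ones

assert np.allclose(M, M.T)             # symmetric
assert np.allclose(M @ M, M)           # idempotent: M-squared equals M
assert np.isclose(np.trace(M), n - 1)  # trace equals n - 1

# M transforms y into z: it subtracts the sample mean from every y_i.
y = np.arange(1.0, n + 1)
z = M @ y
assert np.allclose(z, y - y.mean())
```

The last check makes the role of M concrete: z = M times y stacks exactly the deviations y_i minus m that enter the sample variance.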
The sum of squares of y_i minus m divided by n yields a biased estimator. The bias is equal to minus sigma-squared divided by n. The reason that we have to divide by n minus one instead of n is the estimation of m. When we estimate the mean from 26 observations, knowledge of 25 observations and the mean is enough to construct the 26th observation. We say that we lose one degree of freedom to estimate the variance, because we have already estimated one other parameter, the mean. It can be shown that the sum of squares of y_i minus m divided by sigma-squared, so z-prime times z divided by sigma-squared, follows a chi-square distribution with n minus 1 degrees of freedom. Also, the sample mean and the sample variance are independent. Let's apply these concepts to our example of the stock market. The normal distribution fits the histogram reasonably well. The sample mean gives an estimate of 9.6%. The sample standard deviation, which is the square root of the sample variance, equals 17.9%. The sample mean is a random variable, so its value is uncertain. Its standard error is equal to 17.9 divided by the square root of 26, which gives 3.5%. To construct a confidence interval around the mean, we use the rule of thumb that for a normal distribution, the probability of a realization within two standard deviations around the mean equals 95%. So an approximate 95% confidence interval for the true mean runs from 2.6% to 16.6%. The final concept of this lecture is consistency. An increase in the sample size n leads to more precise estimates, because the estimator variance decreases. An increase in the sample size can also reduce the bias. If the estimator theta-hat becomes ever more concentrated at the true value theta when the sample size grows, it is called consistent. For consistency, it is sufficient that the expectation of the estimator converges to the true parameter value and that the variance converges to zero. Now, a question for you.
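The arithmetic of the stock-market example can be reproduced directly from the three summary numbers in the lecture (sample mean 9.6%, sample standard deviation 17.9%, n = 26); the code itself is an added sketch.

```python
import math

# Standard error and rule-of-thumb 95% interval for the example:
# m = 9.6%, s = 17.9%, n = 26 yearly observations.
m, s, n = 9.6, 17.9, 26

se = s / math.sqrt(n)                  # standard error of the sample mean
lower = m - 2 * se                     # rule of thumb: roughly two
upper = m + 2 * se                     # standard errors around the mean

print(f"standard error ~ {se:.1f}%")                   # ~3.5%
print(f"95% interval ~ ({lower:.1f}%, {upper:.1f}%)")  # ~(2.6%, 16.6%)
```

Note that the interval comfortably includes zero's neighbour 2.6% as its lower end, so at this sample size we cannot pin the true mean down very precisely, even though the point estimate is clearly positive.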
To answer the question, we check the sufficient conditions. Both estimators are unbiased, so their expectations are equal to the true parameter values. The variance of the sample mean equals sigma-squared divided by n, so it goes to zero when n increases. The variance of the sample variance equals 2 times sigma to the power 4, divided by n minus 1. We use here that a chi-square distribution with k degrees of freedom has variance 2k. Check this derivation on the slide. This quantity also goes to zero when n increases, so both estimators are consistent. With this question, we finish our building block on parameter estimation. I invite you to do the training exercise to practice the topics of this lecture. You can find it on the website.
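The two variance formulas behind the consistency argument can be tabulated for growing n. This closing sketch is an addition; sigma is set to one for illustration.

```python
# Sufficient conditions for consistency: both estimators are unbiased,
# and their variances shrink to zero as n grows.
#   Var(sample mean)     = sigma**2 / n
#   Var(sample variance) = 2 * sigma**4 / (n - 1)
sigma = 1.0  # assumed value, for illustration

for n in (10, 100, 1000, 10000):
    var_mean = sigma**2 / n
    var_s2 = 2 * sigma**4 / (n - 1)
    print(n, var_mean, var_s2)
```

Both columns of the printed table fall toward zero as n grows, which is the pattern the sufficient conditions for consistency require.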