So far, we've seen examples of choosing priors that contain a significant amount of information. You've also seen some examples of choosing priors where we're attempting to not put too much information in to keep them vague. Another approach is referred to as objective Bayesian statistics or inference where we explicitly try to minimize the amount of information that goes into the prior. This is an attempt to have the data have maximum influence on the posterior which mention further this as non-informative priors. For example, let's go back to coin flipping or data comes from Bernoulli distribution with unknown parameter theta. How do we minimize our par information in theta? One obvious intuitive approach is to say that all values of theta are equally likely. So we could have a prior for theta which follows a uniform distribution on the interval 0, 1. Saying all values of theta are equally likely seems like it would have no information in it. Recall however, that a uniform 0, 1 is the same as a beta 1, 1. The effective sample size of a beta prior is the sum of it's two parameters. So in this case it has an effective sample size of two. This is equivalent to data, with one head and one tail already in it. So this is not a completely non informative prior. We could think about a prior that has less information. For example, a beta 1/2, 1/2. This would have only half as much information as an effective sample size of just one. We can take this even further. Think about something like a beta 0.001, 0.001. This would have much less information, have a sample fairly close to 0. In this case, the data would determine the posterior and there would be very little influence from the prior. Can we go even further? In fact we can, we can think of the limiting case. Something that we can think of as a beta 0,0. What would that look like? Well in this case we can say, its density is proportional to theta to the minus 1, 1 minus theta to the minus 1. This is not a proper density. If you integrate this over 0 to 1, you'll get an infinite integral, so it's not a true density in the sense of integrating to 1. There's no way to normalize it, it has an infinite integral. This is what we refer to as an improper prior. It's improper in the sense that it doesn't have a proper density. But it's not necessarily improper, in the sense that we can't use it. If we collect data, we use this prior and as long as we observe at least one head and at least one tail. Or one's success and one's failure then we can get a posterior, f of theta given y, which is proportional to theta to the y minus 1 minus theta to the n minus y minus 1. This you'll recognize as a beta distribution with parameters y and n minus y. It's posterior mean Will be y over n. This you should recognize as theta half the maximum likely estimate, the LME. So by using this improper prior, we get a posterior which gives us point estimates exactly the same as the frequentest approach. But in this case, we can also think of having a full posterior. And so if we want to make interval statements, probability statements, we can actually find an interval and say that there's 95% probability that theta is in this interval. Which is not something you can do under the frequentest approach even though we may get the exact same interval. Key concepts here that I want to state in terms of using improper priors. The first is that improper priors are okay as long as the posterior itself is proper. There may be some mathematical things that need to be checked and you may need to have certain restrictions on the data. In this case, we need to make sure that we observe at least one head and one tail to get a proper posterior. But as long as the posterior is proper, we can go forward and do Bayesian inference even with an improper prior. The second point is that for many problems there does exist a prior, typically an improper prior. That will lead to the same point estimates as you would get under the frequentest paradigm. So we can get very similar results, results that are fully dependent on the data, under the Bayesian approach. But in this case, we can also continue to have a posterior and make posterior interval estimates and talk about posterior probabilities of the parameter. Another example is thinking about the normal case. So think about Y sub i being iid normal with mu and variant sigma squared. Let's start off by assuming that sigma squared is known and we'll just focus on the mean mu. We can think about a vague prior Say mu is normally distributed with mean 0 and variants a million squared. That would just spread things out across the real line. You can take a wide variety of possible values. That would be fairly non informative across a lot of possibilities. We can then think about taking the limit. What happens if we let the variance go to infinity? In that case, we're basically spreading out this distribution across the entire real line. And so we could say, we have a density which is proportional to one. It's just constant across the whole real line. Clearly, this is an improper prior because if you integrate the real line you get an infinite answer. However, if we go ahead and plug this into finding a posterior f of mu given y is proportional to, f(y) given mu f(y). The likelihood here is proportional to e to the minus 1 over 2 sigma squared, sum of y sub i minus mu squared and our prior is just 1. So this is just the likelihood that we recover here, which we can simplify to be e to the minus 1 over 2 sigma squared over n, mu minus y bar quantity squared. This we see that the posterior for mu given y follows a normal distribution with mean y bar and variance sigma squared over n. This should look just like the maximum likelihood estimate. In the case that sigma squared is unknown, the standard non informative prior is f sigma squared is proportional to 1 over sigma squared. This is equivalent to an inverse gamma with both parameters equal to 0. This is an improper prior and it's uniform on the log scale of sigma squared. In this case, we'll end up with a posterior for sigma squared which is an inverse gamma with parameters n minus 1 over 2, and 1/2 the sum of y sub i minus y bar squared. This should also look reminiscent of quantities we get as a frequentist. For example, the samples standard deviation.