Okay, so far we talked about regression problems and saw how they can be solved using either linear or non-linear regression. We also mastered TensorFlow to do both types of regression and used fundamentals data to build our first machine learning regression models in TensorFlow. Now it's time for us to take another look at machine learning in the setting of probabilistic models. This is because probabilistic models generalize non-probabilistic models. Many machine learning models are either probabilistic or equivalent to particular probabilistic models. For example, as we will see shortly, linear regression is equivalent to a Gaussian probabilistic model. Moreover, because in finance we generally deal with random quantities, probabilistic methods are typically preferred in finance over non-probabilistic ones. After we introduce the probabilistic framework in machine learning, we will be ready to talk about classification tasks in finance. To begin, let's view the machine learning problem of learning from data as a problem of function estimation. Generalizing the examples of regression that we just saw, we can say that all machine learning algorithms are about fitting some function f(X, theta) to some data D, where X is a vector of features and theta is a vector of model parameters. The function f is typically assumed to belong to some family of parameterized functions F_theta. If this family has a fixed number of parameters, then we deal with parametric function fitting. If instead the family has a data-dependent number of parameters, then we get non-parametric function fitting. But what are the criteria by which we select such a function? For example, when we used a mean squared error loss function for regression, why this function? Why not use, for example, a sum of logs of squared differences between the model predictions and the data as a loss function? What drives the choice of a particular loss function?
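As a quick refresher on what fitting a parametric family to data looks like, here is a minimal sketch in plain NumPy (my own illustration, not the lecture's code). It fits the hypothetical parametric family f(x, theta) = theta_0 + theta_1 * x to synthetic data by minimizing the mean squared error loss, which for a linear family has a closed-form least-squares solution:

```python
import numpy as np

# Illustrative sketch: fit the parametric family f(x, theta) = theta0 + theta1 * x
# to data D = {(x_i, y_i)} by minimizing the mean squared error (MSE) loss.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)   # true theta = (2, 3) plus noise

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# For a linear family the MSE is minimized in closed form by least squares
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((X @ theta - y) ** 2)

print(theta)  # estimated parameters, close to the true values (2, 3)
print(mse)    # small residual error left by the noise
```

The same fit could of course be written as a TensorFlow model, as in the earlier regression videos; the point here is only that "learning" means searching the parameter family F_theta for the theta that minimizes the chosen loss.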
It turns out that, at least within probabilistic models, there are principled ways to come up with appropriate loss functions for supervised learning. So let's talk next about probabilistic models. Generally, all supervised learning algorithms belong to one of three major classes. The first class is called probabilistic generative models. These models directly define a model-based approximation to the true data-generating distribution p of x and y. In other words, the output of these models is a joint probability distribution of both inputs x and outputs y, given either by a closed-form expression or via a computer algorithm. Examples of such models include Bayesian models, Gaussian mixtures, and certain types of neural networks, among other models. They will be discussed in more depth in the next course of our specialization. But for now, the point to take away is that because the output of such models is a joint probability distribution of x and y, we can simulate from such distributions. The second class of supervised learning algorithms is another class of probabilistic models called discriminative probabilistic models. They also produce a probability distribution, but here it's a conditional distribution of the output y given x. One example of such models is given by the regression that we discussed above without explicitly invoking any probabilistic assumptions. But as I will show you shortly, it turns out that estimating a regression using a mean squared loss is computationally equivalent to estimating a Gaussian probabilistic model given by this expression, where N is the probability density function of a normal distribution. Note that, unlike with generative models, you cannot simulate from a discriminative model. The most you can do if you want to use such a model with simulations is to simulate xs from some distribution and then simulate ys conditional on these xs using the model definition.
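The equivalence between mean squared loss and a Gaussian model can be checked numerically. The sketch below (my own example, assuming the simple family f(x, theta) = theta * x and a fixed noise scale sigma) compares the negative log-likelihood of a Gaussian model with the MSE over a grid of theta values: because the NLL equals a constant plus the sum of squared residuals divided by 2 sigma squared, both criteria pick the same theta.

```python
import numpy as np

# Sketch: for y = f(x, theta) + eps with eps ~ N(0, sigma^2), the negative
# log-likelihood is
#   NLL(theta) = N/2 * log(2*pi*sigma^2) + sum_i (y_i - f(x_i, theta))^2 / (2*sigma^2)
# so for fixed sigma, minimizing NLL over theta is the same as minimizing MSE.

def gaussian_nll(theta, x, y, sigma=1.0):
    resid = y - theta * x                      # f(x, theta) = theta * x here
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

def mse(theta, x, y):
    return np.mean((y - theta * x) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.3, size=200)  # true theta = 1.5

thetas = np.linspace(0.0, 3.0, 301)
best_nll = thetas[np.argmin([gaussian_nll(t, x, y) for t in thetas])]
best_mse = thetas[np.argmin([mse(t, x, y) for t in thetas])]

print(best_nll, best_mse)  # the two criteria select the same theta
```

This is exactly the sense in which the MSE loss is not an arbitrary choice: it is the loss implied by assuming Gaussian noise around the model prediction.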
And finally, there exist machine learning algorithms that are non-probabilistic in nature; in particular, support vector machines and some types of neural network models belong to this class. We will talk about such algorithms in follow-up courses of this specialization, but here I want to talk about the first two types of algorithms, which are both based on probabilistic frameworks. Within probabilistic models, a general way to build a model that fits some data D can be obtained by invoking Bayes' rule. Here is how it goes. If we are given a model M defined by some function f(x, theta), then model estimation can be formulated as the question: what are the most probable values of the parameters theta, given the model and the data D? This is equivalent to asking for the value of the vector theta that maximizes the probability of this value of theta given model M and data D. But how do we work with such expressions? What is their meaning? Let's use Bayes' rule to write this probability as a product of a prior probability p of theta conditional on M, times the probability of the data D conditional on M and theta, divided by the total probability of the same data D given model M. We can also write the denominator in this expression more explicitly, like this. The new expression for the posterior probability of theta given data D and model M shows that the denominator here is just the integral of the numerator over all possible values of theta. So it serves as a normalization factor which is independent of theta and is needed only to ensure the right normalization of our posterior probability, so that it integrates to one when integrated over all possible values of theta. Our final expression is also worth stating in words. In this form, it says that the posterior probability is given by the prior probability times the likelihood of theta, divided by the evidence, which is a common name for the denominator in this formula. Let's briefly go over the meaning of each term in this expression.
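The whole construction can be made concrete with a small numerical sketch (my own illustration, under the assumption of a one-dimensional theta, a standard normal prior, and data drawn from N(theta, 1)). On a grid of theta values, we multiply the prior by the likelihood, integrate the numerator over theta to get the evidence, and divide to obtain a properly normalized posterior:

```python
import numpy as np

# Numerical sketch of Bayes' rule:
#   p(theta | D, M) = p(theta | M) * p(D | theta, M) / p(D | M)
# where the evidence p(D | M) is the integral of the numerator over all theta.
rng = np.random.default_rng(2)
data = rng.normal(loc=0.8, scale=1.0, size=50)        # D: draws from N(0.8, 1)

thetas = np.linspace(-2.0, 2.0, 2001)
dtheta = thetas[1] - thetas[0]
prior = np.exp(-0.5 * thetas**2) / np.sqrt(2 * np.pi)  # prior p(theta | M) = N(0, 1)

# log-likelihood log p(D | theta, M) for the model x_i ~ N(theta, 1)
loglik = np.array([-0.5 * np.sum((data - t) ** 2) for t in thetas])
loglik -= loglik.max()                                 # subtract max for stability

numerator = prior * np.exp(loglik)
evidence = np.sum(numerator) * dtheta                  # normalization factor p(D | M)
posterior = numerator / evidence

print(np.sum(posterior) * dtheta)     # integrates to one over all theta
print(thetas[np.argmax(posterior)])   # posterior mode, close to the sample mean
```

Note that the evidence is computed only to normalize the posterior; it does not depend on any particular theta, just as stated above.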
First, as its name suggests, the prior distribution p of theta given M reflects our prior beliefs about what the parameters theta should be, formed before we saw the data. The likelihood gives the probability of observing the data D given model M and parameters theta. Therefore, it quantifies the quality of the match between your model and the data: the better your model M, or the value of theta for a given M, the more likely the data actually observed would be under the model's assumptions. Let's pause here for a moment to see how well you understand Bayes' rule. So to summarize, the Bayes formula is as simple as it is deep. In the next video, we will see some specific cases that can be derived based on this formulation.
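To make the "quality of match" interpretation of the likelihood tangible, here is a tiny sketch (my own example, again assuming the model M under which each observation is drawn from N(theta, 1)): a theta close to the value that generated the data gives the observed data a higher log-likelihood than a theta far from it.

```python
import numpy as np

# The likelihood p(D | theta, M) scores how well the model matches the data.
# Model M (an assumption for this illustration): each observation ~ N(theta, 1).
rng = np.random.default_rng(3)
data = rng.normal(loc=1.0, scale=1.0, size=100)    # data generated with theta = 1.0

def log_likelihood(theta, data):
    # log p(D | theta, M) for the Gaussian model N(theta, 1)
    n = len(data)
    return -0.5 * np.sum((data - theta) ** 2) - 0.5 * n * np.log(2 * np.pi)

good = log_likelihood(1.0, data)   # theta near the data-generating value
bad = log_likelihood(3.0, data)    # theta far from it

print(good > bad)  # True: the better theta makes the observed data more likely
```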