We've previously talked about the importance of allowing CPD representations that encode additional structure in the local dependency model of a variable on its parents, and we talked about the case of tree CPDs, which allow a variable to depend on different parents in different contexts. But none of that helps us deal with the situation that we used as motivation, where we have a variable such as, for example, cough, that depends on multiple different factors: pneumonia, flu, tuberculosis, bronchitis, and so on. This doesn't lend itself to a tree CPD, because it's not the case that you depend on one parent only in certain contexts and not in others. Really, you depend on all of them, and all of them contribute something to the probability of exhibiting a cough.

One way of capturing that kind of interaction is a model called the noisy OR CPD. The noisy OR is best understood by considering a slightly larger graphical model, where we break down the dependency of Y on its parents X_1 up to X_k by introducing a set of intervening variables. So let's imagine that Y is, again, the cough variable, and the X_i's are different diseases. What we're doing here is introducing an intermediate variable Z_i that captures the event that disease X_i, if present, causes a cough by itself: Z_1 is the event that X_1 by itself causes Y. You can think of each disease as a noisy transmitter: if X_1 is true, then Z_1 says whether X_1 succeeded in its attempt to make Y true. X_2 has its own little filter Z_2, and Z_2 makes that same decision relative to X_2. Ultimately, Y is true if someone succeeded in making it true, which means that Y is a deterministic OR of its parents Z_0 up to Z_k.

Now let's make this a little more precise. For the probability of Z_i being true given X_i, there are two cases. If X_i = 0, then X_i isn't even trying to make Z_i true, so the probability that Z_i = 1 is zero: if X_i is not true, Z_i never gets turned on. If X_i is true, then there's some probability, called lambda_i, of Z_i getting turned on, where lambda_i is in the interval [0, 1]. You can think of lambda_i as defining a kind of penetrance: how good is X_i at turning Y on? In addition, Z_0 captures what's called the leak: the probability that Y gets turned on just by itself, which happens with probability lambda_0.

Now let's write this down as a probability, and consider the CPD entry for Y = 0, that is, Y doesn't get turned on, given some assignment to X_1 up to X_k. We're asking: what is the probability that all of these causes fail to turn on the variable Y? Y fails to get turned on when, first, it doesn't get turned on by the leak, which happens with probability (1 - lambda_0), times the probability that none of the causes that are on, that is, the X_i's equal to one, turn Y on, which is the product of (1 - lambda_i) over all of the X_i's that are on:

P(Y = 0 | X_1, ..., X_k) = (1 - lambda_0) * product over {i : X_i = 1} of (1 - lambda_i)

That gives us the probability of Y = 0, and the probability of Y = 1 is just the complement, one minus that. So that's the noisy OR CPD, and you can generalize this to a much broader notion of independence of causal influence.
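As a concrete aside (this is my own illustration, not part of the lecture), here is a minimal Python sketch of the noisy OR CPD just described; the function name noisy_or and the example numbers are invented for illustration:

```python
import numpy as np

def noisy_or(lam0, lam, x):
    """P(Y = 1 | X_1, ..., X_k) under a noisy-OR CPD.

    lam0 : leak probability lambda_0 (Y can turn on by itself).
    lam  : penetrance parameters lambda_i in [0, 1], one per parent.
    x    : 0/1 array of parent values; x[i] = 1 means cause X_i is active.
    """
    lam = np.asarray(lam, dtype=float)
    x = np.asarray(x)
    # Y = 0 only if the leak fails AND every active cause fails to fire.
    p_y0 = (1.0 - lam0) * np.prod(1.0 - lam[x == 1])
    return 1.0 - p_y0

# Only disease X_1 present: Y stays off only if both the leak (prob 0.9 of
# failing) and X_1's transmitter (prob 0.2 of failing) fail, so
# P(Y = 1) = 1 - 0.9 * 0.2 = 0.82.
print(noisy_or(0.1, [0.8, 0.6], [1, 0]))
```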
This is called independence of causal influence because it assumes that you have a bunch of causes for a variable, and each of them acts independently to affect the truth of that variable. There are no interactions between the different causes: each has its own separate mechanism, and ultimately everything is aggregated together in a single variable Z, from which the truth of Y is determined from the aggregate of all of the individual effects Z_i of the different causes.

One example of this we've already seen: the noisy OR. But it easily generalizes to a broad range of other cases. There are noisy ANDs, where the aggregation function is an AND. There are noisy maxes, which apply in the non-binary case, when causes might not just be on or off but rather have different extents of being turned on, and Z is then the maximal extent of the independent effect of each cause. And so on: there's a large range of different models, all of which fit into this family. Noisy OR is probably the one that's most commonly used, but the others have also been used in other settings.

One model that might not immediately be seen to fit into this framework, but actually does, is the sigmoid CPD. So what's a sigmoid CPD? A sigmoid CPD says that each X_i induces a continuous quantity Z_i = W_i * X_i. So if each X_i is discrete, then Z_i is just the continuous value W_i * X_i, where W_i is a weight that parameterizes this edge and tells us how much force X_i exerts on making Y true. If W_i is zero, X_i exerts no influence whatsoever; if W_i is positive, X_i makes Y more likely to be true; and if W_i is negative, X_i makes Y less likely to be true. All of these influences are aggregated together in a variable Z, which adds up all of the different influences plus an additional bias term W_0.

Now we need to turn this into a probability for the variable Y, which is the variable we actually care about. To do that, we pass the continuous quantity Z, a real number between negative infinity and infinity, through a sigmoid function. The sigmoid function, which some of you have seen before in the context of machine learning, takes the continuous value Z, exponentiates it, and divides by one plus that exponential: sigmoid(Z) = e^Z / (1 + e^Z). Since e^Z is a positive number, this always gives us a number in the interval (0, 1). If we plot this function, with Z on the x axis and sigmoid(Z) on the y axis, we see that as Z gets very negative the probability goes to zero, as Z gets very high the probability gets close to one, and in between there's an interval where intermediate values are taken. So this is a kind of squashing function that saturates the output at both ends.
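Again as an illustrative aside, here is a matching Python sketch of the sigmoid CPD for binary parents; the names sigmoid_cpd, w0, and the example weights are my own choices, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    """sigmoid(z) = e^z / (1 + e^z), always in the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))  # algebraically equivalent form

def sigmoid_cpd(w0, w, x):
    """P(Y = 1 | X_1, ..., X_k) under a sigmoid CPD.

    w0 : bias term W_0.
    w  : per-parent weights W_i; sign and magnitude set each X_i's influence.
    x  : 0/1 array of parent values.
    """
    z = w0 + np.dot(w, x)  # aggregate the independent influences Z_i = W_i * X_i
    return sigmoid(z)

# Two active positive causes (weights 2.0 and 1.5) and one inactive
# inhibitory cause (weight -0.5): Z = -1.0 + 2.0 + 1.5 = 2.5, and
# sigmoid(2.5) is roughly 0.924.
print(sigmoid_cpd(-1.0, [2.0, 1.5, -0.5], [1, 1, 0]))
```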
Let's look at the behavior of the sigmoid CPD as a function of its parameters. Consider a case where all of the X_i's have the same weight W, plotted with the value of W on one axis and the number of X_i's that are true on the other. Along the second axis, the more parents that are on, the higher the probability of Y being true, and this holds for any value of W because these are all positive influences: the more parents are true, the more things are pushing Y to take the value true. Along the weight axis, we can see that for low weights you need an awful lot of X_i's to be on to get Y to be true, but as W increases, Y becomes true with far fewer positive influences. The graph on the right is what we get when we increase the amplitude of the whole system by multiplying both W and W_0 by a factor of ten. Z effectively gets multiplied by ten, so the exponent gets pushed to extreme values much more quickly, and the transition becomes considerably sharper. That gives us some intuition for how the sigmoid CPD behaves.

So what are some examples of an application of this? I showed this network in an earlier part of the course: the CPCS network, which was developed at Stanford Medical School for the diagnosis of internal diseases. Up at the top we have variables that represent predisposing factors, and there's actually a fairly eclectic range here. For example, one predisposing factor is intimate contact with small rodents, because that's a contributing factor for hantavirus. So there's a whole range of predisposing factors. In the middle we have diseases, and down at the bottom we have symptoms and test results.

As I previously mentioned, there are approximately 500 variables in this network, and each takes on average about four values. So the total number of entries in a joint distribution over this space would be approximately 4^500, which is clearly an intractable number. If we take this distribution and represent it in the factorized form given by the network structure shown in this diagram, we get considerable sparsification: approximately 134 million parameters, which is still far too many for a human to estimate. By using, as was done in this case, a noisy max CPD, the builders brought the number down to about 8,000 total parameters for the network, which is a much more tractable number to deal with.
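To see why structured CPDs give this kind of saving, here is a back-of-the-envelope sketch; it is my own illustration using the binary noisy OR case rather than the actual CPCS structure, and the function names are invented. It compares the exponential growth of a full table CPD with the linear growth of a noisy OR CPD:

```python
def table_cpd_entries(k, d=2):
    """Independent entries in a full table CPD: (d - 1) free probabilities
    for each of the d^k assignments to k d-valued parents."""
    return (d - 1) * d ** k

def noisy_or_params(k):
    """A noisy-OR CPD needs one lambda_i per parent plus the leak lambda_0."""
    return k + 1

for k in (5, 10, 20):
    print(f"{k:2d} parents: table = {table_cpd_entries(k):>9,}  "
          f"noisy-OR = {noisy_or_params(k)}")
# With 20 binary parents, a full table needs over a million entries,
# while the noisy-OR CPD needs just 21 parameters.
```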