Hello and welcome back to week four of computational neuroscience. This week we will be talking about information theory. We'll be exploring information theory as a way to evaluate the coding properties of a neural system. So, going back to thinking of spiking output as binary strings of 0s and 1s: how good a code do these spike trains provide? We'll use information theory and related ideas to understand how the coding properties of our nervous system might be specially structured to accommodate the complex structure of the natural environment. Today we'll be addressing three things. We'll start by talking about entropy and information, defining our terms. Then we'll talk about how to compute information in neural spike trains, and finally we'll explore how information can tell us about coding.

So let's go back to our well-worn paradigm, a monkey choosing right from left, and suppose we're watching the output of a neuron while different stimuli appear on the screen. Here's an example spike train: a time sequence in which we mark a spike in a given time bin with a one, and silence with nothing. Now here's another example. And another. Hopefully, when these oddball symbols appeared, either stimulus or spike, you felt a tiny bit of surprise. Information quantifies that degree of surprise. Let's say there is some overall probability p that there's a spike in some time bin, and a probability 1 - p that there's silence. Then the surprise for seeing a spike is defined as minus log base 2 of that probability: that's the information we get from seeing a spike. And the information we get from seeing silence is minus log base 2 of 1 - p, the probability of seeing the silence.

So why does the information have this form? Like my husband and me, some of you probably play squash. If you do, you'll know that what you're trying to do is put the ball somewhere that will surprise your partner. If you're a remarkable player, you can put the ball anywhere in the court. If your partner has one bit of information, he knows which half of the court the ball is in: there was an equal probability of its being in either half, but once he gets that bit he knows which. Each additional bit of information cuts the possibilities down by an additional factor of two. So what we're really doing is multiplying probabilities: the probability of being in the left half is p = 1/2, and the probability of being in the front half of the court is an additional 1/2. Taking the negative log base 2 turns this product into 1 + 1, two bits to specify being in the front left corner.

Now that we have a sense of what information is, we can understand entropy. Entropy is simply the average information of a random variable, so entropy measures variability. I'll warn you right now that in the future I'll usually drop this base 2 on the log and just assume it: entropies are always computed with log base 2, and their units are bits. An intuitive way to think about this is that the entropy counts the number of yes/no questions, as we saw in the case of the squash game, that it takes to specify a variable. Here's another example. Let's say I drive down from Seattle to Malibu and park in the valet parking. When I come back to get my car, the car park attendant is not very helpful and won't tell me where my car is; he'll only grunt for yes answers. The car could be in any of, say, eight spots.
How many questions will it take before I can find it? So: is it on the left? Grunt. Is it on the top? Grunt. Is it on the top left? Grunt. So what's the entropy of this distribution? Let's calculate it. Remember, we defined the entropy, which I'll call H, as minus the sum over the probabilities times the log of the probability: H = -sum_i p_i log2(p_i). What is p_i in this case? The probability of being in any one of these locations is 1/8, and that's the same for every location in this car park. So H = -sum over the eight spots of (1/8) log2(1/8). What is that? Remember that 8 = 2^3, so log2(1/8) = -3, and each term is -(1/8)(-3) = 3/8. Adding that up over the eight possibilities gives 3 bits. As we saw, it took three questions to specify my car, and that's exactly the entropy of this distribution.

Now let's go back to our coding sequences. Here are a few different examples; which of these do you think has the most intrinsic capacity for encoding? Encoding relies on the ability to generate stimulus-driven variations in the output. If an output has no variation, as in this case, we're not very optimistic about its ability to encode inputs. These three sequences differ in their variability; which do you think has the most inherent coding capacity? We can use the entropy to quantify that variability. So what does having a large entropy do for a code? It gives the greatest possibility for representing inputs: the more intrinsic variability there is, the more capacity that code has for representation. In this simple case, we can compute the entropy as a function of the probability p, where the other possibility has probability 1 - p. The entropy is H = -p log2(p) - (1 - p) log2(1 - p). When we plot that as a function of the spike probability, which here we call p, we find that there is a maximum. So what's the value of p at which H has a maximum? It's p = 1/2. In that case, the two symbols are used equally often. This is a concept we'll come back to at the end of this lecture.

Let's go back to squash. We had the possibility of the ball being anywhere in the court. Generally, you're not able to put the ball anywhere with equal probability, and it's exactly this reduction in possibility that makes it even possible to play. You can model your opponent's p(x), the probability of placing the ball somewhere in the court, and to some extent predict where the ball will be. The lower the entropy of your opponent's p(x), the more easily you'll defeat him.

So let's come back, finally, to our spike code. We now appreciate that the entropy tells us about the intrinsic variability of our outputs, but obviously we really need to consider the stimulus and how it's driving those responses. Here's an example: the stimulus can take one of two directions, and each is perfectly encoded by either a spike or no spike. Here's the stimulus, and here's the spiking response: every time there's a rightward stimulus, we get a spike. How about this case? We'd probably still be comfortable saying that the response is encoding the stimulus; these two are perfectly correlated. On the other hand, there are several other events that are misfires: in this case, the stimulus occurred with no spike, and in this case, there was a spike with no stimulus.
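To make these calculations concrete, here is a minimal Python sketch (the function names are my own illustration, not from the course materials). It computes the surprise of a single event, the entropy of the uniform eight-spot car-park distribution, and the binary entropy as a function of p, whose maximum sits at p = 1/2.

```python
import numpy as np

def surprise(p):
    """Information (in bits) gained from observing an event of probability p."""
    return -np.log2(p)

def entropy(probs):
    """Entropy in bits of a discrete distribution, ignoring zero-probability terms."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

# Squash: left half (p = 1/2) and then front half (another 1/2) -> 2 bits total.
print(surprise(0.5 * 0.5))          # 2.0 bits to specify the front-left quarter

# Car park: eight equally likely spots -> three yes/no questions.
print(entropy(np.ones(8) / 8))      # 3.0 bits

# Binary entropy H(p) = -p log2(p) - (1 - p) log2(1 - p), maximized at p = 1/2.
p = np.linspace(0.01, 0.99, 99)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(p[np.argmax(H)])              # ~0.5, where both symbols are used equally often
```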
But how about this? At least at a glance, there seems to be little or no relationship between the responses and the stimulus. Just as a sidebar: what if the problem were not so much that our code is noisy, but that we haven't exactly understood what the code is doing? That is, maybe there's some temporal sequence S that should more appropriately be thought of as the true stimulus. This is really the question we were addressing in week two: how do we know what our stimulus was? But let's go back to the main question. What we really want to know is: how much of the variability that we see here in R is actually used for encoding S? We need to incorporate the possibility of error.

Let's do that by assuming that a spike is generally produced in response to the plus stimulus, but there's also some probability that there will be no spike; we'll quantify that using the error probability q. So the probability of a correct response in this case is 1 - q, and the probability of an incorrect response is q. And let's assume the same error probability for the silence response. Now we would like to know how much of the entropy of our responses is accounted for by noise, by these errors, because that's going to reduce the response's capacity to encode S. The way we can address that is to compute how much of the response entropy can be assigned to the noise. That is, if we give a plus stimulus, a rightward stimulus, we get a variety of responses, and those conditional responses for a fixed S have some entropy of their own. Similarly when we give the minus stimulus. We call these stimulus-conditioned entropies the noise entropy.

This brings us to the definition of the mutual information, the amount of information that the response carries about the stimulus. It is given by the total entropy minus the average noise entropy; that is, the total response entropy minus the entropy that the responses r have for some fixed s, averaged over s: I(R;S) = H(R) - sum_s P(s) H(R|s). That's drawn out here: here's the total entropy of the responses, and here's the conditional entropy, the entropy of the responses conditioned on a particular stimulus s, averaged over s.

Now let's go back to our binomial calculation and see how the mutual information depends on the noise. We'll fix p at the value that maximizes the entropy, p = 1/2, and vary the noise probability, again assuming the noise is the same for spike and silence; that is, there is one value q. This should be intuitive: when there's no noise entropy, the information is just the entropy of the response, which in this case is one bit. As the error probability grows larger and larger, spiking is less and less likely to actually represent the stimulus S, and the mutual information decreases. When the error probability reaches one half, that is, when responses occur at chance, there's no mutual information between R and S.

So let's just check that everyone's still on board. More generally, what are the limits? If the response is unrelated to the stimulus, what is P(r|s)? It's simply P(r), the probability of the response, because there's no relationship between response and stimulus. So the noise entropy is equal to the total entropy, and the difference between the response entropy and the noise entropy is zero. At the opposite extreme, the response is perfectly predicted by the stimulus, so the noise entropy is zero.
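Here is a small Python sketch of that binary example, assuming the symmetric error model described above with equally likely stimuli. Since the total response entropy is then 1 bit and the noise entropy is the binary entropy of q, the mutual information falls from one bit at q = 0 to zero at q = 1/2.

```python
import numpy as np

def binary_entropy(q):
    """H(q) = -q log2(q) - (1 - q) log2(1 - q), in bits, with 0 log 0 taken as 0."""
    q = np.asarray(q, dtype=float)
    h = np.zeros_like(q)
    mask = (q > 0) & (q < 1)
    h[mask] = -q[mask] * np.log2(q[mask]) - (1 - q[mask]) * np.log2(1 - q[mask])
    return h

# Stimuli + and - are equally likely, and the response is flipped with error probability q.
# Then P(spike) = 1/2, so the total response entropy is 1 bit and the noise entropy is H(q).
q = np.linspace(0.0, 0.5, 6)
mutual_information = 1.0 - binary_entropy(q)

for qi, mi in zip(q, mutual_information):
    print(f"q = {qi:.1f}  ->  I(R;S) = {mi:.3f} bits")
# q = 0.0 gives 1 bit (a perfect code); q = 0.5 gives 0 bits (responses at chance).
```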
So the mutual information will be given by the total entropy of the response: all of the response's coding capacity is used in encoding the stimulus.

Now let's see how this works for continuous variables. We've talked a lot about binary choices; let's think more generally about cases where we have some continuous response r and some response variability in the encoding of a stimulus s by r. Here's an example where we've given several different stimuli. Each of these distributions is the probability of the response given a particular choice of the stimulus, and each is weighted by the probability of that stimulus. When we add all of these conditional distributions together, we get the full probability P(r). So what we do to get the total entropy is compute the entropy of this blue distribution, P(r). Then we compute the entropy of each of these conditional distributions and average them over the stimuli that drove them.

These two cases differ in the amount of intrinsic noise in each response. In this case, when we give a stimulus s, there's some range of variability that takes up a small part of the range of r. In this case, when we give that same stimulus, the noise stretches over a much wider range of the response distribution, so much more of the variability in r is accounted for by variability in responses to specific stimuli. I hope you can see that this set of response distributions is going to encode much more information about S, that the mutual information between S and R is much larger in this case than it is for this case.

Let's play a little with these distributions, because I want to demonstrate a couple of things that I think really illustrate why information is a useful and intuitive measure of the relationship between two variables. I'm using capital letters to denote the random variables and lower-case letters to denote specific samples from those random variables. What I'd like to show you is that the information quantifies how far from independent the two random variables R and S are. To demonstrate that, I'm going to use the Kullback–Leibler divergence, a measure of the difference between probability distributions that we introduced earlier. If the mutual information measures independence, then we'd like to quantify the difference between the joint distribution of R and S and the distribution these two variables would have if they were independent; that distribution would simply be the product of their marginal distributions.

First, to refresh your memory, let's write down the KL divergence again. Between two probability distributions, say P and Q, it is D_KL(P, Q) = ∫ dx P(x) log2[P(x) / Q(x)]. Now let's apply that to our two distributions: ∫ ds dr P(r, s) log2[P(r, s) / (P(r) P(s))], the joint distribution times the log of the joint distribution divided by the product of the marginals. We can rewrite the joint using the conditional distribution, P(r, s) = P(r|s) P(s), so the ratio becomes P(r|s) P(s) / [P(r) P(s)]. Now you can see that P(s) cancels out, and we can expand the log into the difference of two terms: log2 P(r|s) - log2 P(r). All right.
Now let's concentrate on these terms. The first is minus the integral over ds dr of P(s, r) times log2 P(r). The second is plus the integral over ds dr of P(s, r) times log2 P(r|s); let's break that joint distribution up again into its marginal and conditional, P(s) P(r|s). Now look at the terms we've developed. In the first, we can simply integrate the s part out of the joint distribution, and what remains is just the entropy of P(r). The second is the entropy of P(r|s) averaged over s with weight P(s), with a minus sign: minus ∫ ds P(s) H(R|s). So what I've shown you is that this form, in terms of the Kullback–Leibler divergence, gives us back the form we've already seen: the entropy of the responses minus the average over the stimuli of the noise entropy for a given stimulus.

What I hope you realize is that everything we've done here in terms of response and stimulus we could simply flip: swap response and stimulus, redo the same calculation, and instead end up with the entropy of the stimulus minus an average over the responses of the entropy of the stimulus given the response. So information is completely symmetric in the two variables it's computed between: the mutual information between response and stimulus is the same as the mutual information between stimulus and response.

So here's our grandma's famous mutual information recipe. To compute the mutual information, we take a stimulus s, repeat it many times, and that gives us the probability of the responses given s. We compute the variability due to the noise, that is, the noise entropy of these responses: for a given value of s, we compute the entropy of the responses for that s, we repeat this for all s, and then we average over s. Finally, we compute the probability of the responses, which is just given by the average over all the stimuli that we presented of the probability of the response given the stimulus, and that gives us the total entropy of the responses. The mutual information is then the total entropy minus the average noise entropy.

In the next section, we'll be applying these ideas to calculating information in spike trains. There will be two methods that we work with: one starts by calculating information in spike patterns, and then we'll calculate information in single spikes.
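To make that recipe concrete, here is a minimal Python sketch under stated assumptions: the toy stimulus–response model is the symmetric binary code with error probability q from earlier in this lecture, and the variable names are my own, not from the course materials. It simulates repeated presentations of two stimuli, estimates P(r|s) and P(r) by counting, and computes the mutual information as total entropy minus average noise entropy. It also checks the Kullback–Leibler form, the divergence between the joint distribution and the product of the marginals, which should give the same number.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Entropy in bits of a discrete distribution (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy experiment: two stimuli (+ = 1, - = 0), each presented many times.
# The response copies the stimulus but is flipped with error probability q.
q = 0.1
n_trials = 100_000
stimuli = rng.integers(0, 2, n_trials)
errors = rng.random(n_trials) < q
responses = np.where(errors, 1 - stimuli, stimuli)

# Step 1: for each stimulus s, estimate P(r|s) and its (noise) entropy.
# Step 2: average the noise entropies over P(s).
# Step 3: estimate P(r) and its total entropy; the difference is the information.
p_s = np.bincount(stimuli, minlength=2) / n_trials
noise_entropy = 0.0
p_r = np.zeros(2)
for s in (0, 1):
    p_r_given_s = np.bincount(responses[stimuli == s], minlength=2) / np.sum(stimuli == s)
    noise_entropy += p_s[s] * entropy(p_r_given_s)
    p_r += p_s[s] * p_r_given_s

total_entropy = entropy(p_r)
mi = total_entropy - noise_entropy
print(f"I(R;S) = {mi:.3f} bits (exact value for q = 0.1 is about 0.531)")

# The same quantity as a KL divergence between the joint distribution and the
# product of the marginals, which also makes the symmetry in R and S explicit.
joint = np.array([[np.mean((stimuli == s) & (responses == r)) for r in (0, 1)] for s in (0, 1)])
independent = np.outer(p_s, p_r)
mi_kl = np.sum(joint * np.log2(joint / independent))
print(f"D_KL(P(s,r) || P(s)P(r)) = {mi_kl:.3f} bits")
```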