0:00

[MUSIC].

Hello and welcome back to week four of computational neuroscience.

This week we will be talking about information theory. We'll be exploring information theory as a way to evaluate the coding properties of a neural system. So, going back to thinking about spiking output as binary strings of 0s and 1s: how good a code do these spike trains provide? We'll explore using information theory and related ideas as a way to understand how the coding properties of our nervous system might be specially structured to accommodate the complex structure of the natural environment.

So today we'll be addressing three things. We're going to start by talking about entropy and information, defining our terms. Then we're going to talk about how to compute information in neural spike trains, and finally we will explore how information can tell us about coding.

So let's go back to our well-worn paradigm: a monkey choosing right from left. And suppose we're watching the output of a neuron while different stimuli appear on a screen. Here's an example spike train: a time sequence in which we mark a spike in a given time bin with a one, and a silence with nothing. Now, here's another example. And another. So hopefully, when these oddball symbols appeared, either stimulus or spike, you felt a tiny bit of surprise. Information quantifies that degree of surprise.

Â 1:35

Let's say there was some overall probability p that there's a spike in some time bin, and 1 minus p that there's silence. Then the surprise for seeing a spike is defined as minus log base 2 of that probability; that's the information that we get from seeing a spike. And the information that we get from seeing silence is minus log base 2 of 1 minus p, the probability of seeing the silence.

So why does the information have this form? Like my husband and I, some of you probably play squash. And if you do, you'll know that what you're trying to do is put the ball somewhere that will surprise your partner. If you're a remarkable player, you can put the ball anywhere in the court. If your partner has one bit of information, he knows which half of the court the ball is in. There was an equal probability of its being in either, but once he gets that bit, he knows which half. Each additional bit of information cuts the possibilities down by an additional factor of two. So what we're really doing is multiplying probabilities: the probability of being in this half is p equals one half.

Â 2:45

The probability of being in the front half of the court is an additional one half. Taking the negative log base 2 turns this product of probabilities into a sum, 1 plus 1: two bits to specify being in the front left corner.
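This adding-up of bits can be checked in a couple of lines. Here's a minimal sketch of my own (not from the lecture):

```python
import math

def surprise(p):
    """Information (surprise) in bits of an outcome with probability p."""
    return -math.log2(p)

# One bit narrows the court to the left half; an independent second bit
# narrows it to the front-left quarter. Multiplying probabilities adds bits,
# because log(a * b) = log(a) + log(b).
print(surprise(1 / 2))          # 1.0 bit
print(surprise(1 / 2 * 1 / 2))  # 2.0 bits
```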

So now that we have a sense of information, we can understand entropy. Entropy is simply the average information of a random variable, so entropy measures variability. I'll warn you right now that in the future I'll usually drop this base 2 on the log and just assume it; entropies are always computed in log base 2, and their units are bits. An intuitive way to think about this is that the entropy counts the number of yes/no questions, as we saw in the case of the squash game, that it takes to specify a variable.

So here's another example. Let's say I drive down from Seattle to Malibu and park in the valet parking. When I come back to get my car, the car park attendant is not very helpful and won't tell me where my car is; he'll only grunt for yes answers. So the car could be in any of these, say, eight spots. How many questions will it take before I can find it? So, let's say: is it on the left? Grunt. Is it on the top? Grunt. Is it on the top left? Grunt.

So what's the entropy of this distribution? Let's calculate it. Remember, we defined the entropy, which I'll call H, as minus the sum over the probabilities times the log of the probability: H = -Σ_i p_i log2(p_i). So what is p_i? In this case, the probability of being in any one of these locations is 1/8, and that's the same for every location in this car park. And so now H is equal to minus the sum from i equals 1 to 8 of 1/8 times log base 2 of 1/8. Now what is that? Remember that 8 equals 2 to the power of 3, so log base 2 of 1/8 is minus 3. So here we have a sum of 1/8 times minus 3, with an overall minus sign. Now we add that up over the eight possibilities and we get 3.
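The same calculation, written as a short sketch (my own illustration, not the lecture's):

```python
import math

def entropy(probs):
    """Entropy in bits, H = -sum_i p_i log2(p_i); terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Eight equally likely parking spots: log2(8) = 3 yes/no questions.
print(entropy([1 / 8] * 8))  # 3.0
```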

Â 5:07

So as we saw, it took three questions to specify our car, and that's exactly the entropy of this distribution.

So now let's go back to our coding sequences. Here are a few different examples; which of these do you think has the most intrinsic capability for encoding? Encoding relies on the ability to generate stimulus-driven variations in the output. If an output has no variation, such as in this case, we're not very optimistic about its ability to encode inputs. So these three sequences differ in their variability. Which do you think has the most inherent coding capacity? We can use the entropy to quantify that variability. So what does having a large entropy do for a code? It gives the most possibility for representing inputs: the more intrinsic variability there is, the more capacity that code has for representation.

So in this simple case, we can compute the entropy as a function of the probability p, where again the other possibility has probability 1 minus p. The entropy is then given by H = -p log2(p) - (1 - p) log2(1 - p).
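As a quick numerical check of this formula (my sketch, not the lecture's), one can scan p over a grid and see where the entropy peaks:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan a grid of probabilities and find where the entropy is largest.
grid = [i / 100 for i in range(101)]
p_max = max(grid, key=binary_entropy)
print(p_max, binary_entropy(p_max))  # 0.5 1.0
```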

So now when one plots that function against P(r+), which here we'll call p, we find that there is a maximum. So what's the value of p at which H has a maximum? It's p equals one half. In that case, the two symbols in this distribution are used equally often. This is a concept we'll come back to at the end of this lecture.

Let's go back to squash.

So, we had the possibility of the ball being anywhere in the court. Generally, you're not able to put the ball anywhere with equal probability, and it's exactly this reduction in possibility that makes it even possible to play. You could model your opponent's p(x), the probability of placing the ball at some location x in the court, and then, to some extent, predict where the ball will be. The lower the entropy of your partner's p(x), the more easily you'll defeat him.

So let's come back, finally, to our spike code. We now appreciate that the entropy tells us about the intrinsic variability of our outputs, but obviously we really need to consider the stimulus and how it's driving those responses. So here's an example: the stimulus can take one of two directions, and each is perfectly encoded by either a spike or no spike. Here's the stimulus, and here's the spiking response; every time there's a rightward stimulus, we get a spike.

So how about this case? We'd probably still be comfortable saying that the response is encoding the stimulus: these two are perfectly correlated. On the other hand, there are several other events that are misfires. In this case, the stimulus occurred with no spike; in this case, there was a spike with no stimulus.

Â 8:11

But how about this? At least at a glance, there seems to be little or no relationship between the responses and the stimulus. Just as a sidebar: what if the problem were not so much that our code is noisy, but that we haven't exactly understood what the code is doing? That is, maybe there's some temporal sequence S that would more appropriately be thought of as the true stimulus. This is really the question that we were addressing in week two: how do we know what our stimulus was? But let's go back to the main question.

What we really want to know is: how much of the variability that we see here in R is actually used for encoding S? We need to incorporate the possibility of error. So let's do that by assuming now that a spike is generally produced in response to stimulus plus, but that there's also some possibility that there will be no spike; we'll quantify that using the error probability q. So the probability of a correct response in this case is 1 minus q, and the probability of an incorrect response is q. And let's assume the same error probability for the silence response. So now we would like to know how much of the entropy of our responses is accounted for by noise, by these errors, because that's going to reduce the response's capacity to encode S.

Â 9:33

The way we can address that is to compute how much of the response entropy can be assigned to the noise. That is, if we give a plus stimulus, a rightward stimulus, we get a variety of responses, and those conditional responses for a fixed s have some entropy of their own; similarly when we give stimulus minus. We call these stimulus-conditioned entropies the noise entropy.

So this brings us to the definition of the mutual information: the amount of information that the response carries about the stimulus. This is given by the total entropy minus the average noise entropy, that is, minus the entropy that the responses r have for some fixed s, averaged over s. That's drawn out here: here's the total entropy of the responses, and here's the conditional entropy, the entropy of the responses conditioned on a particular stimulus s, averaged over s.

So now let's go back to our binomial calculations and see how the mutual information depends on the noise. We'll fix p; we're going to take p to be the one that maximizes the entropy, so p equals one half. Let's vary the noise probability, and again assume that the noise is the same for spike and silence; that is, there is one value q.
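A small sketch of this calculation (mine, using the symmetric-error assumption just described): with p equal to one half, the total response entropy is one bit and the average noise entropy is the binary entropy of q, so the information is 1 - H(q).

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info(q):
    """Mutual information for p = 1/2 and symmetric error probability q."""
    # Total entropy of the responses (1 bit) minus the average noise entropy.
    return h2(0.5) - h2(q)

# Information falls from 1 bit at q = 0 down to 0 bits at q = 1/2 (chance).
for q in (0.0, 0.1, 0.25, 0.5):
    print(q, round(info(q), 3))
```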

Â 10:58

So this should be intuitive. When there's no noise entropy, the information is just the entropy of the response, which in this case is one bit. As the error probability grows larger and larger, spiking is less and less likely to actually represent the stimulus S, and the mutual information decreases. When the error probability reaches a half, that is, when responses occur at chance, there's no mutual information between R and S.

So let's just check that everyone's still on board. More generally, what are the limits? If the response is unrelated to the stimulus, what is the probability of r given s? It's simply the probability of the response, because there's no relationship between response and stimulus. So the noise entropy is equal to the total entropy, and the difference of the response and noise entropies is zero. At the opposite extreme, the response is perfectly predicted by the stimulus, so the noise entropy is zero, and the mutual information is given by the total entropy of the response: all of the response's coding capacity is used in encoding the stimulus.

So let's see how that works for continuous variables. We've talked a lot about binary choices; let's think more generally about cases where we have some continuous response r, with some response variability, encoding a stimulus s. So here's an example where we've given several different stimuli.

Â 12:31

Each of these distributions is the probability of the response given a particular choice of the stimulus, weighted by the probability of that stimulus. And when we add all of these conditional distributions together, we get the full probability, P(r). Now, to compute the mutual information, we first compute the entropy of this blue distribution; that gives us the total entropy. Then we compute the entropy of each of these conditional distributions, and we average them over the stimuli that drove them.

These two cases differ in the amount of intrinsic noise that each response has. When we give a stimulus s in this case, there's some range of variability that takes up some of the range of r.

Â 13:23

In this case, when we give that same stimulus, the noise stretches over a much wider range of the response distribution, so much more of the variability in R is accounted for by variability in the responses to specific stimuli. And so I hope you can see that the first set of response distributions is going to encode much more information about S: the mutual information between S and R is much larger in that case than in this one.
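Here's a discretized sketch of that comparison (an illustration of mine, with made-up numbers): two equally likely stimuli, a binned response, and two sets of conditional distributions, one narrow and one broad.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_info(p_s, p_r_given_s):
    """Total response entropy minus the average (stimulus-conditioned) noise entropy."""
    n_bins = len(p_r_given_s[0])
    p_r = [sum(p_s[i] * p_r_given_s[i][j] for i in range(len(p_s))) for j in range(n_bins)]
    noise = sum(p_s[i] * entropy(p_r_given_s[i]) for i in range(len(p_s)))
    return entropy(p_r) - noise

p_s = [0.5, 0.5]
# Narrow conditionals: each stimulus claims its own part of the response range.
narrow = [[0.9, 0.1, 0.0, 0.0],
          [0.0, 0.0, 0.1, 0.9]]
# Broad conditionals: the two response distributions overlap heavily.
broad = [[0.4, 0.3, 0.2, 0.1],
         [0.1, 0.2, 0.3, 0.4]]
print(round(mutual_info(p_s, narrow), 3), round(mutual_info(p_s, broad), 3))
```

With these particular numbers the narrow case comes out to a full bit, while the heavily overlapping case retains only a small fraction of a bit.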

Let's play a little bit with these distributions, because I want to demonstrate a couple of things that I think really illustrate why information is a useful and intuitive measure of the relationship between two variables. I'm using capital letters to denote the random variables, and lower-case letters to denote specific samples from those random variables. What I'd like to show you is that the information quantifies how far from independent the two random variables R and S are.

To demonstrate that, I'm going to use the Kullback–Leibler divergence, a measure of the difference between probability distributions that we introduced earlier. If the mutual information measures independence, then we'd like to quantify the difference between the joint distribution of R and S and the distribution these two variables would have if they were independent; that is, the joint distribution that would simply be the product of their marginal distributions.

So first, to refresh your memory, let's redefine D_KL. The divergence between two different probability distributions, say P and Q, is an integral over P(x) times the log of P(x) over Q(x): D_KL(P, Q) = ∫ dx P(x) log[P(x) / Q(x)].

So now let's apply that to these two distributions. Let's compute it: we have an integral over ds and dr of the joint distribution, times the log of the joint distribution divided by the product of the marginal distributions:

D_KL = ∫ ds dr P(s, r) log[ P(s, r) / (P(s) P(r)) ].

Now we can rewrite the joint distribution using the conditional distribution, in the following form: P(r | s) times P(s) is just equivalent to the joint distribution, divided here by P(r) P(s). And now you can see that P(s) cancels out, and when we expand the log we can rewrite this as the difference of two terms:

D_KL = -∫ ds dr P(s, r) log P(r) + ∫ ds dr P(s) P(r | s) log P(r | s),

where in the second term we've again divided the joint distribution into its conditional and marginal parts. Now let's look at the terms we've developed here. In the first term, we can integrate the s part out of the joint distribution, and what remains is simply the entropy of P(r). The second term is minus the entropy of P(r | s), averaged over s with weight P(s).

And so what I've shown you is that this form, in terms of the Kullback–Leibler divergence, gives us back the form that we've already seen: the entropy of the responses, minus the average over the stimuli of the noise entropy for a given stimulus.

What I hope you realize is that everything we've done here in terms of response and stimulus we could simply flip: exchange response and stimulus, redo the same calculation, and instead end up with the entropy of the stimulus minus an average over the responses of the entropy of the stimulus given the response. So information is completely symmetric in the two variables it's computed between: the mutual information between response and stimulus is the same as the mutual information between stimulus and response.
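Both identities can be checked numerically on a small made-up joint distribution (my sketch, not the lecture's):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def d_kl(p, q):
    """Kullback-Leibler divergence in bits between two aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A hypothetical joint distribution P(s, r): rows are stimuli, columns responses.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_s = [sum(row) for row in joint]
p_r = [sum(col) for col in zip(*joint)]

# Mutual information as D_KL between the joint and the product of the marginals.
flat_joint = [joint[s][r] for s in range(2) for r in range(2)]
flat_prod = [p_s[s] * p_r[r] for s in range(2) for r in range(2)]
info_kl = d_kl(flat_joint, flat_prod)

# The same number from entropies, in either direction: H(R) - <H(R|s)> and H(S) - <H(S|r)>.
info_rs = entropy(p_r) - sum(p_s[s] * entropy([joint[s][r] / p_s[s] for r in range(2)])
                             for s in range(2))
info_sr = entropy(p_s) - sum(p_r[r] * entropy([joint[s][r] / p_r[r] for s in range(2)])
                             for r in range(2))
print(round(info_kl, 4), round(info_rs, 4), round(info_sr, 4))  # all three agree
```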

Â 18:12

So here's our grandma's famous mutual information recipe. What we're going to do to compute this mutual information is to take a stimulus s and repeat it many times; that will give us the probability of the responses given s. We're going to compute the variability due to the noise, that is, the noise entropy of these responses: for a given value of s, we compute the entropy of the responses for that s, we repeat this for all s, and then we average over s. Finally, we compute the probability of the responses, which is just given by averaging, over all the stimuli that we presented, the probability of the response given the stimulus; and that will give us the total entropy of the responses.
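The whole recipe, run on simulated repeats of our binary example (a sketch of mine; the trial count and the error probability q = 0.1 are assumptions for illustration):

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

random.seed(0)
q = 0.1          # assumed error probability
n_trials = 10000
counts = {(s, r): 0 for s in (+1, -1) for r in (0, 1)}

# Present each stimulus many times and tally the responses.
for _ in range(n_trials):
    s = random.choice([+1, -1])
    correct = 1 if s == +1 else 0            # spike for plus, silence for minus
    r = correct if random.random() > q else 1 - correct
    counts[(s, r)] += 1

# Noise entropy of the estimated P(r|s) for each s, averaged over s.
p_s = {s: (counts[(s, 0)] + counts[(s, 1)]) / n_trials for s in (+1, -1)}
noise = sum(p_s[s] * entropy([counts[(s, r)] / (counts[(s, 0)] + counts[(s, 1)])
                              for r in (0, 1)])
            for s in (+1, -1))

# Total entropy of P(r), from the stimulus-averaged response probabilities.
p_r = [sum(counts[(s, r)] for s in (+1, -1)) / n_trials for r in (0, 1)]
print(round(entropy(p_r) - noise, 3))  # close to the exact value 1 - H(0.1) ≈ 0.531
```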

So, in the next section, we'll be applying that idea to calculating information in spike trains. There will be two methods that we work with: one will start by calculating information in spike patterns, and then we'll calculate information in single spikes.
