0:00

Entropy is a mathematical way to talk about uncertainty, to quantify the uncertainty associated with a probability distribution.

Before we move into information theory, I just want to say that there are two ways you can think about a distribution with high entropy.

So let's say the entropy of the distribution over some message x is large.

First of all, you can think of this as a bad thing. It's bad because it is very uncertain; it has a high entropy.

0:51

Well, a distribution with a high entropy has more possible states. So, if you are to sample, or to see, one of those states, you can learn a lot more about the system.

For example, what's more informative: a message that can only be a one or a zero, or a message that can be an entire string of words?

1:15

Certainly the entire string of words is more informative.

Because of how we defined entropy, the distribution over that string of words, over the possible states that string of words could be in, has a very high entropy compared to the one-zero message. And that means that it is more capable of being informative.

1:41

And this is what we're interested in when we study information theory. We're interested in using entropy to talk about how informative a message can be.

So in general, information theory can be thought of as the study of how the entropy, or the uncertainty, of a distribution changes when you receive a message.

So here is an example. Let's say one of your friends comes by, but you are in the shower, so you cannot come down to answer the door. They knock, you can't answer, so they leave a note.

Let's say you have three friends, like me, which we'll denote by the random variable F. So you have Billy, you have Carol, and you have Francois. Those are your three friends.

Before you read the note they left on your door, there is an equal probability that it could have been any one of your friends, and that probability is one-third.

2:49

So, this is the probability distribution over F before you receive the note.

But then you go downstairs and read the note, and you know that Billy and Carol only speak English and Francois only speaks French. And when you read the note, you see that it is in English.

This changes the distribution over which friend it could have been knocking on your door. Since the note was in English, you know it could not have been Francois.

Therefore, the probability of Billy and the probability of Carol both go up to 0.5, and the probability of Francois goes down to 0.

So this is the distribution over F, given that the note was in English.

So let's calculate the entropy of these distributions.

Well, the entropy of the first one, before you read the note, is H(F), and that's log base 2 of the number of states, 3, which is about 1.6.

And the entropy of the second distribution, the conditional entropy once you've read that the note was in English, is the log of the number of remaining states. Well, now there are effectively just two states, two possible options, so log base 2 of 2 is 1.

So the entropy decreases once you receive a message, which fits nicely with our idea that receiving a message decreases your uncertainty about the situation.

What would the entropy have been if you had read that the note was in French? In that case, you would have known that the only possible person your friend could have been was Francois. So there would have been effectively one possible state, and the entropy would have gone to zero.

So, in general, when you receive a message, your entropy decreases.
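The note example can be worked out numerically. This is a minimal sketch, not from the lecture itself, using the standard Shannon entropy formula in bits:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Prior over F: Billy, Carol, Francois are equally likely.
prior = [1/3, 1/3, 1/3]

# Posterior after reading an English note: Francois is ruled out.
posterior_english = [0.5, 0.5, 0.0]

# Posterior after a French note: only Francois remains.
posterior_french = [0.0, 0.0, 1.0]

print(entropy(prior))              # about 1.585 bits (log2 of 3)
print(entropy(posterior_english))  # 1.0 bit
print(entropy(posterior_french))   # 0.0 bits
```

Either message lowers the entropy from log2(3) to a smaller value; the French note happens to remove all uncertainty.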

Let's look at a neuroscience example, so example two. In this case, let's say you're wondering how cold it is outside.

Before you go outside, you know that there is some probability distribution over what the temperature is. Maybe it's springtime, so the mean of that temperature is 60 degrees Fahrenheit.

5:18

This is before you go outside. But then you go outside, and your sensory neurons send you a message, R, which could be a spike train, for example, or a group of spike trains.

Once you receive that message, your certainty about what the temperature is increases. So maybe they send you the message that it's actually pretty hot out. You still don't know exactly what the temperature is, but now you're certain that it's centered around 80 degrees maybe; it's a hot day in the spring.

This is the distribution over temperatures given the spike train R, the message R.

Now, I won't calculate the entropy exactly, but hopefully you can see that since P of T given R is a lot narrower than P of T, the entropy of the prior temperature distribution is much greater than the entropy of the conditional temperature distribution. So in this case, you gain information about the temperature from the message.

And we call this conditional entropy, the entropy of the temperature given the neural response, the noise entropy, because even after you've gotten all the information you could from your neurons, your knowledge of the temperature is still not perfect. There will still be some noise, so we call it the noise entropy.
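The "narrower means lower entropy" intuition can be made concrete with Gaussian distributions. This is a sketch with made-up widths, not numbers from the lecture, using the standard differential entropy of a Gaussian:

```python
import math

def gaussian_entropy_bits(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma, in bits."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

# Hypothetical widths: a broad prior over temperature versus a
# narrower conditional distribution after the neural message arrives.
sigma_prior = 15.0      # p(T): spring temperatures spread widely
sigma_posterior = 3.0   # p(T | R): narrower once the neurons report

h_prior = gaussian_entropy_bits(sigma_prior)
h_posterior = gaussian_entropy_bits(sigma_posterior)

# The narrower conditional distribution has lower entropy; the
# difference works out to log2(sigma_prior / sigma_posterior) bits.
print(h_prior - h_posterior)  # about 2.32 bits gained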

So it seems like this intuitive idea of information is closely tied to the change in uncertainty that results from receiving a message.

7:02

How would we write the change in uncertainty after you receive a message, the change in entropy?

So, let's say that we'll stay in neuroscience land for now. S is your stimulus, and R is your response. And the stimulus can be a scalar or a vector, and the response can be a firing rate, or a spike train, or a whole pattern of spikes, or whatever you like. But S is the stimulus and R is the response.

And so at the start, you have some distribution, the probability that your random variable is equal to a specific value, little s. And after you get the response, you have a conditional distribution. So this is the same thing, but conditioned on the fact that your response was equal to little r, and each one of these has an entropy.

Remember, the entropy takes as input a distribution and produces as output just a number.

So there's an entropy of the original stimulus distribution, H(S). And there's an entropy of the conditional stimulus distribution, H(S) given the fact that you measured response little r.

And the amount that the entropy decreases is simply H(S) minus H(S given r).

However, as we saw with our Francois and Billy and Carol case, different messages could have yielded different conditional distributions, and therefore different conditional entropies, different noise entropies.

So, in order to get a very general quantity that talks about the entire distribution of S and R, rather than the distribution of S given a single r, we are just going to take the average decrease in entropy, where the average is taken over all of the things that the message, the response, could have been.

So we write that the information between S and R is equal to the original entropy minus the average, the expected value, of the conditional entropy, of the noise entropy. And that average is taken with respect to all of the possible things the response, the message, could have been.

And this value is called the mutual information. And the information about S from R takes as input the joint distribution over S and R and outputs just a number.
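This average decrease in entropy can be computed directly from a joint distribution. Here is a minimal sketch with a made-up two-by-two joint table, not data from the lecture:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical joint distribution P(s, r): rows are stimuli, columns responses.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_s = [sum(row) for row in joint]        # marginal P(s)
p_r = [sum(col) for col in zip(*joint)]  # marginal P(r)

# H(S): entropy of the prior stimulus distribution.
h_s = entropy(p_s)

# E_r[ H(S | R = r) ]: the noise entropy, averaged over responses.
avg_noise = 0.0
for j, pr in enumerate(p_r):
    cond = [joint[i][j] / pr for i in range(len(joint))]  # P(s | r)
    avg_noise += pr * entropy(cond)

mutual_info = h_s - avg_noise
print(mutual_info)  # about 0.278 bits for this table
```

Note that the calculation uses only the joint table itself; no encoding model appears anywhere.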

Â 9:36

And that number tells you something about how much you can learn about S if you listen to the message R.

And so an important thing to realize is that when you calculate the information, you don't need a model like a linear filter system, or a GLM, or anything like that. When you calculate the information, you just ask: how much could the response tell you about the stimulus, regardless of the mechanism by which the response is actually encoding the stimulus?

So, the mutual information doesn't depend on a coding model. In other words, it's just a good way to characterize the system if we know the joint distribution over response and stimulus.

Now, maybe you've noticed, but there's a very nice symmetry in the mutual information.

So when we defined it, what did we say: the information about the stimulus given the response was equal to the entropy of the stimulus distribution minus the expected value of the entropy of the conditional distribution, where, just recall, that expected value is taken with respect to the probability of each possible response.

11:44

We also have the probability of R given s, and we have a joint distribution, so it's very symmetrical.

When we say the information about the stimulus, given the response, which is what we've been talking about so far, we're talking about how much the uncertainty, or the spread, of the stimulus distribution shrinks when we receive a message, how much information the response carries about the stimulus, averaged over all of the different responses you could have gotten.

But you can also think of it the other way around. Before you even see a stimulus, there is some distribution of responses, some prior distribution of responses.

Â 12:32

When the stimulus comes, that distribution changes. So maybe before you go outside and feel the temperature, there are 10,000 different messages that your sensory neurons could be sending you. But once you go out and feel the temperature, and say it's 81 degrees, that distribution over possible messages narrows.

Â 13:10

If you just knew the response, you could take a guess about the stimulus. However, if you just knew the stimulus, you could also take a guess about the response. This will happen if the two are correlated, which hopefully they are, if your neurons are sending you important information.

So how would we write out the information of the response given the stimulus?

Well, it's the same thing we had before, but with the variables flipped. So we have the entropy of the response distribution minus the expected conditional entropy of the response distribution, given that a stimulus was presented. And we average over all the different stimuli that could be presented.

So these are very, very related quantities, and they are so related, in fact, that they are the same. The mutual information about the stimulus from the response is equal to the mutual information about the response from the stimulus.

So this is a very, very cool thing. This means that H(S) minus the average value of H(S) given a certain response is equal to H(R) minus the expected value of the entropy of the response given a certain stimulus.

Â 14:42

So, there's a very beautiful symmetry in how we calculate the information. And all that this tells us is that our definitions of response and stimulus are rather arbitrary. You can consider a spike train to be the stimulus, and whatever caused that spike train to be the response. It's the same mathematically.
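This symmetry can be checked numerically. Below is a sketch with a hypothetical joint distribution, computing the information in each direction by the same recipe and confirming the two come out equal:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def info(joint):
    """H(row variable) minus the expected entropy of rows given each column."""
    p_row = [sum(r) for r in joint]
    p_col = [sum(c) for c in zip(*joint)]
    avg_noise = sum(
        pc * entropy([joint[i][j] / pc for i in range(len(joint))])
        for j, pc in enumerate(p_col) if pc > 0
    )
    return entropy(p_row) - avg_noise

# A made-up joint distribution P(s, r).
joint = [[0.3, 0.2],
         [0.1, 0.4]]

# Transposing the table swaps the roles of stimulus and response: P(r, s).
transposed = [list(c) for c in zip(*joint)]

# I(S; R) computed from either direction gives the same number.
print(abs(info(joint) - info(transposed)) < 1e-9)  # True
```

Swapping which variable we call "stimulus" is literally just transposing the joint table, which matches the point that the labels are arbitrary.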

Importantly, however, sometimes one of these quantities is easier to calculate than the other. So sometimes it's easier to calculate these guys, and sometimes it's easier to calculate those guys.

Â 15:37

So, first we calculate the entropy of the full response distribution. And second, we calculate the entropy of the response for a certain stimulus, and we do that for a bunch of stimuli, and then take the average over those stimuli, weighted by their probabilities.

So this turns out to be a very useful calculation that tells us how much we can learn about one variable given the value of the other variable. It tells us how correlated two random variables are: how correlated the temperature outside is to a pattern of spikes in your sensory neurons, how correlated the motion of a rat's whisker is to the signal that gets sent to its barrel cortex.

In general, it's a very useful quantity to be able to calculate and understand, because it gives us a model-free way of talking about how much two probabilistic quantities convey about one another.