Why is the channel capacity C related to the mutual information between X and Y? Consider the block diagram for the communication system and observe that the random variables W, X, Y, and W hat form a Markov chain. Now consider the information diagram for this Markov chain. Since we assume that the encoder is deterministic, the entropy of the X sequence given W is equal to zero, as shown in the information diagram. Likewise, we assume that the decoder is deterministic, so that the entropy of W hat given the Y sequence is equal to zero, again as shown in the diagram. Now, for reliable communication, W, the message, and W hat, the estimate of the message, should be essentially the same, so they are essentially functions of each other. That is, the entropy of W hat given W is equal to zero, and the entropy of W given W hat is equal to zero, both also shown in the diagram. So we see that the setup of the problem imposes a lot of zeros in the information diagram. From the diagram, the entropy of W is equal to the measure of the atom marked by the red dot. Likewise, the mutual information between the input sequence and the output sequence of the channel is also equal to the measure of the atom marked by the red dot. Therefore H(W) is equal to I(X;Y). That is, the entropy of the message is equal to the mutual information between the input sequence and the output sequence of the channel. This suggests that the channel capacity is obtained by maximizing the mutual information between the input sequence and the output sequence, and we are going to see that this is equivalent to maximizing I(X;Y), the mutual information between the input and the output of a single use of the channel. We now look at the building blocks of the converse proof.
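The constraints read off the information diagram, and the conclusion drawn from them, can be summarized in symbols as follows (writing the input and output sequences of the channel in boldface, a notational choice made here for clarity):

```latex
% Markov chain underlying the communication system
W \to \mathbf{X} \to \mathbf{Y} \to \hat{W}

% Deterministic encoder and decoder
H(\mathbf{X} \mid W) = 0, \qquad H(\hat{W} \mid \mathbf{Y}) = 0

% Reliable communication: W and \hat{W} essentially determine each other
H(\hat{W} \mid W) = 0, \qquad H(W \mid \hat{W}) = 0

% The one remaining atom (the red dot) carries both quantities
H(W) = I(\mathbf{X}; \mathbf{Y})
```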
First observe that for all i from 1 up to n, the mutual information between X_i and Y_i is, by definition, less than or equal to C, because C is the maximum of I(X;Y) over all input distributions p(x). Then summation i equals 1 up to n of I(X_i;Y_i) is less than or equal to n times C. As will be established in Lemma 7.16, the mutual information between the input sequence X and the output sequence Y is upper bounded by summation i equals 1 up to n of I(X_i;Y_i). Therefore, one over n times log M, which is the rate of the code, is equal to one over n times log of the size of the message set. Because the message is chosen uniformly, this is equal to one over n times the entropy of W. As we have seen, for reliable communication, this is approximately equal to one over n times the mutual information between the input sequence X and the output sequence Y, which from the above is upper bounded by one over n times summation i equals 1 up to n of I(X_i;Y_i), and that is upper bounded by C. So essentially we have the ingredients for proving that, for reliable communication, the rate of the code cannot exceed C, the channel capacity. We now prove Lemma 7.16, which asserts that the mutual information between the X sequence and the Y sequence is upper bounded by summation i equals 1 up to n of the mutual information between X_i and Y_i. First, from the previous proposition, the entropy of the Y sequence given the X sequence is equal to summation i equals 1 up to n of the entropy of Y_i given X_i. Then the mutual information between the X sequence and the Y sequence, which is equal to the entropy of the Y sequence minus the entropy of the Y sequence given the X sequence, is, by the independence bound, less than or equal to summation i equals 1 up to n of the entropy of Y_i minus summation i equals 1 up to n of the entropy of Y_i given X_i. And this is equal to summation i equals 1 up to n of the mutual information between X_i and Y_i.
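As a numerical illustration of these two building blocks, here is a short Python sketch. The binary symmetric channel, its crossover probability, and the correlated input distribution q are choices made here for the example, not from the lecture. The first part computes C = max over p(x) of I(X;Y) by grid search; the second part checks Lemma 7.16 for n = 2 uses of the channel.

```python
import itertools, math

def mutual_information(joint):
    """I(X;Y) in bits, from a joint pmf given as a dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def bsc_capacity(eps, grid=1001):
    """C = max_{p(x)} I(X;Y) for a binary symmetric channel, by grid search."""
    best = 0.0
    for i in range(grid):
        a = i / (grid - 1)          # P(X = 0)
        joint = {(0, 0): a * (1 - eps), (0, 1): a * eps,
                 (1, 0): (1 - a) * eps, (1, 1): (1 - a) * (1 - eps)}
        best = max(best, mutual_information(joint))
    return best

eps = 0.1
h2 = -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)
print(bsc_capacity(eps), 1 - h2)   # grid search recovers C = 1 - h2(eps)

# Lemma 7.16 check for n = 2: correlated inputs (X1, X2) fed into two
# independent uses of the BSC; q is an (arbitrary) joint input pmf.
q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
def chan(x, y):
    return (1 - eps) if x == y else eps
joint_seq = {((x1, x2), (y1, y2)): q[(x1, x2)] * chan(x1, y1) * chan(x2, y2)
             for x1, x2, y1, y2 in itertools.product((0, 1), repeat=4)}
j1, j2 = {}, {}                     # single-use joint pmfs (X_i, Y_i)
for ((x1, x2), (y1, y2)), p in joint_seq.items():
    j1[(x1, y1)] = j1.get((x1, y1), 0.0) + p
    j2[(x2, y2)] = j2.get((x2, y2), 0.0) + p
lhs = mutual_information(joint_seq)
rhs = mutual_information(j1) + mutual_information(j2)
print(lhs <= rhs + 1e-12)           # I(X;Y) <= sum_i I(X_i;Y_i)
```

With correlated inputs the inequality of Lemma 7.16 is strict, since the outputs Y_1 and Y_2 are then dependent and the independence bound on H(Y) is not tight.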
We now give the formal proof of the converse of the channel coding theorem. The key idea is to apply Fano's inequality to make rigorous the approximation that we have used in the argument. Let R be an achievable rate. That is, for any epsilon greater than zero, for sufficiently large n there exists an (n,M) code such that 1 over n log M, which is the rate of the code, is greater than R minus epsilon, and lambda_max is less than epsilon. Now, consider log M equals the entropy of W, because the message is chosen uniformly from the message set. The entropy of W is equal to the entropy of W given W hat plus the mutual information between W and W hat. By the data processing theorem, the mutual information between W and W hat is upper bounded by the mutual information between the X sequence and the Y sequence. By Lemma 7.16, the mutual information between the X sequence and the Y sequence is upper bounded by summation i equals 1 up to n of the mutual information between X_i and Y_i, which is at most n times C. By Fano's inequality, the conditional entropy of W given W hat, that is, the first term above, is less than one plus P_e, the average probability of error, that is, the probability that W is not equal to W hat, times log of the size of the message set. And this is equal to one plus P_e times log M. Then, from step 2, we have log M less than or equal to H(W|W hat) plus nC, where by Fano's inequality, H(W|W hat) is less than one plus P_e times log M. Now P_e, the average probability of error, as we have seen, is less than or equal to lambda_max, and by our assumption on the code, lambda_max is less than epsilon. Moving the term epsilon times log M to the left hand side, we have one minus epsilon times log M less than one plus nC. And moving one minus epsilon to the right hand side, we have log M less than one plus nC, divided by one minus epsilon. Upon dividing by n on both sides, we have one over n times log M less than one over n plus C divided by one minus epsilon. Therefore, by our assumption on the code, we have R minus epsilon less than one over n times log M, and one over n times log M is less than one over n plus C divided by one minus epsilon. So we obtain R minus epsilon less than one over n plus C divided by one minus epsilon. To complete the proof, we let n go to infinity so that one over n goes to zero, and then let epsilon go to zero, to conclude that R is less than or equal to C.

We now discuss a corollary that gives a lower bound on P_e, the average probability of error, when n is large. Consider the highlighted inequality obtained in step 4 of the proof of the converse coding theorem. From this inequality, we obtain P_e greater than or equal to one minus (one plus nC) divided by log M. Multiplying both the numerator and the denominator of the fraction by one over n, we obtain one minus (one over n plus C) divided by one over n times log M. When n is large, the one over n in the numerator is approximately equal to zero, so this can be approximated by one minus C divided by one over n times log M, when n is large. This lower bound on P_e is illustrated by the figure on the left hand side. We now give an asymptotic analysis of P_e when n is large. As we have shown, for large n, P_e is lower bounded by one minus C divided by one over n times log M, where one over n times log M is the rate of the channel code. This lower bound on P_e says that if one over n times log M is greater than C, so that C divided by one over n times log M is less than one, then P_e is bounded away from zero for large n. This is illustrated in the figure on the left hand side. If we take the rate of the code to be strictly greater than C, then P_e, the average probability of error of the code, when the block length n is large, can only take a value in the range colored in blue, which is bounded away from zero. This asymptotic lower bound on P_e also implies that if the rate of the code is greater than C, then P_e is strictly positive for all n. For the details, please see the textbook. Also note that this asymptotic lower bound on P_e tends to one as one over n times log M, that is, the rate of the code, tends to infinity.
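The chain of inequalities in the converse proof can be written out compactly as follows (boldface again denoting the input and output sequences, a notational choice made here):

```latex
\begin{align*}
\log M &= H(W) = H(W \mid \hat{W}) + I(W;\hat{W}) \\
       &\le H(W \mid \hat{W}) + I(\mathbf{X};\mathbf{Y})
          && \text{(data processing theorem)} \\
       &\le H(W \mid \hat{W}) + \sum_{i=1}^{n} I(X_i;Y_i)
          && \text{(Lemma 7.16)} \\
       &\le H(W \mid \hat{W}) + nC
          && (I(X_i;Y_i) \le C) \\
       &< 1 + P_e \log M + nC
          && \text{(Fano's inequality)} \\
       &< 1 + \epsilon \log M + nC
          && (P_e \le \lambda_{\max} < \epsilon)
\end{align*}
Rearranging gives $(1-\epsilon)\log M < 1 + nC$, so that
\[
  R - \epsilon < \frac{1}{n}\log M < \frac{1}{n} + \frac{C}{1-\epsilon},
\]
and letting $n \to \infty$ and then $\epsilon \to 0$ yields $R \le C$.
```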
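The corollary's lower bound on P_e can be evaluated numerically. In this small Python sketch the capacity value C = 0.5 and the rates used below are illustrative choices, not values from the lecture, and the function names and the clamping at zero are our own:

```python
def pe_lower_bound(C, rate, n):
    """Finite-n bound from the converse: P_e >= 1 - (1/n + C) / rate,
    where rate = (1/n) * log M.  Clamped at 0 since P_e is a probability."""
    return max(0.0, 1.0 - (1.0 / n + C) / rate)

def pe_asymptotic_bound(C, rate):
    """Large-n approximation of the bound: P_e >= 1 - C / rate."""
    return max(0.0, 1.0 - C / rate)

C = 0.5                           # illustrative capacity, bits per channel use
for rate in (0.75, 1.0, 2.0):     # rates strictly greater than C
    for n in (10, 100, 10000):
        print(rate, n, pe_lower_bound(C, rate, n))
    print(rate, "asymptotic:", pe_asymptotic_bound(C, rate))
```

For any rate strictly above C the asymptotic bound is strictly positive (for example, 1 - 0.5/0.75 = 1/3 at rate 0.75), the finite-n bound approaches it as n grows, and the bound tends to one as the rate tends to infinity, matching the discussion above.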