Section 2.5. Informational Divergence.

Definition 2.28. The informational divergence between two probability distributions p and q on a common alphabet X is defined as

    D(p||q) = Σ_x p(x) log( p(x)/q(x) ) = E_p[ log( p(X)/q(X) ) ],

where E_p denotes expectation with respect to the distribution p. In this definition we adopt the following conventions. First, the summation is over the support of p; that is, Σ_x stands for Σ_{x in S_p}. Second, for any constant c > 0, c log(c/0) = ∞. Therefore, if D(p||q) < ∞, then p(x) > 0 implies q(x) > 0; in other words, the support of p is a subset of the support of q.

The divergence D(p||q) measures the "distance" between p and q. Note, however, that D(p||q) is not symmetric in p and q, and it does not satisfy the triangle inequality, so D is not a true metric.

Lemma 2.29 (Fundamental Inequality). This is an extremely important inequality in information theory. It says that for any a > 0,

    ln a ≤ a − 1,

with equality if and only if a = 1, where ln denotes the natural logarithm. Instead of giving a proof, the inequality can be seen from a simple plot of ln a against a: the line a − 1 touches the curve ln a at a = 1 and lies above it everywhere else.

Corollary 2.30 (of the Fundamental Inequality). For any a > 0,

    ln a ≥ 1 − 1/a,

with equality if and only if a = 1.

Proof of the corollary: let a = 1/b in the fundamental inequality ln a ≤ a − 1, where b > 0, so that a = 1/b is also positive.
By the fundamental inequality, ln(1/b) ≤ 1/b − 1, i.e., −ln b ≤ 1/b − 1, i.e., ln b ≥ 1 − 1/b, which is the inequality to be proved. Equality holds if and only if a = 1/b = 1, that is, b = 1.

Theorem 2.31 (Divergence Inequality). This is a very important inequality. It says that for any two probability distributions p and q on a common alphabet X,

    D(p||q) ≥ 0,

with equality if and only if p = q.

Proof: for simplicity, assume that S_p = S_q (for a proof without this assumption, see the textbook). Consider

    D(p||q) = Σ_{x in S_p} p(x) log( p(x)/q(x) )
            = (log e) Σ_{x in S_p} p(x) ln( p(x)/q(x) )         [converting to the natural logarithm]
            ≥ (log e) Σ_{x in S_p} p(x) ( 1 − q(x)/p(x) )       [by Corollary 2.30]
            = (log e) ( Σ_{x in S_p} p(x) − Σ_{x in S_p} q(x) ) [the p(x) cancels in the second term]
            = (log e) ( Σ_{x in S_p} p(x) − Σ_{x in S_q} q(x) ) [by the assumption S_p = S_q]
            = (log e)(1 − 1)
            = 0.

For equality to hold in the step invoking Corollary 2.30, we see from the corollary that this is the case if and only if p(x)/q(x) = 1, i.e., p(x) = q(x), for all x in S_p. This proves the theorem.

Theorem 2.32, called the Log-Sum Inequality, is a very useful inequality that can be proved by the divergence inequality.
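As an aside, both the fundamental inequality and the divergence inequality are easy to check numerically. A minimal sketch in Python (the function name `kl_divergence` and the particular distributions are illustrative choices, not from the text; the divergence is computed in bits):

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits. Conventions from Definition 2.28: the sum runs over
    the support of p, and c*log(c/0) = infinity for c > 0."""
    d = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf  # support of p not contained in support of q
            d += px * math.log2(px / qx)
    return d

# Fundamental inequality (Lemma 2.29): ln a <= a - 1 for all a > 0.
for a in [0.1, 0.5, 1.0, 2.0, 10.0]:
    assert math.log(a) <= a - 1

# Divergence inequality (Theorem 2.31): D(p||q) >= 0, equality iff p = q.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))  # strictly positive, since p != q
print(kl_divergence(p, p))  # exactly 0
```

Note that D(p||q) and D(q||p) generally differ, which is the asymmetry remarked on above.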
The log-sum inequality states that for positive numbers a_1, a_2, ... and nonnegative numbers b_1, b_2, ... such that Σ_i a_i < ∞ and 0 < Σ_i b_i < ∞,

    Σ_i a_i log( a_i/b_i ) ≥ ( Σ_i a_i ) log( Σ_i a_i / Σ_i b_i ),

with the convention that log( a_i/0 ) = ∞. Moreover, equality holds if and only if a_i/b_i is equal to a constant for all i.

To illustrate the log-sum inequality, the simplest case reads

    a_1 log( a_1/b_1 ) + a_2 log( a_2/b_2 ) ≥ ( a_1 + a_2 ) log( (a_1 + a_2)/(b_1 + b_2) ).

Now we prove the log-sum inequality. First, let a_i' = a_i / Σ_j a_j and b_i' = b_i / Σ_j b_j; that is, we normalize the a_i into a_i' and the b_i into b_i'. Then {a_i'} and {b_i'} are probability distributions. Using the divergence inequality, we have

    0 ≤ Σ_i a_i' log( a_i'/b_i' )
      = Σ_i ( a_i / Σ_j a_j ) log( (a_i / Σ_j a_j) / (b_i / Σ_j b_j) )
      = (1/Σ_j a_j) Σ_i a_i [ log( a_i/b_i ) − log( Σ_j a_j / Σ_j b_j ) ]
      = (1/Σ_j a_j) [ Σ_i a_i log( a_i/b_i ) − Σ_i a_i log( Σ_j a_j / Σ_j b_j ) ],

where we have moved the constant 1/Σ_j a_j outside the summation over i (since Σ_j a_j does not depend on i) and then split the summation into two. Now the term log( Σ_j a_j / Σ_j b_j ) does not depend on i.
So in the next step we factor it out, writing the second summation as ( Σ_i a_i ) log( Σ_j a_j / Σ_j b_j ). Multiplying through by Σ_j a_j, which is positive, then yields the log-sum inequality. Equality holds if and only if a_i' = b_i' for all i, i.e., a_i/b_i is a constant. The theorem is proved.

We have seen that the divergence inequality implies the log-sum inequality. It can be shown that the log-sum inequality also implies the divergence inequality; this is left as an exercise. Thus, the two inequalities are in fact equivalent.

The next inequality, called Pinsker's Inequality, relates the divergence between two probability distributions p and q to the variational distance V(p, q) between them. It says that

    D(p||q) ≥ (1/(2 ln 2)) V(p, q)^2;

that is, the divergence between p and q is lower bounded by 1/(2 ln 2) times the square of the variational distance between p and q. Consequently, if either D(p||q) or D(q||p) is small, then so is the variational distance V(p, q). In particular, for a sequence of probability distributions q_k, as k tends to infinity, if either D(p||q_k) or D(q_k||p) tends to 0, then the variational distance V(p, q_k) also tends to 0. In other words, convergence in divergence is a stronger notion than convergence in variational distance. For details, see Problems 23 and 24.
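Both the log-sum inequality and Pinsker's inequality can likewise be checked numerically. A small sketch in Python (the numbers a_i, b_i and the distributions p, q are arbitrary examples; the divergence is taken in bits, matching the 1/(2 ln 2) constant):

```python
import math

def kl_bits(p, q):
    """D(p||q) in bits, summing over the support of p."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def variational_distance(p, q):
    """V(p, q) = sum_x |p(x) - q(x)|."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

# Log-sum inequality (Theorem 2.32) on arbitrary positive numbers:
a = [3.0, 1.0, 4.0]
b = [2.0, 2.0, 1.0]
lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log(sum(a) / sum(b))
assert lhs >= rhs

# Equality case: a_i / b_i constant for all i.
a2, b2 = [2.0, 4.0], [1.0, 2.0]
lhs2 = sum(ai * math.log(ai / bi) for ai, bi in zip(a2, b2))
rhs2 = sum(a2) * math.log(sum(a2) / sum(b2))
assert math.isclose(lhs2, rhs2)

# Pinsker's inequality: D(p||q) >= V(p,q)^2 / (2 ln 2).
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
d = kl_bits(p, q)
v = variational_distance(p, q)
assert d >= v * v / (2 * math.log(2))
```

The last assertion is the content of Pinsker's inequality: a small divergence forces a small variational distance, while the converse need not hold.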