Grocery on Distributed Algorithms

T1: Basics about Entropy

by Xianbin · 10 min read

This post gives some basic background on entropy.

Conditional Entropy

\[\textbf{H}(Y\mid X) = \sum_{x\in X} p(x)\, \textbf{H}(Y\mid X=x) = - \sum_{x\in X}p(x) \sum_{y\in Y}p(y\mid x) \log p(y\mid x) \\= -\sum_{x\in X}\sum_{y\in Y} p(x,y) \log p(y\mid x) = - \mathbb{E}\big(\log p(Y \mid X)\big)\]
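
To make the definition concrete, here is a minimal Python sketch (the joint distribution and helper names are my own made-up example) that computes \(\textbf{H}(Y\mid X)\) both as \(\sum_{x} p(x)\,\textbf{H}(Y\mid X=x)\) and as \(-\mathbb{E}[\log p(Y\mid X)]\) and checks that they agree:

```python
from math import log2

# Hypothetical joint distribution p(x, y) over two bits (illustration only).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}

# H(Y|X) as sum_x p(x) * H(Y | X = x)
h_cond = 0.0
for x in (0, 1):
    for y in (0, 1):
        p_y_given_x = p_xy[(x, y)] / p_x[x]
        if p_y_given_x > 0:
            h_cond -= p_x[x] * p_y_given_x * log2(p_y_given_x)

# H(Y|X) as -E[log p(Y|X)], the expectation taken over the joint distribution
h_cond_exp = -sum(pr * log2(pr / p_x[x]) for (x, y), pr in p_xy.items() if pr > 0)

print(h_cond, h_cond_exp)  # the two values agree
```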

Joint Entropy

Let \(X,Y\) be two random variables. \(\textup{H}(X,Y) =-\sum_x\sum_yp(x,y)\log p(x,y) = -\mathbb{E}\log p(X,Y)\)

To build intuition, we first introduce the following identity.

\[\textup{H}(X,Y) = \textup{H}(X) + \textup{H}(Y) - \textup{I}(X:Y)\]

It says that the joint entropy of \((X,Y)\) is all the information we can get from \(X\) and \(Y\) together, while \(\textup{I}(X:Y)\) is the information they share, i.e., what each one reveals about the other.
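
A quick numerical sanity check of this identity, on a made-up toy distribution (nothing here is special; any valid joint distribution works):

```python
from math import log2

# Toy joint distribution (made up for illustration).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(pr for (_, yy), pr in p_xy.items() if yy == y) for y in (0, 1)}

h_xy = -sum(pr * log2(pr) for pr in p_xy.values() if pr > 0)
h_x = -sum(pr * log2(pr) for pr in p_x.values() if pr > 0)
h_y = -sum(pr * log2(pr) for pr in p_y.values() if pr > 0)

# I(X:Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
i_xy = sum(pr * log2(pr / (p_x[x] * p_y[y])) for (x, y), pr in p_xy.items() if pr > 0)

print(h_xy, h_x + h_y - i_xy)  # both sides of H(X,Y) = H(X) + H(Y) - I(X:Y) agree
```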

Chain Rule for Entropy

For two random variables \(X,Y\), we have

\[\textup{H}(X,Y) =\textup{H}(X) + \textup{H}(Y\mid X) = \textup{H}(Y)+\textup{H}(X\mid Y)\]

Generally, we have

\[\textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)+ \cdots + \textbf{H}(X_1)\]

It is easy to prove the above equality.

For any \((x_1,\ldots, x_n)\) such that \(p(x_1,\ldots,x_n) > 0\), we have

\[p(x_1,\ldots, x_n) = p(x_n \mid x_{n-1}, \ldots,x_1) \cdots p(x_2\mid x_1) \cdot p(x_1)\]

Then,

\[\textbf{H}(X_1,\ldots,X_n) = -\mathbb{E} \log p(X_1,\ldots,X_n) = \\-\mathbb{E} \log p(X_n \mid X_{n-1},\ldots,X_1)-\cdots-\mathbb{E} \log p(X_1) = \\ \textbf{H}(X_1) + \cdots + \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)\]
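
The chain rule can also be verified numerically. The sketch below uses a made-up three-bit distribution and helper functions of my own (marg, H, cond_H); the conditional entropies are computed directly from conditional probabilities rather than via the identity being checked:

```python
from math import log2

# Hypothetical joint distribution over three bits (X1, X2, X3), illustration only.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
     (1, 0, 0): 0.20, (1, 0, 1): 0.05, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marg(dist, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def cond_H(dist, target, given):
    """H(X_target | X_given) = -sum p(given, target) log p(target | given)."""
    joint = marg(dist, given + [target])
    cond = marg(dist, given)
    return -sum(pr * log2(pr / cond[key[:len(given)]])
                for key, pr in joint.items() if pr > 0)

lhs = H(p)
rhs = H(marg(p, [0])) + cond_H(p, 1, [0]) + cond_H(p, 2, [0, 1])
print(lhs, rhs)  # H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2)
```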

Some Helpful Inequalities

\(\textbf{Lemma 1.}\) Given two random variables \(X,Y\), we have

\[\textup{H}(X) \geq \textup{H}(X\mid Y).\]

\(\textbf{Proof}\). We prove this as follows.

\[\textup{H}(X\mid Y) - \textup{H}(X) = \sum_y p(y) \sum_xp(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)} \\ =\sum_y p(y) \sum_xp(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)}\sum_y p(y\mid x)\\ =\sum_{x,y}p(x,y)\left( \log \frac{1}{p(x\mid y)} - \log \frac{1}{p(x)}\right)\\ =\sum_{x,y}p(x,y)\left(\log \frac{p(x)p(y)}{p(x,y)} \right)\]

(The second equality uses \(\sum_y p(y\mid x) = 1\).)

Jensen’s inequality:

\(f(\mathbb{E}[Z]) \leq \mathbb{E}[f(Z)]\) if \(f\) is convex;

\(f(\mathbb{E}[Z]) \geq \mathbb{E}[f(Z)]\) if \(f\) is concave.

Here \(\log\) is a concave function.

Let \(Z\) be the random variable that takes the value \(\frac{p(x)p(y)}{p(x,y)}\) with probability \(p(x,y)\). Then, Jensen’s inequality gives

\[\mathbb{E}[\log Z] \leq \log \mathbb{E}[Z].\]

Writing the expectations out explicitly,

\[\sum_{x,y}p(x,y)\log \frac{p(x)p(y)}{p(x,y)} \leq \log \left(\sum_{x,y}\frac{p(x)p(y)}{p(x,y)}\,p(x,y)\right) \leq \log\left(\sum_{x}p(x)\sum_{y}p(y)\right) = \log 1 = 0.\]

This completes the proof.
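
Lemma 1 is also easy to check numerically. In the toy example below (my own choice of distribution), \(X\) and \(Y\) are strongly correlated bits, so conditioning on \(Y\) cuts the entropy of \(X\) roughly in half:

```python
from math import log2

# Toy joint distribution with strongly correlated bits (illustration only).
p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(pr for (_, yy), pr in p_xy.items() if yy == y) for y in (0, 1)}

h_x = -sum(pr * log2(pr) for pr in p_x.values() if pr > 0)
h_x_given_y = -sum(pr * log2(pr / p_y[y]) for (x, y), pr in p_xy.items() if pr > 0)

print(h_x, h_x_given_y)  # H(X) = 1 bit, H(X|Y) ≈ 0.469 bits
assert h_x >= h_x_given_y
```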

Some Useful Inequalities

  1. \(\textbf{H}(A) \leq \textbf{H}(A \mid C ) + \textbf{H}(C)\).

  2. \(\textbf{H}(A\mid B) = p(B = 0) \cdot \textbf{H}(A\mid B = 0) + p(B = 1) \cdot \textbf{H}(A \mid B = 1)\) for \(B \in \{0,1\}\).

For item 2, let us see the details.

\[\textbf{H}(A \mid B ) = -\sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(b)} \\ = -\sum_{a}p(a,B=0)\log \frac{p(a,B=0)}{p(B=0)} - \sum_{a}p(a,B=1)\log \frac{p(a,B=1)}{p(B=1)} \\ = -p(B=0)\sum_{a}p(a\mid B=0)\log p(a\mid B=0) - p(B=1)\sum_{a}p(a\mid B=1)\log p(a\mid B=1) \\ = p(B = 0) \cdot \textbf{H}(A\mid B = 0) + p(B = 1) \cdot \textbf{H}(A \mid B = 1)\]
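
Both items can be checked numerically. The sketch below uses an arbitrary toy distribution over two bits; it verifies item 2 exactly and item 1 with \(C = B\):

```python
from math import log2

# Arbitrary toy joint distribution p(a, b) over two bits.
p_ab = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
p_b = {b: sum(pr for (_, bb), pr in p_ab.items() if bb == b) for b in (0, 1)}

# Item 2, left-hand side: H(A|B) = -sum p(a,b) log p(a|b)
lhs = -sum(pr * log2(pr / p_b[b]) for (a, b), pr in p_ab.items() if pr > 0)

# Item 2, right-hand side: mixture of the two conditional entropies
rhs = 0.0
for b in (0, 1):
    h_given_b = 0.0
    for a in (0, 1):
        pa = p_ab[(a, b)] / p_b[b]
        if pa > 0:
            h_given_b -= pa * log2(pa)
    rhs += p_b[b] * h_given_b
print(lhs, rhs)  # both sides agree

# Item 1 with C = B: H(A) <= H(A|B) + H(B)
p_a = {a: sum(pr for (aa, _), pr in p_ab.items() if aa == a) for a in (0, 1)}
h_a = -sum(pr * log2(pr) for pr in p_a.values() if pr > 0)
h_b = -sum(pr * log2(pr) for pr in p_b.values() if pr > 0)
assert h_a <= lhs + h_b
```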

\(\textbf{Lemma H. } \textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_1)+\textbf{H}(X_2 \mid X_1) + \cdots = \sum_{i=1}^n \textbf{H}(X_i \mid X_{j<i})\), where \(X_{j<i}\) denotes \((X_1,\ldots,X_{i-1})\).

\[\textbf{Lemma I. } \textbf{I}(X_1,X_2,\ldots,X_n: Y) = \textbf{I}(X_1: Y) + \textbf{I}(X_2: Y \mid X_1) + \textbf{I}(X_3: Y \mid X_2, X_1)+\cdots = \\ \textbf{H}(X_1,X_2,\ldots,X_n) - \textbf{H}(X_1,X_2,\ldots,X_n \mid Y) =\\ \sum_{i=1}^n\textbf{I}(X_i : Y \mid X_{j<i})\]
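
Lemma I can be checked numerically as well. The sketch below does this for \(n = 2\) on a made-up three-bit distribution, expressing each term through entropies via the standard identities \(\textbf{I}(A:B) = \textbf{H}(A)+\textbf{H}(B)-\textbf{H}(A,B)\) and \(\textbf{I}(A:B\mid C) = \textbf{H}(A,C)+\textbf{H}(B,C)-\textbf{H}(A,B,C)-\textbf{H}(C)\):

```python
from math import log2

# Hypothetical joint distribution p(x1, x2, y) over three bits (illustration only).
p = {(0, 0, 0): 0.15, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.10,
     (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marg(dist, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y
i_x1x2_y = H(marg(p, [0, 1])) + H(marg(p, [2])) - H(p)
i_x1_y = H(marg(p, [0])) + H(marg(p, [2])) - H(marg(p, [0, 2]))
i_x2_y_given_x1 = H(marg(p, [0, 1])) + H(marg(p, [0, 2])) - H(p) - H(marg(p, [0]))

print(i_x1x2_y, i_x1_y + i_x2_y_given_x1)  # I(X1,X2:Y) = I(X1:Y) + I(X2:Y|X1)
```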

Important Inequalities

For entropy,

\(\textbf{H}(A\mid B) \leq \textbf{H}(A)\).

For mutual information, dropping the conditioning may either increase or decrease the value; it depends on the joint distribution.

Example 1:

\[\textbf{I}(X: Y \mid Z) > \textbf{I}(X: Y)\]

This happens when \(X\) and \(Y\) are (almost) independent on their own but become dependent once \(Z\) is known.
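
A standard concrete instance (my choice of example, not from the original post): take \(X, Y\) to be independent uniform bits and \(Z = X \oplus Y\). Then \(\textbf{I}(X:Y) = 0\) while \(\textbf{I}(X:Y \mid Z) = 1\):

```python
from math import log2

# X, Y independent uniform bits, Z = X xor Y: the four possible triples (x, y, z).
p = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}

def marg(dist, idxs):
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

# I(X:Y) = H(X) + H(Y) - H(X,Y); I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
i_xy = H(marg(p, [0])) + H(marg(p, [1])) - H(marg(p, [0, 1]))
i_xy_given_z = H(marg(p, [0, 2])) + H(marg(p, [1, 2])) - H(p) - H(marg(p, [2]))

print(i_xy, i_xy_given_z)  # 0.0 and 1.0: conditioning on Z creates dependence
```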

Example 2:

\[\textbf{I}(X: Y \mid Z) < \textbf{I}(X: Y)\]

This happens when \(Z \to X\) and \(Z\to Y\) (\(Z\) is a common cause of \(X\) and \(Y\)), so conditioning on \(Z\) removes the dependence it induces.
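
A standard concrete instance (again my own example): let \(Z\) be a uniform bit and set \(X = Y = Z\). Then \(\textbf{I}(X:Y) = 1\) while \(\textbf{I}(X:Y \mid Z) = 0\):

```python
from math import log2

# Z uniform bit, X = Y = Z: only the triples (0,0,0) and (1,1,1) occur.
p = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def marg(dist, idxs):
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

i_xy = H(marg(p, [0])) + H(marg(p, [1])) - H(marg(p, [0, 1]))
i_xy_given_z = H(marg(p, [0, 2])) + H(marg(p, [1, 2])) - H(p) - H(marg(p, [2]))

print(i_xy, i_xy_given_z)  # 1.0 and 0.0: conditioning on the common cause removes the dependence
```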