Grocery on Distributed Algorithms

T1: Basics about Entropy

by Xianbin · 10 min read

This post gives some basic background on entropy.

Conditional Entropy

\[\textbf{H}(Y\mid X) = \sum_{x\in X} p(x)\, \textbf{H}(Y\mid X=x) = - \sum_{x\in X}p(x) \sum_{y\in Y}p(y\mid x) \log p(y\mid x) \\= -\sum_{x\in X}\sum_{y\in Y} p(x,y) \log p(y\mid x) = - \mathbb{E}\big(\log p(Y \mid X)\big)\]
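
To make the definition concrete, here is a minimal Python sketch (the joint distribution and helper names are my own made-up example) that computes \(\textbf{H}(Y\mid X)\) both as \(\sum_{x} p(x)\,\textbf{H}(Y\mid X=x)\) and as \(-\mathbb{E}[\log p(Y\mid X)]\) and checks that they agree:

```python
from math import log2

# Hypothetical joint distribution p(x, y) over two bits (illustration only).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}

# H(Y|X) as sum_x p(x) * H(Y | X = x)
h_cond = 0.0
for x in (0, 1):
    for y in (0, 1):
        p_y_given_x = p_xy[(x, y)] / p_x[x]
        if p_y_given_x > 0:
            h_cond -= p_x[x] * p_y_given_x * log2(p_y_given_x)

# H(Y|X) as -E[log p(Y|X)], the expectation taken over the joint distribution
h_cond_exp = -sum(pr * log2(pr / p_x[x]) for (x, y), pr in p_xy.items() if pr > 0)

print(h_cond, h_cond_exp)  # the two values agree
```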

Joint Entropy

Let \(X,Y\) be two random variables. \(\textup{H}(X,Y) =-\sum_x\sum_yp(x,y)\log p(x,y) = -\mathbb{E}\log p(X,Y)\)

To build intuition, we first introduce the following identity.

\[\textup{H}(X,Y) = \textup{H}(X) + \textup{H}(Y) - \textup{I}(X:Y)\]

It says that the joint entropy of \((X,Y)\) is all the information we can get from \(X\) and \(Y\) together, while \(\textup{I}(X:Y)\) is the information they share, i.e., what each one reveals about the other.
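
A quick numerical sanity check of this identity, on a made-up toy distribution (nothing here is special; any valid joint distribution works):

```python
from math import log2

# Toy joint distribution (made up for illustration).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(pr for (_, yy), pr in p_xy.items() if yy == y) for y in (0, 1)}

h_xy = -sum(pr * log2(pr) for pr in p_xy.values() if pr > 0)
h_x = -sum(pr * log2(pr) for pr in p_x.values() if pr > 0)
h_y = -sum(pr * log2(pr) for pr in p_y.values() if pr > 0)

# I(X:Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
i_xy = sum(pr * log2(pr / (p_x[x] * p_y[y])) for (x, y), pr in p_xy.items() if pr > 0)

print(h_xy, h_x + h_y - i_xy)  # both sides of H(X,Y) = H(X) + H(Y) - I(X:Y) agree
```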

Chain Rule for Entropy

For two random variables \(X,Y\), we have

\[\textup{H}(X,Y) =\textup{H}(X) + \textup{H}(Y\mid X) = \textup{H}(Y)+\textup{H}(X\mid Y)\]

Generally, we have

\[\textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)+ \cdots + \textbf{H}(X_1)\]

It is easy to prove the above equality.

For any \((x_1,\ldots, x_n)\) such that \(p(x_1,\ldots,x_n) > 0\), we have

\[p(x_1,\ldots, x_n) = p(x_n \mid x_{n-1}, \ldots,x_1) \cdots p(x_2\mid x_1) \cdot p(x_1)\]

Then,

\[\textbf{H}(X_1,\ldots,X_n) = -\mathbb{E} \log p(X_1,\ldots,X_n) = \\-\mathbb{E} \log p(X_n \mid X_{n-1},\ldots,X_1)-\cdots-\mathbb{E} \log p(X_1) = \\ \textbf{H}(X_1) + \cdots + \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)\]
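
The chain rule can also be verified numerically. The sketch below uses a made-up three-bit distribution and helper functions of my own (marg, H, cond_H); the conditional entropies are computed directly from conditional probabilities rather than via the identity being checked:

```python
from math import log2

# Hypothetical joint distribution over three bits (X1, X2, X3), illustration only.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
     (1, 0, 0): 0.20, (1, 0, 1): 0.05, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marg(dist, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def cond_H(dist, target, given):
    """H(X_target | X_given) = -sum p(given, target) log p(target | given)."""
    joint = marg(dist, given + [target])
    cond = marg(dist, given)
    return -sum(pr * log2(pr / cond[key[:len(given)]])
                for key, pr in joint.items() if pr > 0)

lhs = H(p)
rhs = H(marg(p, [0])) + cond_H(p, 1, [0]) + cond_H(p, 2, [0, 1])
print(lhs, rhs)  # H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2)
```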

Some Helpful Inequalities

\(\textbf{Lemma 1.}\) Given two random variables \(X,Y\), we have

\[\textup{H}(X) \geq \textup{H}(X\mid Y).\]

\(\textbf{Proof}\). We prove this as follows.

\[\textup{H}(X\mid Y) - \textup{H}(X) = \sum_y p(y) \sum_xp(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)} \\ =\sum_y p(y) \sum_xp(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)}\sum_y p(y\mid x)\\ =\sum_{x,y}p(x,y)\left( \log \frac{1}{p(x\mid y)} - \log \frac{1}{p(x)}\right)\\ =\sum_{x,y}p(x,y)\left(\log \frac{p(x)p(y)}{p(x,y)} \right)\]

(The second equality uses \(\sum_y p(y\mid x) = 1\).)

Jensen’s inequality:

\(f(\mathbb{E}[Z]) \leq \mathbb{E}[f(Z)]\) if \(f\) is convex;

\(f(\mathbb{E}[Z]) \geq \mathbb{E}[f(Z)]\) if \(f\) is concave.

Here \(\log\) is a concave function.

Let \(Z\) be the random variable that takes the value \(\frac{p(x)p(y)}{p(x,y)}\) with probability \(p(x,y)\). Then, Jensen’s inequality gives

\[\mathbb{E}[\log Z] \leq \log \mathbb{E}[Z].\]

Writing the expectations out explicitly,

\[\sum_{x,y}p(x,y)\log \frac{p(x)p(y)}{p(x,y)} \leq \log \left(\sum_{x,y}\frac{p(x)p(y)}{p(x,y)}\,p(x,y)\right) \leq \log\left(\sum_{x}p(x)\sum_{y}p(y)\right) = \log 1 = 0.\]

This completes the proof.
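
Lemma 1 is also easy to check numerically. In the toy example below (my own choice of distribution), \(X\) and \(Y\) are strongly correlated bits, so conditioning on \(Y\) cuts the entropy of \(X\) roughly in half:

```python
from math import log2

# Toy joint distribution with strongly correlated bits (illustration only).
p_xy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
p_x = {x: sum(pr for (xx, _), pr in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(pr for (_, yy), pr in p_xy.items() if yy == y) for y in (0, 1)}

h_x = -sum(pr * log2(pr) for pr in p_x.values() if pr > 0)
h_x_given_y = -sum(pr * log2(pr / p_y[y]) for (x, y), pr in p_xy.items() if pr > 0)

print(h_x, h_x_given_y)  # H(X) = 1 bit, H(X|Y) ≈ 0.469 bits
assert h_x >= h_x_given_y
```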

Some Useful Inequalities

  1. \(\textbf{H}(A) \leq \textbf{H}(A \mid C ) + \textbf{H}(C)\).

  2. \(\textbf{H}(A\mid B) = p(B = 0) \cdot \textbf{H}(A\mid B = 0) + p(B = 1) \cdot \textbf{H}(A \mid B = 1)\) for \(B \in \{0,1\}\).

For item 2, let us see the details.

\[\textbf{H}(A \mid B ) = -\sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(b)} \\ = -\sum_{a}p(a,B=0)\log \frac{p(a,B=0)}{p(B=0)} - \sum_{a}p(a,B=1)\log \frac{p(a,B=1)}{p(B=1)} \\ = -p(B=0)\sum_{a}p(a\mid B=0)\log p(a\mid B=0) - p(B=1)\sum_{a}p(a\mid B=1)\log p(a\mid B=1) \\ = p(B = 0) \cdot \textbf{H}(A\mid B = 0) + p(B = 1) \cdot \textbf{H}(A \mid B = 1)\]
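
Both items can be checked numerically. The sketch below uses an arbitrary toy distribution over two bits; it verifies item 2 exactly and item 1 with \(C = B\):

```python
from math import log2

# Arbitrary toy joint distribution p(a, b) over two bits.
p_ab = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
p_b = {b: sum(pr for (_, bb), pr in p_ab.items() if bb == b) for b in (0, 1)}

# Item 2, left-hand side: H(A|B) = -sum p(a,b) log p(a|b)
lhs = -sum(pr * log2(pr / p_b[b]) for (a, b), pr in p_ab.items() if pr > 0)

# Item 2, right-hand side: mixture of the two conditional entropies
rhs = 0.0
for b in (0, 1):
    h_given_b = 0.0
    for a in (0, 1):
        pa = p_ab[(a, b)] / p_b[b]
        if pa > 0:
            h_given_b -= pa * log2(pa)
    rhs += p_b[b] * h_given_b
print(lhs, rhs)  # both sides agree

# Item 1 with C = B: H(A) <= H(A|B) + H(B)
p_a = {a: sum(pr for (aa, _), pr in p_ab.items() if aa == a) for a in (0, 1)}
h_a = -sum(pr * log2(pr) for pr in p_a.values() if pr > 0)
h_b = -sum(pr * log2(pr) for pr in p_b.values() if pr > 0)
assert h_a <= lhs + h_b
```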

\(\textbf{Lemma H. } \textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_1)+\textbf{H}(X_2 \mid X_1) + \cdots = \sum_{i=1}^n \textbf{H}(X_i \mid X_{j<i})\), where \(X_{j<i}\) denotes \((X_1,\ldots,X_{i-1})\).

\[\textbf{Lemma I. } \textbf{I}(X_1,X_2,\ldots,X_n: Y) = \textbf{I}(X_1: Y) + \textbf{I}(X_2: Y \mid X_1) + \textbf{I}(X_3: Y \mid X_2, X_1)+\cdots = \\ \textbf{H}(X_1,X_2,\ldots,X_n) - \textbf{H}(X_1,X_2,\ldots,X_n \mid Y) =\\ \sum_{i=1}^n\textbf{I}(X_i : Y \mid X_{j<i})\]
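
Lemma I can be checked numerically as well. The sketch below does this for \(n = 2\) on a made-up three-bit distribution, expressing each term through entropies via the standard identities \(\textbf{I}(A:B) = \textbf{H}(A)+\textbf{H}(B)-\textbf{H}(A,B)\) and \(\textbf{I}(A:B\mid C) = \textbf{H}(A,C)+\textbf{H}(B,C)-\textbf{H}(A,B,C)-\textbf{H}(C)\):

```python
from math import log2

# Hypothetical joint distribution p(x1, x2, y) over three bits (illustration only).
p = {(0, 0, 0): 0.15, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.10,
     (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

def marg(dist, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y
i_x1x2_y = H(marg(p, [0, 1])) + H(marg(p, [2])) - H(p)
i_x1_y = H(marg(p, [0])) + H(marg(p, [2])) - H(marg(p, [0, 2]))
i_x2_y_given_x1 = H(marg(p, [0, 1])) + H(marg(p, [0, 2])) - H(p) - H(marg(p, [0]))

print(i_x1x2_y, i_x1_y + i_x2_y_given_x1)  # I(X1,X2:Y) = I(X1:Y) + I(X2:Y|X1)
```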

Important Inequalities

For entropy,

\(\textbf{H}(A\mid B) \leq \textbf{H}(A)\).

For mutual information, dropping the conditioning may either increase or decrease the value; it depends on the joint distribution.

Example 1:

\[\textbf{I}(X: Y \mid Z) > \textbf{I}(X: Y)\]

This happens when \(X\) and \(Y\) are (almost) independent on their own but become dependent once \(Z\) is known.
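
A standard concrete instance (my choice of example, not from the original post): take \(X, Y\) to be independent uniform bits and \(Z = X \oplus Y\). Then \(\textbf{I}(X:Y) = 0\) while \(\textbf{I}(X:Y \mid Z) = 1\):

```python
from math import log2

# X, Y independent uniform bits, Z = X xor Y: the four possible triples (x, y, z).
p = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}

def marg(dist, idxs):
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

# I(X:Y) = H(X) + H(Y) - H(X,Y); I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
i_xy = H(marg(p, [0])) + H(marg(p, [1])) - H(marg(p, [0, 1]))
i_xy_given_z = H(marg(p, [0, 2])) + H(marg(p, [1, 2])) - H(p) - H(marg(p, [2]))

print(i_xy, i_xy_given_z)  # 0.0 and 1.0: conditioning on Z creates dependence
```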

Example 2:

\[\textbf{I}(X: Y \mid Z) < \textbf{I}(X: Y)\]

This happens when \(Z \to X\) and \(Z\to Y\) (\(Z\) is a common cause of \(X\) and \(Y\)), so conditioning on \(Z\) removes the dependence it induces.
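
A standard concrete instance (again my own example): let \(Z\) be a uniform bit and set \(X = Y = Z\). Then \(\textbf{I}(X:Y) = 1\) while \(\textbf{I}(X:Y \mid Z) = 0\):

```python
from math import log2

# Z uniform bit, X = Y = Z: only the triples (0,0,0) and (1,1,1) occur.
p = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def marg(dist, idxs):
    out = {}
    for xs, pr in dist.items():
        key = tuple(xs[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def H(dist):
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

i_xy = H(marg(p, [0])) + H(marg(p, [1])) - H(marg(p, [0, 1]))
i_xy_given_z = H(marg(p, [0, 2])) + H(marg(p, [1, 2])) - H(p) - H(marg(p, [2]))

print(i_xy, i_xy_given_z)  # 1.0 and 0.0: conditioning on the common cause removes the dependence
```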