Toolbox

T1: Basics about Entropy

by Xianbin

This post gives some basic background on entropy.

Conditional Entropy

\textbf{H}(Y\mid X) = \sum_{x\in X} H(Y\mid x)\, p(x) = - \sum_{x\in X}p(x) \sum_{y\in Y}p(y\mid x) \log p(y\mid x) \\= -\sum_{x\in X}\sum_{y\in Y} p(x,y) \log p(y\mid x) = - \mathbb{E}\big(\log p(Y \mid X)\big)
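
As a quick sanity check (the joint distribution below is made up for illustration), the following Python sketch computes H(Y|X) both as the weighted average \sum_x p(x) H(Y\mid X=x) and as -\mathbb{E}[\log p(Y\mid X)]; the two values agree.

```python
from math import log2

# Made-up joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(x).
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(Y|X) as the weighted average sum_x p(x) * H(Y | X = x).
h_avg = 0.0
for (x, y), p in p_xy.items():
    p_y_given_x = p / p_x[x]
    h_avg -= p_x[x] * p_y_given_x * log2(p_y_given_x)

# H(Y|X) as -E[ log p(Y|X) ] under the joint distribution.
h_exp = -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items())

print(h_avg, h_exp)  # identical (about 0.846 bits for this example)
```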

Joint Entropy

Let X,Y be two random variables. \textup{H}(X,Y) =-\sum_x\sum_yp(x,y)\log p(x,y) = -\mathbb{E}\log p(X,Y)

To get an intuitive meaning, we first introduce the following identity.

\textup{H}(X,Y) = \textup{H}(X) + \textup{H}(Y) - \textup{I}(X,Y)

It means that the joint entropy of (X,Y) is all the information we can get from X and Y together. We can also say that I(X,Y) is the information that X and Y leak about each other.
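
Here is a similar sketch (again with a made-up distribution) checking the identity \textup{H}(X,Y) = \textup{H}(X) + \textup{H}(Y) - \textup{I}(X,Y), where I(X,Y) is computed from its standard definition \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}:

```python
from math import log2

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}  # made-up example

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Mutual information I(X,Y) = sum p(x,y) log[ p(x,y) / (p(x)p(y)) ].
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(entropy(p_xy), entropy(p_x) + entropy(p_y) - mi)  # both sides match
```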

Chain rule for entropy

For two random variables X,Y, we have

\textup{H}(X,Y) =\textup{H}(X) + \textup{H}(Y\mid X) = \textup{H}(Y)+\textup{H}(X\mid Y)

Generally, we have

\textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)+ \cdots + \textbf{H}(X_1)

It is easy to prove the above equality.

For any (x_1,\ldots,x_n) such that p(x_1,\ldots,x_n) > 0, we have

p(x_1,\ldots, x_n) = p(x_n \mid x_{n-1}, \ldots,x_1) \cdots p(x_2\mid x_1) \cdot p(x_1)

Then,

\textbf{H}(X_1,\ldots,X_n) = -\mathbb{E} \log P(X_1,\ldots,X_n) = \\-\mathbb{E} \log P(X_n \mid X_{n-1},\ldots,X_1)-\cdots-\mathbb{E} \log P(X_1) = \\ \textbf{H}(X_1) + \cdots + \textbf{H}(X_n \mid X_{n-1},\ldots,X_1)
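
The chain rule can also be checked numerically. The sketch below (my own illustration) builds a random joint distribution over three binary variables and compares H(X_1,X_2,X_3) with H(X_1)+H(X_2\mid X_1)+H(X_3\mid X_2,X_1), computing each conditional entropy directly from its definition.

```python
import itertools
import random
from math import log2

random.seed(0)

# Random joint distribution over (X1, X2, X3), each variable binary.
support = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in support]
p = {xs: w / sum(weights) for xs, w in zip(support, weights)}

def marginal(dist, k):
    """Marginal distribution of the first k variables."""
    out = {}
    for xs, q in dist.items():
        out[xs[:k]] = out.get(xs[:k], 0.0) + q
    return out

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def cond_entropy(dist, i):
    """H(X_i | X_{i-1},...,X_1) = -sum p(x_1..x_i) log p(x_i | x_1..x_{i-1})."""
    joint_i, prev = marginal(dist, i), marginal(dist, i - 1)
    return -sum(q * log2(q / prev[xs[:-1]]) for xs, q in joint_i.items())

lhs = entropy(p)                                  # H(X1, X2, X3)
rhs = sum(cond_entropy(p, i) for i in (1, 2, 3))  # H(X1)+H(X2|X1)+H(X3|X2,X1)
print(lhs, rhs)  # equal up to floating-point error
```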

Some helpful Inequalities

\textbf{Lemma 1}. Given two random variables X,Y, we have

\textup{H}(X) \geq \textup{H}(X\mid Y).

\textbf{Proof}. We prove this as follows.

\textup{H}(X\mid Y) - \textup{H}(X) = \sum_y p(y) \sum_xp(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)} \\ =\sum_y p(y) \sum_x p(x\mid y)\log \frac{1}{p(x\mid y)} - \sum_x p(x) \log \frac{1}{p(x)}\sum_y p(y\mid x)\\ =\sum_{x,y}p(x,y)\left( \log \frac{1}{p(x\mid y)} - \log \frac{1}{p(x)}\right)\\ =\sum_{x,y}p(x,y)\log \frac{p(x)p(y)}{p(x,y)}

Generalized Jensen’s inequality:

f(\mathbb{E}[Z]) \leq \mathbb{E}[f(Z)] if f is convex;

f(\mathbb{E}[Z]) \geq \mathbb{E}[f(Z)] if f is concave.

Here \log is a concave function.

Let Z = \frac{p(x)p(y)}{p(x,y)}, viewed as a random variable that takes the value \frac{p(x)p(y)}{p(x,y)} with probability p(x,y). Then, we obtain that

\sum_Z p(Z) \log Z \leq \log \sum_Z Z\, p(Z)

Then,

\sum_{x,y}p(x,y)\log \frac{p(x)p(y)}{p(x,y)} \leq \log \left(\sum_{x,y}\frac{p(x)p(y)}{p(x,y)}\, p(x,y)\right) = \log\left(\sum_{x,y}p(x)p(y)\right) = \log 1 = 0.

This completes the proof.
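
As a numerical sanity check of Lemma 1 (my own illustration), the snippet below draws random joint distributions and verifies that the quantity appearing at the end of the proof, \sum_{x,y} p(x,y)\log\frac{p(x)p(y)}{p(x,y)} = \textup{H}(X\mid Y) - \textup{H}(X), is never positive.

```python
import itertools
import random
from math import log2

random.seed(1)
support = list(itertools.product(range(3), repeat=2))  # (x, y) pairs, 3 values each

for _ in range(1000):
    # Random joint distribution p(x, y).
    w = [random.random() for _ in support]
    p_xy = {s: wi / sum(w) for s, wi in zip(support, w)}

    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p

    # H(X|Y) - H(X) = sum_{x,y} p(x,y) log[ p(x)p(y) / p(x,y) ], which is <= 0.
    diff = sum(p * log2(p_x[x] * p_y[y] / p) for (x, y), p in p_xy.items())
    assert diff <= 1e-12

print("H(X|Y) <= H(X) held in all trials")
```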

Some useful inequalities

  1. \textbf{H}(A) \leq \textbf{H}(A \mid C) + \textbf{H}(C).

  2. \textbf{H}(A\mid B) = p(B = 0) \cdot \textbf{H}(A\mid B = 0) + p(B = 1) \cdot \textbf{H}(A \mid B = 1) for B \in \{0,1\}.

For 2, let us work through the details.

\textbf{H}(A \mid B) = -\sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(b)} \\ = -\sum_{a}p(a,b=0)\log p(a\mid b=0) - \sum_{a}p(a,b=1)\log p(a\mid b=1) \\ = -p(b=0)\sum_{a}p(a\mid b=0)\log p(a\mid b=0) - p(b=1)\sum_{a}p(a\mid b=1)\log p(a\mid b=1) \\ = P(B = 0) \cdot \textbf{H}(A\mid B = 0) + P(B = 1) \cdot \textbf{H}(A \mid B = 1)
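
A quick numerical check of identity 2 (my own example, with a made-up distribution over A \in \{0,1,2\} and B \in \{0,1\}):

```python
from math import log2

# Made-up joint distribution p(a, b) with A in {0,1,2} and B in {0,1}.
p_ab = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.15,
        (0, 1): 0.20, (1, 1): 0.05, (2, 1): 0.25}

p_b = {b: sum(p for (a, bb), p in p_ab.items() if bb == b) for b in (0, 1)}

def cond_entropy_given(b):
    """H(A | B = b)."""
    cond = {a: p / p_b[b] for (a, bb), p in p_ab.items() if bb == b}
    return -sum(q * log2(q) for q in cond.values())

# H(A|B) from the joint distribution ...
h_joint = -sum(p * log2(p / p_b[b]) for (a, b), p in p_ab.items())
# ... equals p(B=0) H(A|B=0) + p(B=1) H(A|B=1).
h_split = p_b[0] * cond_entropy_given(0) + p_b[1] * cond_entropy_given(1)
print(h_joint, h_split)  # the two agree
```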

\textbf{Lemma H.} \textbf{H}(X_1,\ldots,X_n) = \textbf{H}(X_1)+\textbf{H}(X_2 \mid X_1) + \cdots = \sum_{i=1}^n \textbf{H}(X_i \mid X_{j<i}), where X_{j<i} = X_{i-1},\ldots,X_1.

\textbf{Lemma I.} \textbf{I}(X_1,X_2,\ldots,X_n : Y) = \textbf{I}(X_1 : Y) + \textbf{I}(X_2 : Y \mid X_1) + \textbf{I}(X_3 : Y \mid X_2, X_1)+\cdots = \\ \textbf{H}(X_1,X_2,\ldots,X_n) - \textbf{H}(X_1,X_2,\ldots,X_n \mid Y) =\\ \sum_{i=1}^n\textbf{I}(X_i : Y \mid X_{j<i})
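
Lemma I can be checked the same way. The sketch below (my own illustration) verifies, for a random joint distribution over three binary variables, that I(X_1,X_2 : Y) computed from its definition equals I(X_1 : Y) + I(X_2 : Y \mid X_1), with the right-hand terms expressed through entropies.

```python
import itertools
import random
from math import log2

random.seed(0)

# Random joint distribution over (X1, X2, Y), each binary; indices 0, 1, 2.
support = list(itertools.product([0, 1], repeat=3))
w = [random.random() for _ in support]
p = {s: wi / sum(w) for s, wi in zip(support, w)}

def marginal(idx):
    """Marginal distribution of the variables at the given indices."""
    m = {}
    for xs, q in p.items():
        key = tuple(xs[i] for i in idx)
        m[key] = m.get(key, 0.0) + q
    return m

def H(idx):
    return -sum(q * log2(q) for q in marginal(idx).values() if q > 0)

# Left side: I(X1,X2 : Y) from the definition
#   sum p(x1,x2,y) log[ p(x1,x2,y) / (p(x1,x2) p(y)) ].
p12, py = marginal([0, 1]), marginal([2])
lhs = sum(q * log2(q / (p12[(x1, x2)] * py[(y,)]))
          for (x1, x2, y), q in p.items())

# Right side: I(X1 : Y) + I(X2 : Y | X1), expressed through entropies.
i1 = H([0]) + H([2]) - H([0, 2])                    # I(X1 : Y)
i2 = H([0, 1]) + H([0, 2]) - H([0, 1, 2]) - H([0])  # I(X2 : Y | X1)
print(lhs, i1 + i2)  # equal up to floating-point error
```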

Important Inequalities

For entropy,

\textbf{H}(A\mid B) \leq \textbf{H}(A).

For mutual information, dropping a condition can either increase or decrease the value; it depends on the variables involved, as the toy examples below show.
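
Two standard toy cases make this concrete: for independent fair bits X, Y with Z = X \oplus Y, conditioning on Z raises I(X : Y) from 0 to 1 bit; for Y = Z = X it lowers I(X : Y) from 1 bit to 0. A small sketch (my own illustration):

```python
import itertools
from math import log2

def H(p, idx):
    """Entropy of the variables at positions idx of a joint distribution p."""
    m = {}
    for xs, q in p.items():
        key = tuple(xs[i] for i in idx)
        m[key] = m.get(key, 0.0) + q
    return -sum(q * log2(q) for q in m.values() if q > 0)

def mi(p):       # I(X : Y)
    return H(p, [0]) + H(p, [1]) - H(p, [0, 1])

def cond_mi(p):  # I(X : Y | Z)
    return H(p, [0, 2]) + H(p, [1, 2]) - H(p, [0, 1, 2]) - H(p, [2])

# Case 1: X, Y independent fair bits, Z = X xor Y.
p_xor = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}
print(mi(p_xor), cond_mi(p_xor))    # 0.0 and 1.0: conditioning increases I

# Case 2: Y = Z = X for a fair bit X.
p_copy = {(x, x, x): 0.5 for x in [0, 1]}
print(mi(p_copy), cond_mi(p_copy))  # 1.0 and 0.0: conditioning decreases I
```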

Other Tips

\textbf{H}(A\mid B, C) = \textbf{H}(A\mid B)

if A\perp C \mid B (equivalently, C\perp A \mid B).

There are many similar tips. Here we show the key idea with the following proof.

\textbf{H}(A \mid B) = \textbf{H}(A)

if A\perp B.

Then, conditioning everything on B and applying the same fact to the pair (A, C), we get

\textbf{H}(A\mid B, C) = \textbf{H}(A\mid B)

if A\perp C \mid B (equivalently, C\perp A \mid B).
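
To see the tip in action, the sketch below (my own illustration) builds a distribution of the form p(a,b,c) = p(c)\,p(b\mid c)\,p(a\mid b), so that A \perp C \mid B holds by construction, and checks that H(A\mid B,C) = H(A\mid B).

```python
import itertools
import random
from math import log2

random.seed(0)

# Build p(a, b, c) = p(c) p(b|c) p(a|b), so A is independent of C given B.
p_c = {0: 0.4, 1: 0.6}
p_b_given_c = {c: {0: random.random()} for c in p_c}
for c in p_b_given_c:
    p_b_given_c[c][1] = 1 - p_b_given_c[c][0]
p_a_given_b = {b: {0: random.random()} for b in (0, 1)}
for b in p_a_given_b:
    p_a_given_b[b][1] = 1 - p_a_given_b[b][0]

p = {}
for a, b, c in itertools.product([0, 1], repeat=3):
    p[(a, b, c)] = p_c[c] * p_b_given_c[c][b] * p_a_given_b[b][a]

def H(idx):
    """Entropy of the variables at positions idx (0 = A, 1 = B, 2 = C)."""
    m = {}
    for xs, q in p.items():
        key = tuple(xs[i] for i in idx)
        m[key] = m.get(key, 0.0) + q
    return -sum(q * log2(q) for q in m.values() if q > 0)

# H(A | B, C) = H(A,B,C) - H(B,C) and H(A | B) = H(A,B) - H(B).
print(H([0, 1, 2]) - H([1, 2]), H([0, 1]) - H([1]))  # equal: conditioning on C drops out
```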