Toolbox

T1: Coding Theory (1)

by Xianbin

To encode some information, we use bits. In general, if there are $M$ symbols, we need $\log_2 M$ bits. It seems that we cannot do better.

However, if the symbols have some particular properties, we can do better. Now, things become interesting. What kinds of properties can we exploit so that the encoding uses fewer than $\log_2 M$ bits?

Example 1:

Imagine that we have a file with 2000 symbols, where each symbol is one of the four two-bit strings “00, 01, 10, 11”, appearing with probabilities $1/2, 1/4, 1/8, 1/8$ respectively. (Such a property is easy to learn with property testing algorithms.) Spending two bits on every symbol is not optimal: we can instead use “0, 10, 110, 111” to represent these four symbols, and the expected number of bits per symbol becomes smaller. In short, the main idea is to encode the frequent symbols with fewer bits. We call such a strategy variable-length coding.
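As a quick sanity check, here is a minimal sketch (plain Python, with the probabilities and codewords taken from this example) comparing the expected number of bits per symbol under the naive fixed-length code and the variable-length code:

```python
# Expected bits per symbol: fixed-length vs. variable-length coding.
probs = {"00": 1/2, "01": 1/4, "10": 1/8, "11": 1/8}
var_code = {"00": "0", "01": "10", "10": "110", "11": "111"}

fixed_bits = 2  # the naive encoding spends 2 bits on every symbol
expected_var_bits = sum(p * len(var_code[s]) for s, p in probs.items())

print(f"fixed-length:    {fixed_bits} bits/symbol")           # 2
print(f"variable-length: {expected_var_bits} bits/symbol")    # 1.75
print(f"2000-symbol file: {2000 * fixed_bits} vs {2000 * expected_var_bits:.0f} bits")
```

The variable-length code spends 1.75 bits per symbol on average, which (not by accident) equals the entropy of this source.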

Issues:

Of course, there are problems.

  1. First, if we apply an arbitrary variable-length code, the decoding may be ambiguous: a bit string can correspond to more than one sequence of symbols.
  2. Second, even when the decoding is unique, we may have to read ahead before we can decode a codeword. If the code is prefix-free, each codeword can be decoded as soon as it has been read, as illustrated in the sketch after this list.
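To make the prefix-free point concrete, here is a minimal sketch (the greedy decoder below is illustrative, not a library routine) that decodes the code from Example 1 symbol by symbol as the bits arrive:

```python
# Greedy decoding of a prefix-free code: since no codeword is a prefix of
# another, we can emit a symbol as soon as the buffered bits match a codeword.
code = {"00": "0", "01": "10", "10": "110", "11": "111"}
decode_table = {cw: sym for sym, cw in code.items()}

def decode(bits: str) -> list[str]:
    symbols, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in decode_table:      # a match is final for a prefix code
            symbols.append(decode_table[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return symbols

encoded = "".join(code[s] for s in ["00", "10", "01", "11"])
print(encoded)          # "011010111"
print(decode(encoded))  # ['00', '10', '01', '11']
```

With a code that is not prefix-free, such as {“0”, “01”, “10”}, the same greedy strategy fails: after reading “0” we cannot tell whether it is a complete codeword or the start of “01”.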

Useful Inequalities

Kraft’s inequality gives an exact condition for the existence of a prefix code in terms of the lengths of its codewords.

$\textbf{Lemma 1}$ [Kraft’s Inequality]. A binary instantaneous (prefix) code with codeword lengths $\{\ell_i\}_{i=1}^N$ exists iff $\sum_{i=1}^N 2^{-\ell_i} \leq 1$.

$\textbf{Proof}$. We first prove that the condition is necessary. Expand the binary prefix code into a binary tree (0 -> go left; 1 -> go right), so that each codeword such as “000111” corresponds to a path from the root. Let $C$ be a prefix code with codewords $w_1,\ldots,w_N$ of lengths $\ell_1,\ldots,\ell_N$. Each codeword $w_i$ corresponds to an interval $I(w_i) \subseteq [0,1)$, namely the set of reals whose binary expansion starts with $w_i$, which has length $2^{-\ell_i}$. Since $C$ is a prefix code, these intervals do not overlap, so

$$\sum_{i=1}^N \lvert I(w_i) \rvert = \sum_{i=1}^N 2^{-\ell_i} \leq 1.$$

Next, we prove that the condition is sufficient. That is, given a set of codeword lengths $\{\ell_i\}$ satisfying the inequality, we can construct a prefix code. To do this, we only need to give an algorithm. After sorting (and relabeling), we may assume $\ell_1 \leq \cdots \leq \ell_N$. We divide $[0,1)$ into intervals as follows.

$$[l_i, r_i) = \left[ \sum_{j=1}^{i-1} 2^{-\ell_j},\; \sum_{j=1}^{i} 2^{-\ell_j} \right)$$

where $i \in [N]$.

(Notice that the length of $[l_i, r_i)$ is exactly $2^{-\ell_i}$, and by the Kraft condition $r_N = \sum_{j=1}^N 2^{-\ell_j} \leq 1$, so all of these intervals fit inside $[0,1)$.)

Then $[l_i, r_i) = I(w_i)$ for a binary string $w_i$ of length $\ell_i$: since $\ell_j \leq \ell_i$ for $j < i$, the left endpoint $l_i$ is a multiple of $2^{-\ell_i}$, so $w_i$ is simply its $\ell_i$-bit binary expansion. Because these intervals do not overlap, the strings $w_1,\ldots,w_N$ are the codewords of a prefix code.
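The sufficiency argument is constructive, so it translates directly into code. Here is a minimal sketch (the function name and interface are mine, not from any library): sort the lengths, assign consecutive dyadic intervals, and read each codeword off the binary expansion of the interval’s left endpoint.

```python
from fractions import Fraction

def kraft_code(lengths: list[int]) -> dict[int, str]:
    """Build a binary prefix code with the given codeword lengths,
    assuming Kraft's inequality sum(2**-l) <= 1 holds."""
    assert sum(Fraction(1, 2**l) for l in lengths) <= 1, "Kraft's inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code, left = {}, Fraction(0)           # left endpoint l_i of the next interval
    for i in order:
        l = lengths[i]
        # left is a multiple of 2**-l (all previous lengths are <= l),
        # so its l-bit binary expansion is exact and becomes the codeword.
        code[i] = format(int(left * 2**l), f"0{l}b")
        left += Fraction(1, 2**l)          # advance to the next interval
    return code

print(kraft_code([1, 2, 3, 3]))   # {0: '0', 1: '10', 2: '110', 3: '111'}
```

For lengths $\{1,2,3,3\}$ the Kraft sum is exactly $1$, and the construction recovers the code “0, 10, 110, 111” from Example 1.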

$\textbf{Lemma 2}$ (Improved). Lemma 1 holds for any uniquely decodable binary code; this extension is known as the Kraft–McMillan inequality.

Shannon’s Source Coding Theorem

The above discussion leads to the following question.

What is the minimum number of bits needed to encode messages so that we can decode them uniquely and directly?

Now, let us state Shannon’s brilliant Source Coding Theorem.

$\textbf{Theorem 1}$. Given a collection of $n$ i.i.d. random variables, each with entropy $H(X)$, we can compress them into $nH(X)$ bits on average with negligible error as $n \to \infty$. Conversely, no uniquely decodable code can compress them into fewer than $nH(X)$ bits without loss of information.

The idea behind Shannon’s source coding theorem is to encode only the typical messages. As discussed above, this is how we save bits. We will show that the total probability of the non-typical strings tends to 0.

Intuitive Proof

Step 1.

We define a random message of length $N$ as $x_1,\ldots,x_N$, each letter of which is drawn independently from an alphabet $\mathcal{A} = \{a_1,\ldots,a_T\}$ with probabilities $p(a_i) = p_i \in (0,1]$, $i = 1,\ldots,T$, where $\sum_i p_i = 1$.

Now, let us look at $y = x_1 \ldots x_N$. The probability that $y$ appears is $p(x_1,\ldots,x_N) = p(x_1)\cdots p(x_N)$.

Step 2.

Next, we consider a long enough string $x$. We let $N_i$ denote the number of occurrences of the letter $a_i$ in $x$; for a long string, $N_i \approx N p_i$. Then, the probability of $x$ is approximately:

$$p(x) \approx p_{typ} = p_1^{N_1} \cdots p_T^{N_T}$$

The typical messages are approximately uniformly distributed, each with probability $p_{typ}$, which indicates that the set $S$ of typical messages has size

$$\lvert S \rvert \approx \frac{1}{p_{typ}}$$

If we encode each member of $S$ by a binary string, we need

$$I_N = \log_2 \lvert S \rvert = -\log_2 p_{typ} = -\sum_{i=1}^T N_i \log_2 p_i \approx -N\sum_{i=1}^T p_i \log_2 p_i = NH(X) \text{ bits.}$$

Therefore, the average number of bits per letter is

$$I = I_N/N = H(X).$$

Therefore, we can see that

$$I_N \approx N H(X)$$

and

$$p_{typ} \approx 2^{-NH(X)}.$$
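As a rough numerical illustration (a sketch only; it reuses the source from Example 1 with probabilities 1/2, 1/4, 1/8, 1/8), we can check that a string with the “right” symbol frequencies indeed has probability close to $2^{-NH(X)}$:

```python
import math

probs = [1/2, 1/4, 1/8, 1/8]
N = 800
counts = [round(N * p) for p in probs]       # N_i ≈ N * p_i for a typical string

H = -sum(p * math.log2(p) for p in probs)    # entropy H(X) = 1.75 bits
log_p_typ = sum(n * math.log2(p) for n, p in zip(counts, probs))

print(f"H(X)       = {H} bits")
print(f"log2 p_typ = {log_p_typ}")           # -1400.0
print(f"-N * H(X)  = {-N * H}")              # -1400.0
```

Here $\log_2 p_{typ}$ and $-N H(X)$ agree, so $p_{typ} \approx 2^{-N H(X)}$ as claimed.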

Formal Proof

The above argument is not rigorous. At the very least: what exactly is a typical message? And why can we treat the typical messages as uniformly distributed?

1. Random Variable:

Given a letter ensemble $S$, a function $f: S \to \mathbb{R}$ defines a discrete random variable. The realizations of $f(S)$ are the real numbers $f(x)$, $x \in S$. The average of $f(S)$ is defined as follows.

$$\overline{f(S)} := \sum_{x\in S} p(x) f(x) = \sum_{i=1}^T p_i f(a_i)$$

2. Typical Set $T_p$ of random variables

Roughly speaking, the typical set consists of the sequences that look like a representative sample of the source. Recall that

$$p(x) \approx p_{typ} = p_1^{N_1} \cdots p_T^{N_T}$$

This says that a string $x$ is typical if the frequency of each symbol in the string is (approximately) the same as the probability of that symbol under the source.

We formalize this property as follows.

Let $x^n$ be a sequence with elements drawn from a finite alphabet $\mathcal{A}$. Let $\pi(x \mid x^n) = \frac{\lvert \{i : x_i = x\} \rvert}{n}$ for all $x \in \mathcal{A}$ be the empirical probability mass function of $x^n$. For example, if $x^n = (0,1,1,1,1,1,1,0)$, then

$$\pi(1 \mid x^n) = 6/8, \qquad \pi(0 \mid x^n) = 2/8.$$

Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with $X_i \sim P_X$.

$\textbf{Lemma}$ (Weak Law of Large Numbers, WLLN). Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with finite expected value $\mu$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any $\epsilon > 0$, we have $$\lim_{n\to\infty} P(\lvert \bar{X}_n - \mu \rvert > \epsilon) = 0.$$

By the WLLN, for each $x \in \mathcal{A}$, we have

$$\pi(x \mid X^n) \to p(x) \quad \text{in probability.}$$

Thus, with high probability, the empirical distribution $\pi(x \mid X^n)$ is close to the true probability $p(x)$. Therefore, for $X \sim p(x)$ and $\epsilon \in (0,1)$, we define the set of $\epsilon$-typical $n$-sequences $x^n$ as

$$T^n_\epsilon(X) = \{x^n : \lvert \pi(x \mid x^n) - p(x) \rvert \leq \epsilon\, p(x) \text{ for all } x \in \mathcal{A}\}.$$
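This definition translates directly into a small check. Below is a minimal sketch (the helper names and the two-letter source are my own illustrative choices) that computes the empirical pmf $\pi(x \mid x^n)$ and tests membership in $T^n_\epsilon(X)$:

```python
from collections import Counter

def empirical_pmf(xn):
    """pi(x | x^n): the empirical frequency of each symbol in the sequence."""
    n = len(xn)
    counts = Counter(xn)
    return {x: counts[x] / n for x in counts}

def is_typical(xn, p, eps):
    """Check |pi(x | x^n) - p(x)| <= eps * p(x) for every letter x."""
    pi = empirical_pmf(xn)
    return all(abs(pi.get(x, 0.0) - px) <= eps * px for x, px in p.items())

p = {0: 0.25, 1: 0.75}
print(is_typical((0, 1, 1, 1, 1, 1, 1, 0), p, eps=0.1))  # True:  pi = (0.25, 0.75)
print(is_typical((0, 0, 0, 0, 1, 1, 1, 1), p, eps=0.1))  # False: pi = (0.50, 0.50)
```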

Then, we can easily obtain the following lemma by applying the above definition.

$\textbf{Typical Set Lemma}$. Let $x^n \in T^n_\epsilon(X)$. Then for any non-negative function $f(x)$, we have $$(1-\epsilon)\,E[f(X)] \leq \frac{1}{n}\sum_{i=1}^n f(x_i) \leq (1+\epsilon)\,E[f(X)].$$

So, we define the typical set $T_p$ of the source $S$ as the set of realizations $\hat{x} := x_1,\ldots,x_N$ such that

$$\overline{f(S)} - \delta \leq \frac{1}{N}\sum_{i=1}^N f(x_i) \leq \overline{f(S)} + \delta$$

In other words, the typical set contains all the strings that are likely to appear!

Then, for $N$ large enough,

$$P(X^n \in T_p) \geq 1-\epsilon$$

Now, we use $P_T$ to denote the probability that a randomly chosen sequence $\hat{x}$ lies in the typical set $T_p$:

$$P_T = \sum_{\hat{x}\in T_p} p(\hat{x}) \geq 1-\epsilon$$

The above can be proved by Chebyshev’s Inequality.

$\textbf{Chebyshev's Inequality}$. Let $X$ be any random variable with mean $\mu = \mathbb{E}[X]$ and finite variance $\mathrm{Var}(X)$. Then, for any real number $\alpha > 0$, we have $$P(\lvert X - \mu\rvert \geq \alpha) \leq \frac{\mathrm{Var}(X)}{\alpha^2}.$$
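To spell out how Chebyshev is applied (a sketch of the standard argument): for a fixed letter $x$, the empirical frequency $\pi(x \mid X^n)$ is an average of $n$ i.i.d. indicator variables with mean $p(x)$ and variance $p(x)(1-p(x))/n$, so taking $\alpha = \epsilon\, p(x)$ gives

$$P\bigl(\lvert \pi(x \mid X^n) - p(x) \rvert > \epsilon\, p(x)\bigr) \leq \frac{p(x)(1-p(x))/n}{\epsilon^2 p(x)^2} = \frac{1-p(x)}{n\,\epsilon^2\, p(x)} \xrightarrow{\;n\to\infty\;} 0.$$

A union bound over the finitely many letters $x \in \mathcal{A}$ then shows $P(X^n \notin T^n_\epsilon(X)) \to 0$, i.e. $P_T \geq 1-\epsilon$ once $n$ is large enough.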

Next, consider the function

$$f(S) := -\log p(S)$$

The average of $f(S)$ is

$$\overline{f(S)} = -\sum_{x\in \mathcal{A}} p(x) \log p(x) = H(S)$$

Replacing $f(x)$ by $-\log p(x)$ in the definition of the typical set, we have $$H(S) - \delta \leq -\frac{1}{N}\sum_{i=1}^N \log p(x_i) \leq H(S) + \delta.$$

Amazing!

The above inequality can be rewritten as follows (writing $H = H(S)$ for short).

$$2^{-N(H+\delta)} \leq p(\hat{x}) \leq 2^{-N(H-\delta)}$$
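To see why, note that $p(\hat{x}) = \prod_{i=1}^N p(x_i)$, so the middle term of the previous inequality is $-\frac{1}{N}\log p(\hat{x})$; multiplying by $-N$ and exponentiating (base 2, since entropy is measured in bits) gives

$$-N(H+\delta) \leq \log p(\hat{x}) \leq -N(H-\delta) \quad\Longrightarrow\quad 2^{-N(H+\delta)} \leq p(\hat{x}) \leq 2^{-N(H-\delta)}.$$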

Then, summing over $\hat{x} \in T_p$,

$$\lvert T_p \rvert\, 2^{-N(H-\delta)} \geq \sum_{\hat{x}\in T_p} p(\hat{x}) \geq \lvert T_p \rvert\, 2^{-N(H+\delta)}$$

Therefore, since $1 \geq P_T \geq 1-\epsilon$,

$$\lvert T_p \rvert\, 2^{-N(H-\delta)} \geq 1-\epsilon \qquad \text{and} \qquad 1 \geq \lvert T_p \rvert\, 2^{-N(H+\delta)}$$

Then,

$$2^{N(H+\delta)} \geq \lvert T_p \rvert \geq (1-\epsilon)\, 2^{N(H-\delta)}$$

When $\epsilon, \delta \to 0$, we obtain that $\lvert T_p \rvert \approx 2^{NH}$.

So, $I_N \approx NH$, matching the intuitive argument above.


Many days ago, I wanted to write a blog post about Shannon’s source coding theorem, and now I have finally done it. I feel the most important thing is the concept of the typical set. It is really a genius idea.

References

[1] Kim Boström. Shannon’s Source Coding Theorem.

[2] C. E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423 and 623–656, 1948.

[3] Abbas El Gamal and Young-Han Kim. Network Information Theory. Cambridge University Press, 2011.