by Xianbin
To encode information, we use bits. In general, if there are \(M\) symbols, we need \(\log_2 M\) bits per symbol. It seems that we cannot do better.
However, if the symbols have some particular properties, we can do better, and things become interesting. What kind of properties can we use so that the encoding uses fewer than \(\log_2 M\) bits per symbol?
Example 1:
Imagine that we have a file with 2000 symbols, each of which is one of the four strings “00, 01, 10, 11”, appearing with probabilities \(1/2,1/4,1/8,1/8\) respectively. (It is easy to learn this distribution, for example by property testing algorithms.) The fixed-length encoding is not optimal: we can use “0, 10, 110, 111” to represent these four symbols, and the number of bits we use is smaller. In short, the main idea is to encode the frequent symbols with fewer bits. We call such a strategy variable-length coding.
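To see the saving concretely, compare the expected number of bits per symbol under the two encodings (a quick calculation from the probabilities above):
\[E[L_{\text{fixed}}] = 2, \qquad E[L_{\text{var}}] = 1\cdot\tfrac{1}{2} + 2\cdot\tfrac{1}{4} + 3\cdot\tfrac{1}{8} + 3\cdot\tfrac{1}{8} = 1.75,\]
so the variable-length code saves \(0.25\) bits per symbol, or about 500 bits over the whole 2000-symbol file.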
Issues:
Of course, there are problems.
- unique decoding.
- instantaneous/prefix codes.
- The first issue is that, if we apply an arbitrary variable-length code, decoding may have more than one result.
- The second issue is that, even when decoding is unique, we may have to read ahead before the first codeword can be decoded. If the code is prefix-free, we can decode each codeword as soon as we finish reading it (see the small example after this list).
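Here is a small Python sketch of both issues (the codebooks are my own illustrative choices): the code \(\{a\mapsto 0,\ b\mapsto 01,\ c\mapsto 10\}\) is not uniquely decodable, while the prefix code from Example 1 can be decoded greedily, one codeword at a time.

```python
def all_decodings(bits, code):
    """Return every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append([symbol] + rest)
    return results

def prefix_decode(bits, code):
    """Greedy left-to-right decoding; works whenever `code` is prefix-free."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a codeword just ended
            decoded.append(inverse[current])
            current = ""
    return decoded

ambiguous = {"a": "0", "b": "01", "c": "10"}             # not uniquely decodable
print(all_decodings("010", ambiguous))                   # [['a', 'c'], ['b', 'a']]

prefix = {"A": "0", "B": "10", "C": "110", "D": "111"}   # prefix code from Example 1
print(prefix_decode("0101100", prefix))                  # ['A', 'B', 'C', 'A']
```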
Useful Inequalities
Kraft’s inequality shows an exact condition for the existence of a prefix code in terms of the lengths of codewords.
\(\textbf{Lemma 1}\) [Kraft’s Inequality]. A binary instantaneous/prefix code with codeword lengths \(\{\ell_i\}_{i=1}^N\) exists iff
\(\sum_{i=1}^N 2^{-\ell_i} \leq 1\)
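As a quick check, the code “0, 10, 110, 111” from Example 1 has lengths \(\{1,2,3,3\}\), and
\[2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 \leq 1,\]
so the inequality holds with equality for this code.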
\(\textbf{Proof}\). We first prove that the condition is necessary. Think of a binary string as a path in a binary tree (0 -> go left; 1 -> go right), or equivalently as a subinterval of \([0,1)\): a codeword \(w\) of length \(\ell\) corresponds to the interval \(I(w) = [0.w,\ 0.w + 2^{-\ell})\), where \(0.w\) is the number in \([0,1)\) whose binary expansion is \(w\). Let \(C\) be a prefix code with codewords \(w_1,\ldots,w_N\) of lengths \(\ell_1,\ldots,\ell_N\), so each codeword \(w_i\) corresponds to an interval \(I(w_i)\). Since \(C\) is a prefix code, no codeword is a prefix of another, and these intervals do not overlap. Therefore,
\[\sum_{i=1}^N \lvert I(w_i) \rvert= \sum_{i=1}^N 2^{-\ell_i} \leq 1\]
Next, we prove that the condition is sufficient. That is, given a set of codeword lengths \(\{\ell_i\}\) satisfying the inequality, we can construct a prefix code. To do this, we only need to give an algorithm. After sorting (and relabeling), we may assume \(\ell_1 \leq \cdots \leq \ell_N\). We assign intervals as follows.
\[[l_i,r_i) = \left[ \sum_{j=1}^{i-1} 2^{-\ell_j}, \sum_{j=1}^i 2^{-\ell_j}\right)\]
where \(i\in[N]\).
(Notice that the length of \([l_i,r_i)\) is exactly \(2^{-\ell_i}\), and by the assumed inequality the rightmost endpoint \(\sum_{j=1}^N 2^{-\ell_j} \leq 1\), so all the intervals fit inside \([0,1)\).)
Since \(\ell_1 \leq \cdots \leq \ell_{i-1} \leq \ell_i\), the left endpoint \(l_i\) is an integer multiple of \(2^{-\ell_i}\), so \([l_i,r_i) = I(w_i)\) for the binary string \(w_i\) of length \(\ell_i\) given by the first \(\ell_i\) bits of the binary expansion of \(l_i\). Because these intervals do not overlap, the binary strings \(w_1,\ldots,w_N\) are the codewords of a prefix code.
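The sufficiency construction is easy to turn into code. Below is a minimal Python sketch (the function name is my own) that, given codeword lengths satisfying Kraft’s inequality, reads the codewords off the interval endpoints \(l_i\):

```python
from fractions import Fraction

def kraft_code(lengths):
    """Build a binary prefix code from codeword lengths satisfying Kraft's inequality.

    Interval construction: sort the lengths, let the left endpoint be the running
    sum of 2^{-len_j} over the previous codewords, and read off the first len_i
    bits of that endpoint's binary expansion as the i-th codeword.
    """
    if sum(Fraction(1, 2**l) for l in lengths) > 1:
        raise ValueError("Kraft's inequality is violated; no prefix code exists.")
    codewords = {}
    left = Fraction(0)                     # left endpoint of the current interval
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        # `left` is a multiple of 2^{-l}, so left * 2^l is an integer.
        codewords[i] = format(int(left * 2**l), f"0{l}b")
        left += Fraction(1, 2**l)
    return [codewords[i] for i in range(len(lengths))]

print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```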
\(\textbf{Lemma 2}\) (Improved; also known as the Kraft-McMillan inequality). Lemma 1 holds even for uniquely decodable binary codes.
Shannon’s Source Coding Theorem
The above discussion leads to the following question.
What is the minimum number of bits to encode words such that we can decode them uniquely and directly?
Now, let us state Shannon’s brilliant Source Coding Theorem.
\(\textbf{Theorem 1}\). Given a collection of \(n\) i.i.d. random variables, each of which has entropy \(H(X)\), we can compress these variables into \(n H(X)\) bits on average with negligible error as \(n\to \infty\). Conversely, no uniquely decodable code can compress them into fewer than \(nH(X)\) bits without loss of information.
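For intuition, the entropy of the source in Example 1 is
\[H(X) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{4}\log_2 4 + \tfrac{1}{8}\log_2 8 + \tfrac{1}{8}\log_2 8 = 1.75 \text{ bits},\]
which equals the expected length of the code “0, 10, 110, 111” computed earlier, so that code is in fact optimal for this source.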
The idea behind Shannon’s source coding theorem is to encode only the typical messages. As discussed above, this is how we save bits. We will show that the total probability of the non-typical strings tends to 0.
Intuitive Proof
Step 1.
We define a random message of length \(N\) as \(x_1,\ldots,x_N\), each letter of which is drawn from an alphabet \(\mathcal{A} = \{a_1,\ldots,a_T\}\) with probabilities
\(p(a_i) = p_i \in(0,1], i = 1,\dots, T\)
where \(\sum p_i = 1\).
Now, let us look at \(y = x_1\cdots x_N\). By independence, the probability that \(y\) appears is \(p(x_1,\ldots,x_N) = p(x_1)\cdots p(x_N)\).
Step 2.
Next, we consider a long enough string \(x\). Let \(N_i\) denote the number of occurrences of \(a_i\) in \(x\); for large \(N\) we expect \(N_i \approx N p_i\). Then, the probability of such an \(x\) is approximately:
\[p(x) \approx p_{typ} = p_1^{N_1}\cdots p_T^{N_T}\]
The typical messages all have roughly the same probability \(p_{typ}\), i.e., they are approximately uniformly distributed, which indicates that the set \(S\) of typical messages has size
\[\lvert S \rvert\approx\frac{1}{p_{typ}}\]
If we encode each member of \(S\) by a binary string, we need
\[I_N = \log_2 \lvert S\rvert \approx -\log_2 p_{typ} = -\sum_{i=1}^T N_i \log_2 p_i \approx -N\sum_{i=1}^T p_i \log_2 p_i = NH(X)\]
bits.
Therefore, the average number of bits per letter is
\[I = I_N/N = H(X).\]
In summary, we can see that
\[I_N \approx N H(X)\]
and
\[p_{typ} \approx 2^{-NH(X)}.\]
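The relation \(p_{typ} \approx 2^{-NH(X)}\) is easy to check numerically. Here is a small Python sketch (my own illustration, using the distribution from Example 1) that samples messages and shows that \(-\frac{1}{N}\log_2 p(x)\) concentrates around \(H(X) = 1.75\):

```python
import math
import random

# Source distribution from Example 1 (symbols labeled 0..3).
probs = [1/2, 1/4, 1/8, 1/8]
H = -sum(p * math.log2(p) for p in probs)   # 1.75 bits per letter

N = 10_000        # message length
trials = 5        # number of sampled messages

random.seed(0)
for _ in range(trials):
    message = random.choices(range(4), weights=probs, k=N)
    # -(1/N) * log2 p(x): the "bits per letter" of this particular message
    bits_per_letter = -sum(math.log2(probs[s]) for s in message) / N
    print(f"-log2 p(x)/N = {bits_per_letter:.4f}   (H = {H:.4f})")
```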
Of course, the intuitive argument above is not rigorous. At the very least: what exactly is a typical message, and why can we treat the typical messages as uniformly distributed?
1. Random Variable:
Given a letter ensemble \(S\), a function \(f: S\to \mathbb{R}\) defines a discrete random variable. The realizations of \(f(S)\) are the real numbers \(f(x)\), \(x\in S\). The average of \(f(S)\) is defined as follows.
\[\overline {f(S)}:= \sum_{x\in S}p(x)f(x) = \sum_{i=1}^T p_i f(a_i)\]
2. Typical Set \(T_p\) of random variables
Roughly speaking, the typical set consists of the sequences that look like representative samples of the source. Recall that
\[p(x) \approx p_{typ} = p_1^{N_1}\cdots p_T^{N_T}\]
It says that a string \(x\) is typical if the frequency of each symbol in the string is the same as the probability of the symbol in the source.
We formally show the above property as follows.
Let \(x^n\) be a sequence with elements drawn from a finite alphabet \(\mathcal{A}\).
Let \(\pi(x \mid x^n) = \frac{\lvert \{i: x_i = x \}\rvert}{n}\) for all \(x\in \mathcal{A}\) be the empirical probability mass function of \(x^n\).
For example, if \(x^n = (0,1,1,1,1,1,1,0)\), then
\[\pi(1 \mid x^n) = 6/8, \qquad \pi(0 \mid x^n) = 2/8.\]
Let \(X_1, X_2, \ldots\) be a sequence of i.i.d. random variables with \(X_i \sim p_X(x)\) (think of repeated independent experiments).
\(\textbf{Lemma}\) (Weak Law of Large Numbers (WLLN)). Let \(X_1,X_2, \ldots,X_N\) be i.i.d. random variables with a finite expected value \(\mu\), and let \(\bar X_N = \frac{1}{N}\sum_{i=1}^N X_i\). Then for any \(\epsilon > 0\), we have
\(\lim_{N\to +\infty} P(\lvert \bar X_N - \mu \rvert > \epsilon) = 0\)
By the WLLN, for each \(x\in \mathcal{A}\), we have
\[\pi(x \mid X^n) \to p(x) \quad \text{in probability as } n\to\infty.\]
Thus, with high probability, the empirical frequency \(\pi(x \mid X^n)\) is close to the true probability \(p(x)\). Therefore, for \(X \sim p(x)\) and \(\epsilon \in (0,1)\), we define the \(\epsilon\)-typical \(n\)-sequences \(x^n\) as
\(T^n_\epsilon(X) = \{x^n:\lvert \pi(x \mid x^n) -p(x) \rvert \leq \epsilon p(x) \}\)
for all \(x\in \mathcal{A}\).
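As a concrete illustration (my own sketch, not part of the original argument), the following Python function checks whether a sequence is \(\epsilon\)-typical by comparing its empirical pmf with \(p\):

```python
from collections import Counter

def is_typical(xn, p, eps):
    """Check whether the sequence xn is eps-typical for the pmf p.

    p maps each symbol of the alphabet to its probability; the condition is
    |pi(x | xn) - p(x)| <= eps * p(x) for every symbol x in the alphabet.
    """
    n = len(xn)
    counts = Counter(xn)
    return all(abs(counts[x] / n - p[x]) <= eps * p[x] for x in p)

p = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}   # distribution from Example 1
print(is_typical("aaaabbcd", p, eps=0.1))      # True: frequencies 4/8, 2/8, 1/8, 1/8
print(is_typical("aaaaaaaa", p, eps=0.1))      # False: "a" is over-represented
```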
Then, we can easily obtain the following lemma by applying the above definition.
\(\textbf{Typical Set Lemma}\). Let \(x^n \in T^n_\epsilon(X)\). Then for any non-negative function \(f(x)\), we have \((1-\epsilon)E[f(X)] \leq \frac{1}{n} \sum_{i=1}^n f(x_i) \leq (1+\epsilon)E[f(X)]\).
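To see why, note that the sample average of \(f\) can be rewritten in terms of the empirical pmf:
\[\frac{1}{n}\sum_{i=1}^n f(x_i) = \sum_{x\in\mathcal{A}} \pi(x \mid x^n)\, f(x),\]
and for \(x^n \in T^n_\epsilon(X)\) we have \((1-\epsilon)p(x) \leq \pi(x\mid x^n) \leq (1+\epsilon)p(x)\) for every \(x\). Multiplying by \(f(x)\geq 0\) and summing over \(x\) gives both bounds, since \(\sum_{x} p(x) f(x) = E[f(X)]\).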
So, we define the typical set \(T_p\) of a random sequence \(S\) as the set of realizations \(\hat x := x_1\cdots x_N\) such that
\[\overline {f(S)} - \delta \leq \frac{1}{N}\sum_{i=1}^N f(x_i) \leq \overline {f(S)} + \delta\]
This means that the typical set contains essentially all the strings that are likely to appear!
Then, for sufficiently large \(N\),
\[P( X^n \in T_p) \geq 1-\epsilon\]
Now, we use \(P_T\) to represent the probability for a randomly chosen sequence \(\hat x\) to be in the typical set \(T_p\).
\[P_T = \sum_{\hat x\in T_p} p(\hat x) \geq 1-\epsilon\]
The above can be proved by Chebyshev’s Inequality.
\(\textbf{Chebyshev's Inequality}\). Let \(X\) be any random variable with \(\mu = \mathbb{E}[X]\) and finite variance \(Var(X)\). Then, for any \(\alpha > 0\), we have
\(P(\lvert X - \mu\rvert \geq \alpha) \leq \frac{Var(X)}{\alpha^2}\)
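Here is the omitted step in a bit more detail (a sketch): apply Chebyshev’s inequality to the sample average \(\bar F_N = \frac{1}{N}\sum_{i=1}^N f(x_i)\), which has mean \(\overline{f(S)}\) and variance \(Var(f)/N\) by independence. Taking \(\alpha = \delta\),
\[P\left( \left\lvert \bar F_N - \overline{f(S)} \right\rvert \geq \delta \right) \leq \frac{Var(f)}{N\delta^2},\]
which is at most \(\epsilon\) once \(N \geq Var(f)/(\epsilon\delta^2)\); this gives \(P_T \geq 1-\epsilon\).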
Next, consider the function
\[f(S): = -\log_2 p(S)\]
The average of \(f(S)\) is
\[\overline {f(S)} = -\sum_{x\in \mathcal{A}}p(x) \log_2 p(x) = H(S)\]
Applying the definition of the typical set with \(f(x) = -\log_2 p(x)\), we have
\(H(S) - \delta \leq - \frac{1}{N}\sum_{i=1}^N \log_2 p(x_i) \leq H(S) +\delta\)
Amazing!
The above inequality can be transformed as follows (writing \(H = H(S)\) for short): since \(p(\hat x) = \prod_{i=1}^N p(x_i)\), multiplying by \(-N\) and exponentiating gives
\[2^{-N(H+\delta)} \leq p(\hat x) \leq 2^{-N(H-\delta)}\]
Then,
\[\lvert T_p \rvert 2^{-N(H-\delta)} \geq \sum_{\hat x\in T_p} p(\hat x) \geq \lvert T_p \rvert 2^{-N(H+\delta)}\]
Combining this with \(1 \geq P_T \geq 1-\epsilon\), we obtain
\[\lvert T_p \rvert 2^{-N(H-\delta)} \geq 1-\epsilon \quad\text{and}\quad \lvert T_p \rvert 2^{-N(H+\delta)} \leq 1\]
Then,
\[2^{N(H+\delta)} \geq \lvert T_p \rvert \geq (1-\epsilon) 2^{N(H-\delta)}\]
When \(\epsilon, \delta \to 0\) (with \(N\) large enough), we obtain
\(\lvert T_p \rvert \to 2^{NH}\)
So, \(I_N = \log_2 \lvert T_p \rvert \to NH\), which is exactly what the intuitive argument claimed.
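To make the estimate \(\lvert T_p \rvert \approx 2^{NH}\) tangible, here is a small Python check (my own example, using a binary source with \(p(1)=0.3\)). It counts the \(\epsilon\)-typical binary sequences of length \(N\) exactly via binomial coefficients; as \(\epsilon\) shrinks, the per-letter exponent \(\frac{1}{N}\log_2 \lvert T_p\rvert\) approaches \(H\).

```python
from math import comb, log2

# A toy binary source: P(1) = p, P(0) = 1 - p.
p = 0.3
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # entropy, about 0.8813 bits per letter

N = 2000   # block length
for eps in (0.2, 0.1, 0.05, 0.02):
    # A length-N string with k ones is eps-typical iff the empirical frequencies
    # of both symbols are within a factor (1 +/- eps) of p and 1 - p.
    size = sum(
        comb(N, k)
        for k in range(N + 1)
        if abs(k / N - p) <= eps * p and abs((N - k) / N - (1 - p)) <= eps * (1 - p)
    )
    print(f"eps = {eps:.2f}:  (1/N) log2 |T_p| = {log2(size) / N:.4f}   (H = {H:.4f})")
```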
Many days ago, I wanted to write a blog post about Shannon’s source coding theorem. Now, I have finally done it. I feel the most important thing is the concept of the typical set. It is really a genius concept.