T1: Coding Theory (1)
To encode information, we use bits. In general, if there are $N$ distinct symbols, we need $\lceil \log_2 N \rceil$ bits per symbol. It seems that we cannot do better.
However, if the symbols have some particular properties, we can do better. Now things become interesting. What kind of properties can we exploit so that the encoding uses fewer than $\log_2 N$ bits per symbol?
Example 1:
Imagine that we have a file containing 2000 symbols drawn from a four-symbol alphabet encoded as “00, 01, 10, 11”, where the four symbols occur with unequal probabilities, say $1/2, 1/4, 1/8, 1/8$ respectively. (It is easy to learn this distribution, e.g., by counting frequencies or by property testing algorithms.) The fixed-length encoding with 2 bits per symbol is not optimal. We can instead use “0, 10, 110, 111” to represent these four symbols, and the expected number of bits per symbol drops from $2$ to $\tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75$. In short, the main idea is to encode the frequent symbols with fewer bits. We call such a strategy variable-length coding.
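As a quick sanity check, here is a minimal Python sketch, assuming the illustrative probabilities $1/2, 1/4, 1/8, 1/8$ used above (they are not part of the original example), comparing the expected cost of the fixed-length and variable-length codes:

```python
# Compare fixed-length vs. variable-length coding for a 4-symbol source.
# The probabilities below are illustrative assumptions (1/2, 1/4, 1/8, 1/8).
probs = [0.5, 0.25, 0.125, 0.125]

fixed_code = ["00", "01", "10", "11"]      # 2 bits per symbol, always
var_code   = ["0", "10", "110", "111"]     # shorter codewords for frequent symbols

fixed_cost = sum(p * len(c) for p, c in zip(probs, fixed_code))
var_cost   = sum(p * len(c) for p, c in zip(probs, var_code))

print(f"fixed-length: {fixed_cost:.2f} bits/symbol")    # 2.00
print(f"variable-length: {var_cost:.2f} bits/symbol")   # 1.75
print(f"for a 2000-symbol file: {2000*fixed_cost:.0f} vs {2000*var_cost:.0f} bits")
```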
Issues:
Of course, there are problems.
- Unique decodability: if we apply an arbitrary variable-length code, the decoding of a bit string may have more than one result.
- Instantaneous/prefix codes: even when the decoding is unique, we may have to read ahead before we can decode a codeword. If the code is prefix-free, we can decode each codeword directly, as soon as its last bit arrives (see the sketch after this list).
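Here is a minimal Python sketch (with toy codebooks chosen just for illustration) showing both issues: the code {0, 01, 10} is not uniquely decodable because the string “010” parses in two ways, while the prefix code {0, 10, 110, 111} from Example 1 decodes in exactly one way.

```python
# Toy demonstration: ambiguity of a non-prefix code vs. a prefix code.

def all_parses(bits, codebook, prefix=()):
    """Return every way to split `bits` into codewords from `codebook`."""
    if not bits:
        return [prefix]
    parses = []
    for symbol, word in codebook.items():
        if bits.startswith(word):
            parses += all_parses(bits[len(word):], codebook, prefix + (symbol,))
    return parses

ambiguous = {"a": "0", "b": "01", "c": "10"}               # not uniquely decodable
prefix    = {"a": "0", "b": "10", "c": "110", "d": "111"}  # prefix-free

print(all_parses("010", ambiguous))   # two decodings: [('a', 'c'), ('b', 'a')]
print(all_parses("0101100", prefix))  # unique decoding: [('a', 'b', 'c', 'a')]
```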
Useful Inequalities
Kraft’s inequality shows an exact condition for the existence of a prefix code in terms of the lengths of codewords.
[Kraft’s Inequality]. A binary instantaneous/prefix code with codeword lengths $\ell_1, \ell_2, \dots, \ell_n$ exists iff
$$\sum_{i=1}^{n} 2^{-\ell_i} \le 1.$$
We first prove that this condition is necessary. Imagine that we expand the binary prefix code in the form of a binary tree: reading a codeword such as “000111” bit by bit, 0 means “go left” and 1 means “go right”, so every codeword is a node of the binary tree we build. Equivalently, let $C$ be a prefix code with codewords $c_1, \dots, c_n$ of lengths $\ell_1, \dots, \ell_n$, and interpret each $c_i$ as a binary fraction $0.c_i \in [0,1)$. Each codeword then corresponds to an interval
$$I_i = \big[\, 0.c_i,\; 0.c_i + 2^{-\ell_i} \,\big) \subseteq [0,1).$$
Since it is a prefix code, these intervals will not overlap, so their total length satisfies $\sum_{i=1}^{n} 2^{-\ell_i} \le 1$.
Next, we prove that this condition is sufficient. That is to say, given a set of codeword lengths $\ell_1, \dots, \ell_n$ satisfying $\sum_{i=1}^{n} 2^{-\ell_i} \le 1$, we can construct a prefix code. To do this, we only need to give an algorithm. After sorting, we may assume $\ell_1 \le \ell_2 \le \dots \le \ell_n$ (relabeled). We divide $[0,1)$ into intervals as follows:
$$I_i = \big[\, s_i,\; s_i + 2^{-\ell_i} \,\big), \qquad \text{where } s_i = \sum_{j=1}^{i-1} 2^{-\ell_j}.$$
(Notice that, because $\ell_j \le \ell_i$ for $j < i$, the left endpoint $s_i$ is an integer multiple of $2^{-\ell_i}$, so its binary expansion has at most $\ell_i$ bits after the point. By the condition $\sum_i 2^{-\ell_i} \le 1$, all of these intervals fit inside $[0,1)$.)
Then, take the codeword $c_i$ to be the binary expansion of $s_i$ written with exactly $\ell_i$ bits. Because these intervals do not overlap, no codeword is a prefix of another, so the binary strings $c_1, \dots, c_n$ are the codewords of a prefix code.
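Here is a minimal Python sketch of this construction (my own illustration of the argument above), assuming the input lengths satisfy Kraft’s inequality: it sorts the lengths and reads off each codeword from the binary expansion of the running sum.

```python
from fractions import Fraction

def kraft_prefix_code(lengths):
    """Construct a prefix code with the given codeword lengths,
    assuming sum(2**-l) <= 1 (Kraft's inequality)."""
    assert sum(Fraction(1, 2**l) for l in lengths) <= 1, "Kraft's inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    s = Fraction(0)                      # running sum s_i = sum_{j<i} 2^{-l_j}
    for i in order:
        l = lengths[i]
        # first l bits of the binary expansion of s (s is a multiple of 2^{-l})
        code[i] = format(int(s * 2**l), f"0{l}b")
        s += Fraction(1, 2**l)
    return code

print(kraft_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```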
(Improved) Kraft’s inequality also holds for any uniquely decodable binary code; this stronger statement is McMillan’s inequality. In other words, restricting ourselves to prefix codes costs nothing in terms of achievable codeword lengths.
Shannon’s Source Coding Theorem
The above discussion leads to the following question.
What is the minimum number of bits needed to encode messages such that we can decode them uniquely and directly?
Now, let us state the brilliant Shannon’s Source Coding Theorem.
Given a collection of $n$ i.i.d. random variables, each of which has entropy $H(X)$, we can compress them into $nH(X)$ bits on average with negligible probability of error as $n \to \infty$. Conversely, no uniquely decodable code can compress them into fewer than $nH(X)$ bits without loss of information.
The idea behind Shannon’s source coding theorem is to encode only the typical messages. As discussed above, we can save bits by encoding only these. We will show that the total probability of the non-typical strings tends to 0.
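For concreteness, a short worked computation (using the illustrative probabilities $1/2, 1/4, 1/8, 1/8$ assumed in Example 1, which are not part of the original statement) shows that the variable-length code there exactly matches this entropy bound:
$$H(X) = -\sum_{i} p_i \log_2 p_i = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 = 1.75 \text{ bits per symbol},$$
which equals the expected length of the code $\{0, 10, 110, 111\}$.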
Intuitive Proof
Step 1.
We define a random message $x = x_1 x_2 \cdots x_n$ of length $n$, each letter of which is drawn independently from an alphabet $\mathcal{A} = \{a_1, \dots, a_k\}$ with probabilities $p_1, \dots, p_k$, where $\sum_{i=1}^{k} p_i = 1$.
Now, let us look at a particular message $x$. The probability that $x$ appears is
$$p(x) = \prod_{i=1}^{n} p(x_i).$$
Step 2.
Next, we consider a long enough string $x$. We let $n_i$ denote the number of occurrences of the letter $a_i$ in $x$, so that the frequency satisfies $n_i/n \approx p_i$. Then, the probability of such a string $x$ is as follows:
$$p(x) = \prod_{i=1}^{k} p_i^{\,n_i} \approx \prod_{i=1}^{k} p_i^{\,n p_i} = 2^{\,n \sum_{i} p_i \log_2 p_i} = 2^{-nH(X)},$$
where $H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i$ is the entropy of one letter.
The typical messages are thus (approximately) uniformly distributed, each occurring with probability $\approx 2^{-nH(X)}$, which indicates that the typical set $T$ has
$$|T| \approx 2^{nH(X)}$$
typical messages. If we encode each member of $T$ by a binary string of fixed length, we need
$$\log_2 |T| \approx nH(X)$$
bits.
Therefore, the average number of bits per letter is
$$\frac{1}{n}\log_2 |T| \approx H(X).$$
Therefore, we can see that
$$\lim_{n \to \infty} \frac{\text{number of bits used}}{n} = H(X)$$
and
$$\lim_{n \to \infty} \Pr[\,x \text{ is not typical}\,] = 0.$$
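This near-uniformity can be checked numerically. Below is a minimal Python sketch, assuming a toy two-letter source with $p = (0.9, 0.1)$ (an illustrative choice, not from the text): it samples messages and prints the range of $-\frac{1}{n}\log_2 p(x)$, which crowds around $H(X)$ as $n$ grows, i.e., the observed messages all have probability close to $2^{-nH(X)}$.

```python
import math, random

random.seed(0)
p = [0.9, 0.1]                                   # assumed toy source
H = -sum(q * math.log2(q) for q in p)            # entropy, about 0.469 bits/letter

def sampled_rates(n, trials=1000):
    """Values of -(1/n) log2 p(x) for randomly sampled messages x of length n."""
    rates = []
    for _ in range(trials):
        x = random.choices(range(len(p)), weights=p, k=n)
        rates.append(-sum(math.log2(p[a]) for a in x) / n)
    return rates

for n in (10, 100, 1000):
    r = sampled_rates(n)
    print(f"n={n:5d}: -1/n log2 p(x) in [{min(r):.3f}, {max(r):.3f}], H = {H:.3f}")
```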
Formal Proof
The above argument is not rigorous. At the very least: what exactly is a typical message, and why may we treat the typical messages as uniformly distributed?
1. Random Variable:
Given a letter ensemble $(\mathcal{A}, p)$ with alphabet $\mathcal{A} = \{a_1, \dots, a_k\}$ and probabilities $p_1, \dots, p_k$, a function $f : \mathcal{A} \to \mathbb{R}$ defines a discrete random variable. The realizations are the real numbers $f(a_1), \dots, f(a_k)$. The average of $f$ is defined as follows:
$$\mathbb{E}[f] = \sum_{i=1}^{k} p_i \, f(a_i).$$
2. Typical Set of random variables
Roughly speaking, the typical set is a good sample of the source. Recall the intuition from the previous section: a string is typical if the frequency of each symbol in the string is (approximately) the same as the probability of that symbol in the source.
We now formalize this property as follows.
Let $x = (x_1, x_2, \dots, x_n)$ be a sequence with elements drawn from a finite alphabet $\mathcal{A}$. Let
$$\pi(a \mid x) = \frac{|\{i : x_i = a\}|}{n} \qquad \text{for all } a \in \mathcal{A}$$
be the empirical probability mass function of $x$. For example, if $\mathcal{A} = \{0, 1\}$ and $x = (0, 1, 1, 0, 1)$, then $\pi(0 \mid x) = 2/5$ and $\pi(1 \mid x) = 3/5$.
Let $X_1, X_2, \dots, X_n$ be a sequence of i.i.d. random variables with $X_i \sim p(x)$ (that is, we draw $n$ independent samples from the source).
(Weak Law of Large Numbers (WLLN)). Let $X_1, X_2, \dots$ be i.i.d. random variables with a finite expected value $\mu = \mathbb{E}[X_i]$. Then for any $\epsilon > 0$, we have
$$\lim_{n \to \infty} \Pr\!\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| > \epsilon \right] = 0.$$
By the WLLN, for each $a \in \mathcal{A}$, we have
$$\pi(a \mid X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i = a\} \;\longrightarrow\; p(a) \quad \text{in probability}.$$
Thus, with high probability, we can use the empirical frequency $\pi(a \mid x)$ in place of the true probability $p(a)$. Therefore, for $X \sim p(x)$ and $\epsilon \in (0,1)$, we define the set of $\epsilon$-typical $n$-sequences as
$$\mathcal{T}_{\epsilon}^{(n)}(X) = \big\{\, x : |\pi(a \mid x) - p(a)| \le \epsilon\, p(a) \text{ for all } a \in \mathcal{A} \,\big\}.$$
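As an illustration, assuming the same toy two-letter source $p = (0.9, 0.1)$ as before, the membership test for $\mathcal{T}_\epsilon^{(n)}(X)$ takes only a few lines of Python, and the fraction of $\epsilon$-typical samples tends to 1 as $n$ grows:

```python
import random
from collections import Counter

random.seed(1)
p = {0: 0.9, 1: 0.1}      # assumed toy source

def is_typical(x, p, eps):
    """Check x in T_eps^(n)(X): |pi(a|x) - p(a)| <= eps * p(a) for every letter a."""
    n = len(x)
    counts = Counter(x)
    return all(abs(counts[a] / n - p[a]) <= eps * p[a] for a in p)

eps = 0.1
for n in (100, 1000, 10000):
    samples = [random.choices(list(p), weights=list(p.values()), k=n) for _ in range(500)]
    frac = sum(is_typical(x, p, eps) for x in samples) / len(samples)
    print(f"n={n:6d}: fraction of eps-typical samples = {frac:.2f}")   # tends to 1
```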
Then, we can easily obtain the following lemma by applying the above definition.
(Typical Average Lemma). Let $x \in \mathcal{T}_\epsilon^{(n)}(X)$. Then for any non-negative function $g(\cdot)$ on $\mathcal{A}$, we have
$$(1 - \epsilon)\,\mathbb{E}[g(X)] \;\le\; \frac{1}{n}\sum_{i=1}^{n} g(x_i) \;\le\; (1 + \epsilon)\,\mathbb{E}[g(X)].$$
So, we define the typical set $\mathcal{T}_\delta$ of a random sequence $X = (X_1, \dots, X_n)$ as the set of realizations $x$ such that
$$\left|\, -\frac{1}{n}\log_2 p(x) - H(X) \,\right| \le \delta, \qquad \text{equivalently} \qquad 2^{-n(H(X)+\delta)} \le p(x) \le 2^{-n(H(X)-\delta)}.$$
(Indeed, applying the lemma above with the non-negative function $g(a) = -\log_2 p(a)$, whose average is $H(X)$, shows that every $\epsilon$-typical sequence satisfies this condition with $\delta = \epsilon H(X)$.)
It means that the typical set contains all the strings that are very likely to show up!!!!
Then, we claim that
$$\lim_{n \to \infty} \Pr\big[X \in \mathcal{T}_\delta\big] = 1.$$
Now, we use $P(\mathcal{T}_\delta)$ to represent the probability for a randomly chosen sequence to be in the typical set $\mathcal{T}_\delta$.
The above claim can be proved by Chebyshev’s Inequality.
(Chebyshev’s Inequality). Let $Y$ be any random variable with expected value $\mathbb{E}[Y] = \mu$ and finite variance $\sigma^2$. Then, for any real number $\epsilon > 0$, we have
$$\Pr\big[\,|Y - \mu| \ge \epsilon\,\big] \le \frac{\sigma^2}{\epsilon^2}.$$
Next, consider the function
$$f(a) = -\log_2 p(a), \qquad a \in \mathcal{A}.$$
The average of $f$ is
$$\mathbb{E}[f(X_i)] = -\sum_{a \in \mathcal{A}} p(a)\log_2 p(a) = H(X);$$
let $\sigma^2$ denote its (finite) variance. Replacing $Y$ in Chebyshev’s inequality by the sample average
$$\frac{1}{n}\sum_{i=1}^{n} f(X_i) = -\frac{1}{n}\log_2 p(X_1, \dots, X_n),$$
whose mean is $H(X)$ and whose variance is $\sigma^2 / n$, and taking $\epsilon = \delta$, we have
$$\Pr\!\left[\,\left|-\frac{1}{n}\log_2 p(X) - H(X)\right| \ge \delta \right] \le \frac{\sigma^2}{n\delta^2}.$$
Amazing!!!!!!
The above inequality can be transformed as follows ($P_\delta := P(\mathcal{T}_\delta)$ in short):
$$P_\delta \ge 1 - \frac{\sigma^2}{n\delta^2}.$$
Then, by the definition of the typical set, every $x \in \mathcal{T}_\delta$ satisfies
$$2^{-n(H(X)+\delta)} \le p(x) \le 2^{-n(H(X)-\delta)}.$$
Therefore, summing over the typical set,
$$P_\delta = \sum_{x \in \mathcal{T}_\delta} p(x) \quad\Longrightarrow\quad |\mathcal{T}_\delta|\, 2^{-n(H(X)+\delta)} \;\le\; P_\delta \;\le\; |\mathcal{T}_\delta|\, 2^{-n(H(X)-\delta)}.$$
Then, combining this with $P_\delta \le 1$ and $P_\delta \ge 1 - \frac{\sigma^2}{n\delta^2}$,
$$\Big(1 - \frac{\sigma^2}{n\delta^2}\Big)\, 2^{n(H(X)-\delta)} \;\le\; |\mathcal{T}_\delta| \;\le\; 2^{n(H(X)+\delta)}.$$
When $n \to \infty$, we obtain that
$$P_\delta \to 1 \qquad \text{and} \qquad 2^{n(H(X)-\delta)} \lesssim |\mathcal{T}_\delta| \le 2^{n(H(X)+\delta)}.$$
So, encoding only the members of $\mathcal{T}_\delta$ with fixed-length binary strings requires about $\log_2 |\mathcal{T}_\delta| \approx nH(X)$ bits, i.e., roughly $H(X)$ bits per letter, which is exactly the claim of Shannon’s source coding theorem.
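To make the bound $P_\delta \ge 1 - \sigma^2/(n\delta^2)$ concrete, here is a minimal Python sketch (again assuming the toy source $p = (0.9, 0.1)$) comparing the Chebyshev lower bound with a Monte Carlo estimate of $P_\delta$:

```python
import math, random

random.seed(2)
p = [0.9, 0.1]                                        # assumed toy source
H = -sum(q * math.log2(q) for q in p)                 # entropy of one letter
var = sum(q * (-math.log2(q) - H) ** 2 for q in p)    # variance of f(X) = -log2 p(X)

def p_typical(n, delta, trials=1000):
    """Monte Carlo estimate of P_delta = Pr[ |-(1/n) log2 p(X) - H| <= delta ]."""
    hits = 0
    for _ in range(trials):
        x = random.choices(range(len(p)), weights=p, k=n)
        rate = -sum(math.log2(p[a]) for a in x) / n
        hits += abs(rate - H) <= delta
    return hits / trials

delta = 0.05
for n in (100, 1000, 10000):
    bound = max(0.0, 1 - var / (n * delta ** 2))
    print(f"n={n:6d}: Chebyshev bound {bound:.3f} <= estimated P_delta {p_typical(n, delta):.3f}")
```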
Many days ago, I wanted to write a blog post about Shannon’s source coding theorem. Now, I have finally done it. I feel the most important thing is the concept of the typical set. It is really a stroke of genius.
References
[1]. Kim Boström. Shannon’s Source Coding Theorem.
[2]. C. E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27, 379–423, 623–656, 1948.
[3]. Abbas El Gamal and Young-Han Kim. Network Information Theory. Cambridge University Press, 2011.