Elements of Information Theory Solution | Most Complete Solution Manual for Elements of Information Theory 2nd Edition Thomas M. Cover.wmv (Quick Answers)

Are you looking for the topic "elements of information theory solution – Most Complete Solution manual for Elements of Information Theory 2nd Edition Thomas M. Cover.wmv"? The website https://ro.taphoamini.com (category: ro.taphoamini.com/wiki) answers this question below. The article written by examgo has 3,154 views and no likes.


Watch a video on the topic elements of information theory solution

Watch the video on this topic here. Look it over carefully and give feedback on what you read!

See the details of "Most Complete Solution manual for Elements of Information Theory 2nd Edition Thomas M. Cover.wmv – elements of information theory solution" here:

Most Complete Solution manual for Elements of Information Theory 2nd Edition. ISBN-10: 0471241954; ISBN-13: 978-0471241959. Thomas M. Cover, Joy A. Thomas.

www.examgo.blogspot.com
contact us: [email protected]

See here for more details on the topic elements of information theory solution.

Elements of Information Theory Second Edition Solutions to …

Here we have the solutions to all the problems in the second edition of Elements of Information Theory. First a word about how the problems and solutions …


Source: cpb-us-w2.wpmucdn.com

Date Published: 5/21/2022

View: 803

Elements of Information Theory Second Edition Solutions to …

Here we have the solutions to all the problems in the second edition of Elements of Information Theory. First a word about how the problems and solutions …


Source: fuentes.inta.gatech.edu

Date Published: 8/30/2021

View: 3993

[397 p. COMPLETE SOLUTIONS] Elements of Information …

Preface Here we have the solutions to all the problems in the second edition of Elements of Information Theory. First a word about how the problems and …


Source: pdfcoffee.com

Date Published: 7/22/2021

View: 1498

[397 p. COMPLETE SOLUTIONS] Elements … – DOKUMEN.TIPS

Here we have the solutions to all the problems in the second edition of Elements of Information Theory. First a word about how the problems and solutions …


Source: dokumen.tips

Date Published: 12/30/2022

View: 3154

ECE 534: Elements of Information Theory Solutions to Midterm …

ECE 534: Elements of Information Theory. Solutions to Midterm Exam (Spring 2006). Problem 1 [20 pts.] A discrete memoryless source has an alphabet of three …


Source: www.cs.uic.edu

Date Published: 5/23/2021

View: 2772

Solution Manual of Elements of Information Theory – Scribd

Elements of Information Theory Solutions Manual. Thomas M. Cover, Joy A. Thomas. Department of Electrical Engineering, IBM T.J. Watson Research Center …


Source: fr.scribd.com

Date Published: 8/10/2022

View: 7081

Images related to the topic elements of information theory solution

See more photos related to the topic Most Complete Solution manual for Elements of Information Theory 2nd Edition Thomas M. Cover.wmv. You can see more related images in the comments, or find more related articles if needed.

Most Complete Solution manual for Elements of Information Theory 2nd Edition Thomas M. Cover.wmv

Article rating for the topic elements of information theory solution

  • Author: examgo
  • Views: 3,154
  • Likes: none
  • Date Published: 2011. 10. 27.
  • Video Url link: https://www.youtube.com/watch?v=C1arF1DaO8k

What are the elements of information theory explain?

All the essential topics in information theory are covered in detail, including entropy, data compression, channel capacity, rate distortion, network information theory, and hypothesis testing. The authors provide readers with a solid understanding of the underlying theory and applications.

What is the information theory equation?

This is called “Shannon information,” “self-information,” or simply the “information,” and can be calculated for a discrete event x as follows: information(x) = -log( p(x) )

What is the goal of information theory?

Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty.

What is the minimum value of H(p1, …, pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's that achieve this minimum.

What is the minimum value of H(p1,…,pn) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's which achieve this minimum. Solution: Since H(p) ≥ 0 and ∑i pi = 1, the minimum value of H(p) is 0, which is achieved when pi = 1 for some i and pj = 0 for all j ≠ i.
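
A quick numerical check of this answer, as a minimal Python sketch (the example distributions are assumed, not from the text):

# Minimal sketch: entropy of a degenerate vs. a spread-out distribution.
from math import log2

def entropy(p):
    # Shannon entropy in bits; terms with p_i = 0 contribute nothing by convention.
    return sum(-pi * log2(pi) for pi in p if pi > 0)

print(entropy([1.0, 0.0, 0.0]))    # 0.0 -> the minimum, achieved when one outcome is certain
print(entropy([0.5, 0.25, 0.25]))  # 1.5 -> strictly positive once probability is spread out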

How do we measure information?

Information can be measured in terms of a basic unit (a set consisting of one or more algorithms and heuristics plus data) which, when implemented, results in work equivalent to one joule of energy. The joule, an International System (SI) unit, can be translated into other standard units of energy.

What do you think is the importance of information theory application in a system?

Information theory provides a means for measuring redundancy or efficiency of symbolic representation within a given language.

How does information theory work?

Information theory is a branch of mathematics that overlaps into communications engineering, biology, medical science, sociology, and psychology. The theory is devoted to the discovery and exploration of mathematical laws that govern the behavior of data as it is transferred, stored, or retrieved.

What is information theory diagram?

An information diagram is a type of Venn diagram used in information theory to illustrate relationships among Shannon’s basic measures of information: entropy, joint entropy, conditional entropy and mutual information.

What is a symbol in information theory?

Information, in Shannon’s theory of information, is viewed stochastically, or probabilistically. It is carried discretely as symbols, which are selected from a set of possible symbols.
  • h(p) is continuous for 0 <= p <= 1: fairly intuitive that this should be so.
  • h(p) = 0 if p = 1: no surprise if I win a sure bet.

What are the major theories of information?

The five types of theories are classified: “(i) theory for analysing, (ii) theory for explaining, (iii) theory for predicting, (iv) theory for explaining and predicting, and (v) theory for design and action” (Gregor, 2002, p.

Who studies information theory?

Claude Shannon wrote a master’s thesis that jump-started digital circuit design, and a decade later he wrote his seminal paper on information theory, “A Mathematical Theory of Communication.” Next, Shannon set his sights on an even bigger target: communication. Communication is one of the most basic human needs.

What is code word in information theory?

The bit strings used to represent the symbols are called the codewords for the symbols. The coding problem is to assign codewords for each of the symbols s1,…,sM using as few bits per symbol as possible.
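
The sketch below illustrates the idea on a small, assumed source alphabet. It builds a prefix code with a priority queue in the spirit of Huffman coding; it is an illustration only, not the coding procedure from the text.

# Minimal Huffman-style sketch: shorter codewords go to more probable symbols.
import heapq

def huffman_codewords(probabilities):
    # probabilities: dict mapping symbol -> probability; returns symbol -> bit string.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Prepend a bit to every codeword in each merged subtree.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codewords({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'} -> average length 1.75 bits, equal to H(p) here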

What is joint and conditional entropy?

joint entropy is the amount of information in two (or more) random variables; conditional entropy is the amount of information in one random variable given we already know the other.
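
A minimal Python sketch of both quantities, using an assumed 2x2 joint distribution and the chain rule H(X|Y) = H(X,Y) - H(Y):

# Sketch: joint and conditional entropy from a small joint distribution p(x, y).
from math import log2

p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}  # assumed joint probabilities

H_XY = -sum(p * log2(p) for p in p_xy.values())  # joint entropy H(X, Y)
p_y = {y: sum(p for (x, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}  # marginal of Y
H_Y = -sum(p * log2(p) for p in p_y.values())    # marginal entropy H(Y)
H_X_given_Y = H_XY - H_Y                         # chain rule: H(X|Y) = H(X,Y) - H(Y)

print('H(X,Y)=%.3f bits, H(Y)=%.3f bits, H(X|Y)=%.3f bits' % (H_XY, H_Y, H_X_given_Y))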

What is entropy and mutual information?

Entropy then becomes the self-information of a random variable. Mutual information is a special case of a more general quantity called relative entropy, which is a measure of the distance between two probability distributions.

Why is mutual information symmetric?

The mutual information (MI) between two random variables captures how much information entropy is obtained about one random variable by observing the other. Since that definition does not specify which is the observed random variable, we might suspect this is a symmetric quantity.

What is information theory coding?

Information theory is a mathematical approach to the study of coding of information along with the quantification, storage, and communication of information.

What is entropy in information theory and coding?

Entropy. When we consider the possible occurrences of an event and how surprising or uncertain each would be, we are trying to get an idea of the average information content coming from the source of the event. Entropy can be defined as a measure of the average information content per source symbol.

Elements of Information Theory, 2nd Edition


Description

The latest edition of this classic is updated with new problem sets and material

The Second Edition of this fundamental textbook maintains the book’s tradition of clear, thought-provoking instruction. Readers are provided once again with an instructive mix of mathematics, physics, statistics, and information theory.

All the essential topics in information theory are covered in detail, including entropy, data compression, channel capacity, rate distortion, network information theory, and hypothesis testing. The authors provide readers with a solid understanding of the underlying theory and applications. Problem sets and a telegraphic summary at the end of each chapter further assist readers. The historical notes that follow each chapter recap the main points.

The Second Edition features:

* Chapters reorganized to improve teaching

* 200 new problems

* New material on source coding, portfolio theory, and feedback capacity

* Updated references

Now current and enhanced, the Second Edition of Elements of Information Theory remains the ideal textbook for upper-level undergraduate and graduate courses in electrical engineering, statistics, and telecommunications.

A Gentle Introduction to Information Entropy


Last Updated on July 13, 2020

Information theory is a subfield of mathematics concerned with transmitting data across a noisy channel.

A cornerstone of information theory is the idea of quantifying how much information there is in a message. More generally, this can be used to quantify the information in an event and a random variable, called entropy, and is calculated using probability.

Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.

In this post, you will discover a gentle introduction to information entropy.

After reading this post, you will know:

Information theory is concerned with data compression and transmission and builds upon probability and supports machine learning.

Information provides a way to quantify the amount of surprise for an event measured in bits.

Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.


Let’s get started.

Update Nov/2019: Added example of probability vs information and more on the intuition for entropy.

Overview

This tutorial is divided into three parts; they are:

1. What Is Information Theory?
2. Calculate the Information for an Event
3. Calculate the Entropy for a Random Variable

What Is Information Theory?

Information theory is a field of study concerned with quantifying information for communication.

It is a subfield of mathematics and is concerned with topics like data compression and the limits of signal processing. The field was proposed and developed by Claude Shannon while working at the US telephone company Bell Labs.

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding).

— Page 56, Machine Learning: A Probabilistic Perspective, 2012.

A foundational concept from information is the quantification of the amount of information in things like events, random variables, and distributions.

Quantifying the amount of information requires the use of probabilities, hence the relationship of information theory to probability.

Measurements of information are widely used in artificial intelligence and machine learning, such as in the construction of decision trees and the optimization of classifier models.

As such, there is an important relationship between information theory and machine learning and a practitioner must be familiar with some of the basic concepts from the field.

Why unify information theory and machine learning? Because they are two sides of the same coin. […] Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

— Page v, Information Theory, Inference, and Learning Algorithms, 2003.


Calculate the Information for an Event

Quantifying information is the foundation of the field of information theory.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

Low Probability Event: High Information (surprising).

High Probability Event: Low Information (unsurprising).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

— Page 73, Deep Learning, 2016.

Rare events are more uncertain or more surprising and require more information to represent them than common events.

We can calculate the amount of information there is in an event using the probability of the event. This is called “Shannon information,” “self-information,” or simply the “information,” and can be calculated for a discrete event x as follows:

information(x) = -log( p(x) )

Where log() is the base-2 logarithm and p(x) is the probability of the event x.

The choice of the base-2 logarithm means that the unit of the information measure is the bit (binary digit). This can be directly interpreted in the information processing sense as the number of bits required to represent the event.

The calculation of information is often written as h(); for example:

h(x) = -log( p(x) )

The negative sign ensures that the result is always positive or zero.

Information will be zero when the probability of an event is 1.0 or a certainty, e.g. there is no surprise.

Let’s make this concrete with some examples.

Consider a flip of a single fair coin. The probability of heads (and tails) is 0.5. We can calculate the information for flipping a head in Python using the log2() function.

# calculate the information for a coin flip
from math import log2
# probability of the event
p = 0.5
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))

Running the example prints the probability of the event as 50% and the information content for the event as 1 bit.

p(x)=0.500, information: 1.000 bits

If the same coin was flipped n times, then the information for this sequence of flips would be n bits.

If the coin was not fair and the probability of a head was instead 10% (0.1), then the event would be more rare and would require more than 3 bits of information.

p(x)=0.100, information: 3.322 bits

We can also explore the information in a single roll of a fair six-sided dice, e.g. the information in rolling a 6.

We know the probability of rolling any number is 1/6, which is a smaller number than 1/2 for a coin flip, therefore we would expect more surprise or a larger amount of information.

# calculate the information for a dice roll
from math import log2
# probability of the event
p = 1.0 / 6.0
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))

Running the example, we can see that our intuition is correct and that indeed, there is more than 2.5 bits of information in a single roll of a fair die.

p(x)=0.167, information: 2.585 bits

Other logarithms can be used instead of the base-2. For example, it is also common to use the natural logarithm that uses base-e (Euler’s number) in calculating the information, in which case the units are referred to as “nats.”
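
For example, a small sketch of the same coin-flip calculation in nats instead of bits (one bit equals ln(2), about 0.693, nats):

# calculate the information for a coin flip in bits and in nats
from math import log, log2
p = 0.5
print('information: %.3f bits' % -log2(p))  # 1.000 bits
print('information: %.3f nats' % -log(p))   # 0.693 nats = 1 bit * ln(2)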

We can further develop the intuition that low probability events have more information.

To make this clear, we can calculate the information for probabilities between 0 and 1 and plot the corresponding information for each. We can then create a plot of probability vs information. We would expect the plot to curve downward from low probabilities with high information to high probabilities with low information.

The complete example is listed below.

# compare probability vs information entropy
from math import log2
from matplotlib import pyplot
# list of probabilities
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# calculate information
info = [-log2(p) for p in probs]
# plot probability vs information
pyplot.plot(probs, info, marker='.')
pyplot.title('Probability vs Information')
pyplot.xlabel('Probability')
pyplot.ylabel('Information')
pyplot.show()

Running the example creates the plot of probability vs information in bits.

We can see the expected relationship: low probability events are more surprising and carry more information, while high probability events carry less information.

We can also see that this relationship is not linear; it is in fact slightly sub-linear. This makes sense given the use of the log function.

Calculate the Entropy for a Random Variable

We can also quantify how much information there is in a random variable.

For example, if we wanted to calculate the information for a random variable X with probability distribution p, this might be written as a function H(); for example:

H(X)

In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable.

Calculating the information for a random variable is called “information entropy,” “Shannon entropy,” or simply “entropy“. It is related to the idea of entropy from physics by analogy, in that both are concerned with uncertainty.

The intuition for entropy is that it is the average number of bits required to represent or transmit an event drawn from the probability distribution for the random variable.

… the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits […] needed on average to encode symbols drawn from a distribution P.

— Page 74, Deep Learning, 2016.

Entropy can be calculated for a random variable X with k in K discrete states as follows:

H(X) = -sum(p(k) * log(p(k)) for each k in K)

That is the negative of the sum of the probability of each event multiplied by the log of the probability of each event.

Like information, the log() function uses base-2 and the units are bits. A natural logarithm can be used instead and the units will be nats.

The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty. The largest entropy for a random variable will be if all events are equally likely.

We can consider a roll of a fair die and calculate the entropy for the variable. Each outcome has the same probability of 1/6, therefore it is a uniform probability distribution. We would therefore expect the average information to be the same as the information for a single event calculated in the previous section.

# calculate the entropy for a dice roll
from math import log2
# the number of events
n = 6
# probability of one event
p = 1.0 / n
# calculate entropy
entropy = -sum([p * log2(p) for _ in range(n)])
# print the result
print('entropy: %.3f bits' % entropy)

Running the example calculates the entropy as more than 2.5 bits, which is the same as the information for a single outcome. This makes sense, as the average information is the same as the lower bound on information as all outcomes are equally likely.

entropy: 2.585 bits

If we know the probability for each event, we can use the entropy() SciPy function to calculate the entropy directly.

For example:

# calculate the entropy for a dice roll
from scipy.stats import entropy
# discrete probabilities
p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# calculate entropy
e = entropy(p, base=2)
# print the result
print('entropy: %.3f bits' % e)

Running the example reports the same result that we calculated manually.

entropy: 2.585 bits

We can further develop the intuition for entropy of probability distributions.

Recall that entropy is the number of bits required to represent a randomly drawn event from the distribution, e.g. an average event. We can explore this for a simple distribution with two events, like a coin flip, but explore different probabilities for these two events and calculate the entropy for each.

In the case where one event dominates, such as a skewed probability distribution, then there is less surprise and the distribution will have a lower entropy. In the case where no event dominates another, such as equal or approximately equal probability distribution, then we would expect larger or maximum entropy.

Skewed Probability Distribution (unsurprising): Low entropy.

Balanced Probability Distribution (surprising): High entropy.

If we transition from skewed to equal probability of events in the distribution we would expect entropy to start low and increase, specifically from the lowest entropy of 0.0 for events with impossibility/certainty (probability of 0 and 1 respectively) to the largest entropy of 1.0 for events with equal probability.

The example below implements this, creating each probability distribution in this transition, calculating the entropy for each and plotting the result.

# compare probability distributions vs entropy
from math import log2
from matplotlib import pyplot

# calculate entropy
def entropy(events, ets=1e-15):
    return -sum([p * log2(p + ets) for p in events])

# define probabilities
probs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
# create probability distribution
dists = [[p, 1.0 - p] for p in probs]
# calculate entropy for each distribution
ents = [entropy(d) for d in dists]
# plot probability distribution vs entropy
pyplot.plot(probs, ents, marker='.')
pyplot.title('Probability Distribution vs Entropy')
pyplot.xticks(probs, [str(d) for d in dists])
pyplot.xlabel('Probability Distribution')
pyplot.ylabel('Entropy (bits)')
pyplot.show()

Running the example creates the 6 probability distributions with [0,1] probability through to [0.5,0.5] probabilities.

As expected, we can see that as the distribution of events changes from skewed to balanced, the entropy increases from minimal to maximum values.

That is, if the average event drawn from a probability distribution is not surprising we get a lower entropy, whereas if it is surprising, we get a larger entropy.

We can see that the transition is not linear, that it is super linear. We can also see that this curve is symmetrical if we continued the transition to [0.6, 0.4] and onward to [1.0, 0.0] for the two events, forming an inverted parabola-shape.

Note we had to add a tiny value to the probability when calculating the entropy to avoid taking the log of a zero value, which would result in an infinity or not-a-number value.

Calculating the entropy for a random variable provides the basis for other measures such as mutual information (information gain).

Entropy also provides the basis for calculating the difference between two probability distributions with cross-entropy and the KL-divergence.
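
As a small illustrative sketch (the two distributions below are assumed, not from this tutorial), cross-entropy decomposes as H(p, q) = H(p) + KL(p || q), so it equals the entropy of p only when q matches p exactly:

# sketch: relate entropy, cross-entropy and KL divergence for two assumed distributions
from math import log2
p = [0.6, 0.3, 0.1]    # "true" distribution
q = [0.5, 0.25, 0.25]  # model distribution
entropy_p = -sum(pi * log2(pi) for pi in p)
cross_entropy = -sum(pi * log2(qi) for pi, qi in zip(p, q))
kl_divergence = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
print('H(p)=%.3f, H(p,q)=%.3f, KL(p||q)=%.3f bits' % (entropy_p, cross_entropy, kl_divergence))
# H(p,q) - H(p) equals KL(p||q)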

Summary

In this post, you discovered a gentle introduction to information entropy.

Specifically, you learned:

Information theory is concerned with data compression and transmission and builds upon probability and supports machine learning.

Information provides a way to quantify the amount of surprise for an event measured in bits.

Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


Information theory

Scientific study of digital information

Not to be confused with Information science

Information theory is the scientific study of the quantification, storage, and communication of digital information.[1] The field was fundamentally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s.[2]: vii The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.

A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (with two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a die (with six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy. Important sub-fields of information theory include source coding, algorithmic complexity theory, algorithmic information theory and information-theoretic security.

Applications of fundamental topics of information theory include source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones and the development of the Internet. The theory has also found applications in other areas, including statistical inference,[3] cryptography, neurobiology,[4] perception,[5] linguistics, the evolution[6] and function[7] of molecular codes (bioinformatics), thermal physics,[8] molecular dynamics,[9] quantum computing, black holes, information retrieval, intelligence gathering, plagiarism detection,[10] pattern recognition, anomaly detection[11] and even art creation.

Overview

Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was formalized in 1948 by Claude Shannon in a paper entitled A Mathematical Theory of Communication, in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon’s main result, the noisy-channel coding theorem showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.[4]

Coding theory is concerned with finding explicit methods, called codes, for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon’s work proved were possible.

A third class of information theory codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis. See the article ban (unit) for a historical application.

Historical background

The landmark event establishing the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon’s classic paper “A Mathematical Theory of Communication” in the Bell System Technical Journal in July and October 1948.

Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. Harry Nyquist’s 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying “intelligence” and the “line speed” at which it can be transmitted by a communication system, giving the relation W = K log m (recalling the Boltzmann constant), where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley’s 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver’s ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S^n = n log S, where S was the number of possible symbols, and n the number of symbols in a transmission. The unit of information was therefore the decimal digit, which since has sometimes been called the hartley in his honor as a unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers.

Much of the mathematics behind information theory with events of different probabilities were developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.

In Shannon’s revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion:

“The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point.”

With it came the ideas of

the information entropy and redundancy of a source, and its relevance through the source coding theorem;

the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;

the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as

the bit—a new way of seeing the most fundamental unit of information.

Quantities of information

Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy, which forms the building block of many other measures. Entropy allows quantification of measure of information in a single random variable. Another useful concept is mutual information defined on two random variables, which describes the measure of information in common between those variables, which can be used to describe their correlation. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.

The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm.

In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0. This is justified because \lim_{p \to 0^+} p \log p = 0 for any logarithmic base.

Entropy of an information source

Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by

H = -\sum_i p_i \log_2(p_i)

where p_i is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in the units of “bits” (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the shannon in his honor. Entropy is also commonly computed using the natural logarithm (base e, where e is Euler’s number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.
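
A minimal Python sketch (not part of the original article) showing the same entropy expressed in these different units, for an assumed source distribution:

# Sketch: one source entropy in bits, nats, hartleys and bytes by changing the log base.
import math

p = [0.5, 0.25, 0.125, 0.125]  # assumed source distribution

def entropy(p, base):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print('%.4f bits' % entropy(p, 2))        # 1.7500
print('%.4f nats' % entropy(p, math.e))   # = 1.75 * ln 2
print('%.4f hartleys' % entropy(p, 10))   # = 1.75 * log10 2
print('%.4f bytes' % entropy(p, 256))     # = 1.75 / 8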

Intuitively, the entropy H(X) of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known.

The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid) is N ⋅ H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N ⋅ H.

Figure: the entropy H_b(p) of a Bernoulli trial as a function of success probability, often called the binary entropy function. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.

If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If \mathbb{X} is the set of all messages {x_1, …, x_n} that X could be, and p(x) is the probability of some x \in \mathbb{X}, then the entropy, H, of X is defined:[12]

H(X) = \mathbb{E}_X[I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).

(Here, I(x) is the self-information, which is the entropy contribution of an individual message, and \mathbb{E}_X is the expected value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n; i.e., most unpredictable, in which case H(X) = log n.

The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit:

H_b(p) = -p \log_2 p - (1-p) \log_2(1-p).

Joint entropy

The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.

For example, if (X, Y) represents the position of a chess piece—X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.

H(X,Y) = \mathbb{E}_{X,Y}[-\log p(x,y)] = -\sum_{x,y} p(x,y) \log p(x,y)

Despite similar notation, joint entropy should not be confused with cross entropy.

Conditional entropy (equivocation)

The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y:[13]

H(X|Y) = \mathbb{E}_Y[H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log p(x|y).

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:

H(X|Y) = H(X,Y) - H(Y).

Mutual information (transinformation)

Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by:

I(X;Y) = \mathbb{E}_{X,Y}[SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

where SI (Specific mutual Information) is the pointwise mutual information.

A basic property of the mutual information is that

I(X;Y) = H(X) - H(X|Y).

That is, knowing Y, we can save an average of I(X; Y) bits in encoding X compared to not knowing Y.

Mutual information is symmetric:

I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).
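
These identities are easy to check numerically; the sketch below uses an assumed joint distribution (not from the article) to verify them in Python:

# Sketch: verify I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) on a small example.
from math import log2

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}  # assumed joint p(x, y)
p_x = {x: sum(p for (x2, y), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (x, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)
I_direct = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print('I(X;Y) directly:   %.4f bits' % I_direct)
print('H(X)+H(Y)-H(X,Y):  %.4f bits' % (H_X + H_Y - H_XY))
print('H(X)-H(X|Y):       %.4f bits' % (H_X - (H_XY - H_Y)))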

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X:

I(X;Y) = \mathbb{E}_{p(y)}[D_{\mathrm{KL}}(p(X|Y=y) \| p(X))].

In other words, this is a measure of how much, on the average, the probability distribution on X will change if we are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

I(X;Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson’s χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.

Kullback–Leibler divergence (information gain)

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a “true” probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined

D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log q(x) - \sum_{x \in X} -p(x) \log p(x) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.

Although it is sometimes used as a ‘distance metric’, KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).

Another interpretation of the KL divergence is the “unnecessary surprise” introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob’s (subjective) surprisal minus Alice’s surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob’s prior is “wrong” can be quantified in terms of how “unnecessarily surprised” it is expected to make him.
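
A minimal sketch of this interpretation with assumed distributions for Alice and Bob; scipy.stats.entropy returns the Kullback–Leibler divergence when a second distribution is supplied:

# Sketch: extra bits per symbol Bob pays for believing q when the data follow p.
from math import log2
from scipy.stats import entropy

p = [0.5, 0.25, 0.25]  # Alice's true distribution (assumed)
q = [0.25, 0.25, 0.5]  # Bob's prior (assumed)

kl_manual = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
kl_scipy = entropy(p, qk=q, base=2)  # D_KL(p || q) when qk is given
print('D_KL(p||q) = %.4f bits (scipy: %.4f)' % (kl_manual, kl_scipy))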

Other quantities

Other important information theoretic quantities include Rényi entropy (a generalization of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information.

Coding theory

A picture showing scratches on the readable surface of a CD-R. Music and data CDs are coded using error correcting codes and thus can still be read even if they have minor scratches using error detection and correction

Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.

Data compression (source coding): There are two formulations for the compression problem:
  • lossless data compression: the data must be reconstructed exactly;
  • lossy data compression: allocates the bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of information theory is called rate–distortion theory.

Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.

This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary “helpers” (the relay channel), or more general networks, compression followed by transmission may no longer be optimal.

Source theory

Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.

Rate

Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is

r = \lim_{n \to \infty} H(X_n | X_{n-1}, X_{n-2}, X_{n-3}, \ldots);

that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is

r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots, X_n);

that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.[14]

Information rate is defined as

r = \lim_{n \to \infty} \frac{1}{n} I(X_1, X_2, \dots, X_n; Y_1, Y_2, \dots, Y_n);

It is common in information theory to speak of the “rate” or “entropy” of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
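
For a stationary first-order Markov source, the limiting conditional entropy reduces to the conditional entropy of the next symbol given the current one, weighted by the stationary distribution. A minimal Python sketch with an assumed two-state source (not from the article):

# Sketch: entropy rate of a two-state stationary Markov source.
import numpy as np

P = np.array([[0.9, 0.1],   # transition matrix: P[i, j] = Pr(next = j | current = i)
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.isclose(eigvals, 1.0)][:, 0])
pi = pi / pi.sum()

rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j]) for i in range(2) for j in range(2))
iid_entropy = -sum(pi * np.log2(pi))  # entropy if symbols were drawn iid from the marginal

print('stationary distribution:', pi)                       # [0.8 0.2]
print('entropy rate: %.4f bits/symbol' % rate)               # about 0.57 bits/symbol
print('entropy of the marginal: %.4f bits/symbol' % iid_entropy)

The rate is lower than the entropy of the marginal distribution because successive symbols are dependent, which is exactly what makes such a source compressible below one symbol's worth of marginal entropy.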

Channel capacity

Communications over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.

Consider the communications process over a discrete channel. A simple model of the process is shown below:

Message W → [Encoder f_n] → Encoded sequence X^n → [Channel p(y|x)] → Received sequence Y^n → [Decoder g_n] → Estimated message Ŵ

Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel. Let p(y|x) be the conditional probability distribution function of Y given X. We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:

C = \max_f I(X;Y).

This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.

Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.

Capacity of particular channel models

A continuous-time analog communications channel subject to Gaussian noise—see Shannon–Hartley theorem.

A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. The BSC has a capacity of 1 − H_b(p) bits per channel use, where H_b is the binary entropy function to the base-2 logarithm defined above.

A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol ‘e’ called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is 1 − p bits per channel use.
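
A minimal sketch (not from the article) computing the two capacities quoted above, with the BSC value cross-checked by a brute-force maximization of I(X;Y) over input distributions:

# Sketch: capacities of the binary symmetric and binary erasure channels.
import numpy as np

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

crossover = 0.1
bsc_capacity = 1 - binary_entropy(crossover)  # BSC: C = 1 - H_b(p)
bec_capacity = 1 - crossover                  # BEC: C = 1 - p, with p the erasure probability

# Brute-force check for the BSC: maximize I(X;Y) = H(Y) - H(Y|X) over the input distribution.
def bsc_mutual_information(px0, p):
    py0 = px0 * (1 - p) + (1 - px0) * p       # output marginal Pr(Y = 0)
    return binary_entropy(py0) - binary_entropy(p)

best = max(bsc_mutual_information(px0, crossover) for px0 in np.linspace(0, 1, 1001))
print('BSC capacity: %.4f bits/use (brute force: %.4f)' % (bsc_capacity, best))
print('BEC capacity: %.4f bits/use' % bec_capacity)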

Channels with memory and directed information

In practice many channels have memory. Namely, at time i the channel is given by the conditional probability P(y_i | x_i, x_{i-1}, x_{i-2}, \ldots, x_1, y_{i-1}, y_{i-2}, \ldots, y_1). It is often more convenient to use the notation x^i = (x_i, x_{i-1}, x_{i-2}, \ldots, x_1), so that the channel becomes P(y_i | x^i, y^{i-1}). In such a case the capacity is given by the mutual information rate when there is no feedback available and by the directed information rate whether or not there is feedback[15][16] (if there is no feedback, the directed information equals the mutual information).

Applications to other fields

Intelligence uses and secrecy applications

Information theoretic concepts apply to cryptography and cryptanalysis. Turing’s information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.

Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time.

Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.

Pseudorandom number generation

Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.

Seismic exploration

One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.[17]

Semiotics

Semioticians Doede Nauta and Winfried Nöth both considered Charles Sanders Peirce as having created a theory of information in his works on semiotics.[18]: 171 [19]: 137 Nauta defined semiotic information theory as the study of “the internal processes of coding, filtering, and information processing.”[18]: 91

Concepts from information theory such as redundancy and code control have been used by semioticians such as Umberto Eco and Ferruccio Rossi-Landi to explain ideology as a form of message transmission whereby a dominant social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is decoded among a selection of competing ones.[20]

Integrated process organization of neural information

Quantitative information-theoretic methods have been applied in cognitive science to analyze the integrated process organization of neural information in the context of the binding problem in cognitive neuroscience.[21] In this context, an information-theoretic measure is defined on the basis of a reentrant process organization, i.e. the synchronization of neurophysiological activity between groups of neuronal populations; examples are "functional clusters" (G.M. Edelman's and G. Tononi's "Functional Clustering Model" and "Dynamic Core Hypothesis (DCH)"[22]) and "effective information" (G. Tononi's and O. Sporns's "Information Integration Theory (IIT) of Consciousness"[23][24][25]; see also "integrated information theory"). Alternatively, the minimization of "free energy" is measured on the basis of statistical methods (K.J. Friston's "free energy principle (FEP)", which states that every adaptive change in a self-organized system leads to a minimization of "free energy", and the "Bayesian Brain Hypothesis"[26][27][28][29][30]).

Miscellaneous applications

Information theory also has applications in gambling, black holes, and bioinformatics.
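For the gambling connection, here is a small sketch (not from the article; the win probability is an arbitrary assumption) of Kelly-style proportional betting on a biased coin at even-money odds, whose optimal doubling rate works out to 1 - H(p) bits per bet:

    import math

    p = 0.6                                    # assumed probability of winning each even-money bet

    def doubling_rate(f):                      # expected log2-wealth growth when betting a fraction f
        return p * math.log2(1 + f) + (1 - p) * math.log2(1 - f)

    kelly_fraction = 2 * p - 1                 # optimal fraction for even-money odds
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    print(doubling_rate(kelly_fraction), 1 - entropy)   # both ~0.029 bits per bet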

[397 p. COMPLETE SOLUTIONS] Elements of Information Theory 2nd Edition – COMPLETE solutions manual (chapters 1-17)

Preface

Here we have the solutions to all the problems in the second edition of Elements of Information Theory. First a word about how the problems and solutions were generated.

The problems arose over the many years the authors taught this course. At first the homework problems and exam problems were generated each week. After a few years of this double duty, the homework problems were rolled forward from previous years and only the exam problems were fresh. So each year, the midterm and final exam problems became candidates for addition to the body of homework problems that you see in the text. The exam problems are necessarily brief, with a point, and reasonably free from time-consuming calculation, so the problems in the text for the most part share these properties.

The solutions to the problems were generated by the teaching assistants and graders for the weekly homework assignments and handed back with the graded homeworks in the class immediately following the date the assignment was due. Homeworks were optional and did not enter into the course grade. Nonetheless most students did the homework. A list of the many students who contributed to the solutions is given in the book acknowledgment. In particular, we would like to thank Laura Ekroot, Will Equitz, Don Kimber, Mitchell Trott, Andrew Nobel, Jim Roche, Vittorio Castelli, Mitchell Oslick, Chien-Wen Tseng, Michael Morrell, Marc Goldberg, George Gemelos, Navid Hassanpour, Young-Han Kim, Charles Mathis, Styrmir Sigurjonsson, Jon Yard, Michael Baer, Mung Chiang, Suhas Diggavi, Elza Erkip, Paul Fahn, Garud Iyengar, David Julian, Yiannis Kontoyiannis, Amos Lapidoth, Erik Ordentlich, Sandeep Pombra, Arak Sutivong, Josh Sweetkind-Singer and Assaf Zeevi. We would like to thank Prof. John Gill and Prof. Abbas El Gamal for many interesting problems and solutions.

The solutions therefore show a wide range of personalities and styles, although some of them have been smoothed out over the years by the authors. The best way to look at the solutions is that they offer more than you need to solve the problems. And the solutions in some cases may be awkward or inefficient. We view that as a plus. An instructor can see the extent of the problem by examining the solution but can still improve his or her own version.

The solution manual comes to some 400 pages. We are making electronic copies available to course instructors in PDF. We hope that all the solutions are not put up on an insecure website; it will not be useful to use the problems in the book for homeworks and exams if the solutions can be obtained immediately with a quick Google search. Instead, we will put up a small selected subset of problem solutions on our website, http://www.elementsofinformationtheory.com, available to all. These will be problems that have particularly elegant or long solutions that would not be suitable homework or exam problems.


Elements of Information Theory Second Edition Solutions to Problems


Thomas M. Cover and Joy A. Thomas
August 23, 2007
COPYRIGHT 2006 Thomas Cover, Joy Thomas. All rights reserved.

Contents

1 Introduction
2 Entropy, Relative Entropy and Mutual Information
3 The Asymptotic Equipartition Property
4 Entropy Rates of a Stochastic Process
5 Data Compression
6 Gambling and Data Compression
7 Channel Capacity
8 Differential Entropy
9 Gaussian channel
10 Rate Distortion Theory
11 Information Theory and Statistics
12 Maximum Entropy
13 Universal Source Coding
14 Kolmogorov Complexity
15 Network Information Theory
16 Information Theory and Portfolio Theory
17 Inequalities in Information Theory
We have also seen some people trying to sell the solutions manual on Amazon or Ebay. Please note that the Solutions Manual for Elements of Information Theory is copyrighted and any sale or distribution without the permission of the authors is not permitted. We would appreciate any comments, suggestions and corrections to this solutions manual.

Tom Cover
Durand 121, Information Systems Lab
Stanford University
Stanford, CA 94305
Ph. 650-723-4505, FAX: 650-723-8473
Email: [email protected]

Joy Thomas
Stratify
701 N Shoreline Avenue
Mountain View, CA 94043
Ph. 650-210-2722, FAX: 650-988-2159
Email: [email protected]

Chapter 1: Introduction

Chapter 2: Entropy, Relative Entropy and Mutual Information

1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.
(a) Find the entropy H(X) in bits. The following expressions may be useful:
    \sum_{n=0}^{\infty} r^n = \frac{1}{1-r},  \qquad  \sum_{n=0}^{\infty} n r^n = \frac{r}{(1-r)^2}.
(b) A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, "Is X contained in the set S?" Compare H(X) to the expected number of questions required to determine X.

Solution:
(a) The number X of tosses till the first head appears has the geometric distribution with parameter p = 1/2, where P(X = n) = p q^{n-1}, n \in \{1, 2, \ldots\}. Hence the entropy of X is
    H(X) = -\sum_{n=1}^{\infty} p q^{n-1} \log (p q^{n-1})
         = -\left[ \sum_{n=0}^{\infty} p q^n \log p + \sum_{n=0}^{\infty} n p q^n \log q \right]
         = \frac{-p \log p}{1-q} - \frac{p q \log q}{p^2}
         = \frac{-p \log p - q \log q}{p}
         = H(p)/p  bits.
If p = 1/2, then H(X) = 2 bits.
(b) Intuitively, it seems clear that the best questions are those that have equally likely chances of receiving a yes or a no answer. Consequently, one possible guess is that the most "efficient" series of questions is: Is X = 1? If not, is X = 2? If not, is X = 3? ... with a resulting expected number of questions equal to \sum_{n=1}^{\infty} n (1/2^n) = 2. This should reinforce the intuition that H(X) is a measure of the uncertainty of X. Indeed in this case, the entropy is exactly the same as the average number of questions needed to define X, and in general E(number of questions) \geq H(X). This problem has an interpretation as a source coding problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then the set of questions in the above procedure can be written as a collection of (X, Y) pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the optimal (Huffman) code minimizing the expected number of questions.

2. Entropy of functions. Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if
(a) Y = 2^X?
(b) Y = cos X?

Solution: Let y = g(x). Then
    p(y) = \sum_{x: y = g(x)} p(x).
Consider any set of x's that map onto a single y. For this set
    \sum_{x: y = g(x)} p(x) \log p(x) \leq \sum_{x: y = g(x)} p(x) \log p(y) = p(y) \log p(y),
since log is a monotone increasing function and p(x) \leq \sum_{x: y = g(x)} p(x) = p(y). Extending this argument to the entire range of X (and Y), we obtain
    H(X) = -\sum_x p(x) \log p(x) = -\sum_y \sum_{x: y = g(x)} p(x) \log p(x) \geq -\sum_y p(y) \log p(y) = H(Y),
with equality iff g is one-to-one with probability one.
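As a quick numerical sanity check of the inequality just derived (a sketch, not part of the original solution; the example distribution and the non-injective map g(x) = x^2 are arbitrary choices):

    import math
    from collections import defaultdict

    px = {-1: 0.2, 0: 0.3, 1: 0.1, 2: 0.4}     # assumed distribution for X
    py = defaultdict(float)
    for x, p in px.items():
        py[x * x] += p                          # push probability mass through g(x) = x^2

    H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)
    print(H(px), H(py))                         # H(X) ~ 1.85 >= H(g(X)) ~ 1.57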
(a) Y = 2^X is one-to-one and hence the entropy, which is just a function of the probabilities (and not the values of the random variable), does not change, i.e., H(X) = H(Y).
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that H(X) \geq H(Y), with equality if cosine is one-to-one on the range of X.

3. Minimum entropy. What is the minimum value of H(p_1, \ldots, p_n) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's which achieve this minimum.

Solution: We wish to find all probability vectors p = (p_1, p_2, \ldots, p_n) which minimize
    H(p) = -\sum_i p_i \log p_i.
Now -p_i \log p_i \geq 0, with equality iff p_i = 0 or 1. Hence the only possible probability vectors which minimize H(p) are those with p_i = 1 for some i and p_j = 0, j \neq i. There are n such vectors, i.e., (1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1), and the minimum value of H(p) is 0.

4. Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:
    H(X, g(X)) = H(X) + H(g(X)|X)     (a)
               = H(X);                (b)
    H(X, g(X)) = H(g(X)) + H(X|g(X))  (c)
               \geq H(g(X)).          (d)
Thus H(g(X)) \leq H(X).

Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence H(g(X)|X) = \sum_x p(x) H(g(X)|X = x) = \sum_x 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) \geq 0, with equality iff X is a function of g(X), i.e., g(\cdot) is one-to-one. Hence H(X, g(X)) \geq H(g(X)).
Combining parts (b) and (d), we obtain H(X) \geq H(g(X)).

5. Zero conditional entropy. Show that if H(Y|X) = 0, then Y is a function of X, i.e., for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.

Solution: Zero Conditional Entropy. Assume that there exists an x, say x_0, and two different values of y, say y_1 and y_2, such that p(x_0, y_1) > 0 and p(x_0, y_2) > 0. Then p(x_0) \geq p(x_0, y_1) + p(x_0, y_2) > 0, and p(y_1|x_0) and p(y_2|x_0) are not equal to 0 or 1. Thus
    H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x) \geq p(x_0) \left( -p(y_1|x_0) \log p(y_1|x_0) - p(y_2|x_0) \log p(y_2|x_0) \right) > 0,
since -t \log t \geq 0 for 0 \leq t \leq 1, and is strictly positive for t not equal to 0 or 1. Therefore the conditional entropy H(Y|X) is 0 if and only if Y is a function of X.

6. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that
(a) I(X; Y | Z) < I(X; Y),
(b) I(X; Y | Z) > I(X; Y).

Solution: Conditional mutual information vs. unconditional mutual information.
(a) The last corollary to Theorem 2.8.1 in the text states that if X \to Y \to Z, that is, if p(x, y | z) = p(x | z) p(y | z), then I(X; Y) \geq I(X; Y | Z). Equality holds if and only if I(X; Z) = 0, i.e., X and Z are independent. A simple example of random variables satisfying the inequality conditions above is: X a fair binary random variable, Y = X and Z = Y. In this case,
    I(X; Y) = H(X) - H(X | Y) = H(X) = 1, and
    I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = 0,
so that I(X; Y) > I(X; Y | Z).
(b) This example is also given in the text. Let X, Y be independent fair binary random variables and let Z = X + Y.
In this case we have that, I(X; Y ) = 0 and, I(X; Y | Z) = H(X | Z) = 1/2. So I(X; Y ) < I(X; Y | Z) . Note that in this case X, Y, Z are not markov. 7. Coin weighing. Suppose one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance. (a) Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter. (b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins? Solution: Coin weighing. (a) For n coins, there are 2n + 1 possible situations or “states”. • One of the n coins is heavier. • One of the n coins is lighter. • They are all of equal weight. Entropy, Relative Entropy and Mutual Information 13 Each weighing has three possible outcomes - equal, left pan heavier or right pan heavier. Hence with k weighings, there are 3 k possible outcomes and hence we can distinguish between at most 3k different “states”. Hence 2n + 1 ≤ 3k or n ≤ (3k − 1)/2 . Looking at it from an information theoretic viewpoint, each weighing gives at most log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least log2 (2n + 1)/ log 2 3 weighings to extract enough information for determination of the odd coin, which gives the same result as above. (b) There are many solutions to this problem. We will give one which is based on the ternary number system. We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number system with alphabet {−1, 0, 1} . For example, the number 8 is (-1,0,1) where −1 × 30 + 0 × 31 + 1 × 32 = 8 . We form the matrix with the representation of the positive numbers as its columns. 1 2 3 4 5 6 7 8 9 10 11 12 0 3 1 -1 0 1 -1 0 1 -1 0 1 -1 0 Σ1 = 0 31 0 1 1 1 -1 -1 -1 0 0 0 1 1 Σ2 = 2 2 3 0 0 0 0 1 1 1 1 1 1 1 1 Σ3 = 8 Note that the row sums are not all zero. We can negate some columns to make the row sums zero. For example, negating columns 7,9,11 and 12, we obtain 1 2 3 4 5 6 7 8 9 10 11 12 0 3 1 -1 0 1 -1 0 -1 -1 0 1 1 0 Σ1 = 0 31 0 1 1 1 -1 -1 1 0 0 0 -1 -1 Σ2 = 0 32 0 0 0 0 1 1 -1 1 -1 1 -1 -1 Σ3 = 0 Now place the coins on the balance according to the following rule: For weighing #i , place coin n • On left pan, if ni = −1 . • Aside, if ni = 0 . • On right pan, if ni = 1 . The outcome of the three weighings will find the odd coin if any and tell whether it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings give the ternary expansion of the index of the odd coin. If the expansion is the same as the expansion in the matrix, it indicates that the coin is heavier. If the expansion is of the opposite sign, the coin is lighter. For example, (0,-1,-1) indicates (0)30 +(−1)3+(−1)32 = −12 , hence coin #12 is heavy, (1,0,-1) indicates #8 is light, (0,0,0) indicates no odd coin. Why does this scheme work? It is a single error correcting Hamming code for the ternary alphabet (discussed in Section 8.11 in the book). Here are some details. First note a few properties of the matrix above that was used for the scheme. All the columns are distinct and no two columns add to (0,0,0). 
Also if any coin 14 Entropy, Relative Entropy and Mutual Information is heavier, it will produce the sequence of weighings that matches its column in the matrix. If it is lighter, it produces the negative of its column as a sequence of weighings. Combining all these facts, we can see that any single odd coin will produce a unique sequence of weighings, and that the coin can be determined from the sequence. One of the questions that many of you had whether the bound derived in part (a) was actually achievable. For example, can one distinguish 13 coins in 3 weighings? No, not with a scheme like the one above. Yes, under the assumptions under which the bound was derived. The bound did not prohibit the division of coins into halves, neither did it disallow the existence of another coin known to be normal. Under both these conditions, it is possible to find the odd coin of 13 coins in 3 weighings. You could try modifying the above scheme to these cases. 8. Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why. (There is both a hard way and a relatively simple way to do this.) Solution: Drawing with and without replacement. Intuitively, it is clear that if the balls are drawn with replacement, the number of possible choices for the i -th ball is larger, and therefore the conditional entropy is larger. But computing the conditional distributions is slightly involved. It is easier to compute the unconditional entropy. • With replacement. In this case the conditional distribution of each draw is the same for every draw. Thus    red and therefore r with prob. r+w+b w white with prob. r+w+b Xi =   b black with prob. r+w+b (2.8) H(Xi |Xi−1 , . . . , X1 ) = H(Xi ) (2.9) r w b = log(r + w + b) − log r − log w − log(2.10) b. r+w+b r+w+b r+w+b • Without replacement. The unconditional probability of the i -th ball being red is still r/(r + w + b) , etc. Thus the unconditional entropy H(X i ) is still the same as with replacement. The conditional entropy H(X i |Xi−1 , . . . , X1 ) is less than the unconditional entropy, and therefore the entropy of drawing without replacement is lower. 9. A metric. A function ρ(x, y) is a metric if for all x, y , • ρ(x, y) ≥ 0 • ρ(x, y) = ρ(y, x) Entropy, Relative Entropy and Mutual Information 15 • ρ(x, y) = 0 if and only if x = y • ρ(x, y) + ρ(y, z) ≥ ρ(x, z) . (a) Show that ρ(X, Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth properties above. If we say that X = Y if there is a one-to-one function mapping from X to Y , then the third property is also satisfied, and ρ(X, Y ) is a metric. (b) Verify that ρ(X, Y ) can also be expressed as ρ(X, Y ) = H(X) + H(Y ) − 2I(X; Y ) = H(X, Y ) − I(X; Y ) = 2H(X, Y ) − H(X) − H(Y ). (2.11) (2.12) (2.13) Solution: A metric (a) Let ρ(X, Y ) = H(X|Y ) + H(Y |X). (2.14) Then • Since conditional entropy is always ≥ 0 , ρ(X, Y ) ≥ 0 . • The symmetry of the definition implies that ρ(X, Y ) = ρ(Y, X) . • By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X, Y ) is 0 iff X and Y are functions of each other - and therefore are equivalent up to a reversible transformation. • Consider three random variables X , Y and Z . Then H(X|Y ) + H(Y |Z) ≥ H(X|Y, Z) + H(Y |Z) (2.15) = H(X|Z) + H(Y |X, Z) (2.17) = H(X, Y |Z) from which it follows that ≥ H(X|Z), ρ(X, Y ) + ρ(Y, Z) ≥ ρ(X, Z). 
(2.16) (2.18) (2.19) Note that the inequality is strict unless X → Y → Z forms a Markov Chain and Y is a function of X and Z . (b) Since H(X|Y ) = H(X) − I(X; Y ) , the first equation follows. The second relation follows from the first equation and the fact that H(X, Y ) = H(X) + H(Y ) − I(X; Y ) . The third follows on substituting I(X; Y ) = H(X) + H(Y ) − H(X, Y ) . 10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn according to probability mass functions p 1 (·) and p2 (·) over the respective alphabets X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let X= ) X1 , with probability α, X2 , with probability 1 − α. 16 Entropy, Relative Entropy and Mutual Information (a) Find H(X) in terms of H(X1 ) and H(X2 ) and α. (b) Maximize over α to show that 2H(X) ≤ 2H(X1 ) + 2H(X2 ) and interpret using the notion that 2H(X) is the effective alphabet size. Solution: Entropy. We can do this problem by writing down the definition of entropy and expanding the various terms. Instead, we will use the algebra of entropies for a simpler proof. Since X1 and X2 have disjoint support sets, we can write X= ) Define a function of X , X1 X2 with probability α with probability 1 − α θ = f (X) = ) 1 2 when X = X1 when X = X2 Then as in problem 1, we have H(X) = H(X, f (X)) = H(θ) + H(X|θ) = H(θ) + p(θ = 1)H(X|θ = 1) + p(θ = 2)H(X|θ = 2) = H(α) + αH(X1 ) + (1 − α)H(X2 ) where H(α) = −α log α − (1 − α) log(1 − α) . 11. A measure of correlation. Let X1 and X2 be identically distributed, but not necessarily independent. Let ρ=1− (a) Show ρ = H(X2 | X1 ) . H(X1 ) I(X1 ;X2 ) H(X1 ) . (b) Show 0 ≤ ρ ≤ 1. (c) When is ρ = 0 ? (d) When is ρ = 1 ? Solution: A measure of correlation. X1 and X2 are identically distributed and ρ=1− H(X2 |X1 ) H(X1 ) (a) ρ = = = H(X1 ) − H(X2 |X1 ) H(X1 ) H(X2 ) − H(X2 |X1 ) (since H(X1 ) = H(X2 )) H(X1 ) I(X1 ; X2 ) . H(X1 ) 17 Entropy, Relative Entropy and Mutual Information (b) Since 0 ≤ H(X2 |X1 ) ≤ H(X2 ) = H(X1 ) , we have 0≤ H(X2 |X1 ) ≤1 H(X1 ) 0 ≤ ρ ≤ 1. (c) ρ = 0 iff I(X1 ; X2 ) = 0 iff X1 and X2 are independent. (d) ρ = 1 iff H(X2 |X1 ) = 0 iff X2 is a function of X1 . By symmetry, X1 is a function of X2 , i.e., X1 and X2 have a one-to-one relationship. 12. Example of joint entropy. Let p(x, y) be given by ! Y ! X ! 0 1 0 1 3 1 3 1 0 1 3 Find (a) H(X), H(Y ). (b) H(X | Y ), H(Y | X). (c) H(X, Y ). (d) H(Y ) − H(Y | X). (e) I(X; Y ) . (f) Draw a Venn diagram for the quantities in (a) through (e). Solution: Example of joint entropy (a) H(X) = 2 3 log (b) H(X|Y ) = (c) H(X, Y ) = 3 2 + 1 3 log 3 = 0.918 bits = H(Y ) . 1 2 3 H(X|Y = 0) + 3 H(X|Y 3 × 13 log 3 = 1.585 bits. = 1) = 0.667 bits = H(Y |X) . (d) H(Y ) − H(Y |X) = 0.251 bits. (e) I(X; Y ) = H(Y ) − H(Y |X) = 0.251 bits. (f) See Figure 1. 13. Inequality. Show ln x ≥ 1 − 1 x for x > 0. Solution: Inequality. Using the Remainder form of the Taylor expansion of ln(x) about x = 1 , we have for some c between 1 and x ln(x) = ln(1) + * + 1 t t=1 (x − 1) + * −1 t2 + t=c (x − 1)2 ≤ x−1 2 18 Entropy, Relative Entropy and Mutual Information Figure 2.1: Venn diagram to illustrate the relationships of entropy and relative entropy H(Y) H(X) H(X|Y) I(X;Y) H(Y|X) since the second term is always negative. Hence letting y = 1/x , we obtain − ln y ≤ 1 −1 y or ln y ≥ 1 − 1 y with equality iff y = 1 . 14. Entropy of a sum. Let X and Y be random variables that take on values x 1 , x2 , . . . , xr and y1 , y2 , . . . , ys , respectively. Let Z = X + Y. (a) Show that H(Z|X) = H(Y |X). 
Argue that if X, Y are independent, then H(Y ) ≤ H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables adds uncertainty. (b) Give an example of (necessarily dependent) random variables in which H(X) > H(Z) and H(Y ) > H(Z). (c) Under what conditions does H(Z) = H(X) + H(Y ) ? Solution: Entropy of a sum. (a) Z = X + Y . Hence p(Z = z|X = x) = p(Y = z − x|X = x) . H(Z|X) = ! = − = ! x = p(x)H(Z|X = x) ! ! p(x) x p(x) ! ! y p(Z = z|X = x) log p(Z = z|X = x) z p(Y = z − x|X = x) log p(Y = z − x|X = x) p(x)H(Y |X = x) = H(Y |X). Entropy, Relative Entropy and Mutual Information 19 If X and Y are independent, then H(Y |X) = H(Y ) . Since I(X; Z) ≥ 0 , we have H(Z) ≥ H(Z|X) = H(Y |X) = H(Y ) . Similarly we can show that H(Z) ≥ H(X) . (b) Consider the following joint distribution for X and Y Let X = −Y = ) 1 0 with probability 1/2 with probability 1/2 Then H(X) = H(Y ) = 1 , but Z = 0 with prob. 1 and hence H(Z) = 0 . (c) We have H(Z) ≤ H(X, Y ) ≤ H(X) + H(Y ) because Z is a function of (X, Y ) and H(X, Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y ) . We have equality iff (X, Y ) is a function of Z and H(Y ) = H(Y |X) , i.e., X and Y are independent. 15. Data processing. Let X1 → X2 → X3 → · · · → Xn form a Markov chain in this order; i.e., let p(x1 , x2 , . . . , xn ) = p(x1 )p(x2 |x1 ) · · · p(xn |xn−1 ). Reduce I(X1 ; X2 , . . . , Xn ) to its simplest form. Solution: Data Processing. By the chain rule for mutual information, I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ) + I(X1 ; X3 |X2 ) + · · · + I(X1 ; Xn |X2 , . . . , Xn−2 ). (2.20) By the Markov property, the past and the future are conditionally independent given the present and hence all terms except the first are zero. Therefore I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ). (2.21) 16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks down to k < n states, and then fans back to m > k states. Thus X 1 → X2 → X3 , i.e., p(x1 , x2 , x3 ) = p(x1 )p(x2 |x1 )p(x3 |x2 ) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} , x3 ∈ {1, 2, . . . , m} . (a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving that I(X1 ; X3 ) ≤ log k. (b) Evaluate I(X1 ; X3 ) for k = 1 , and conclude that no dependence can survive such a bottleneck. Solution: Bottleneck. 20 Entropy, Relative Entropy and Mutual Information (a) From the data processing inequality, and the fact that entropy is maximum for a uniform distribution, we get I(X1 ; X3 ) ≤ I(X1 ; X2 ) = H(X2 ) − H(X2 | X1 ) ≤ H(X2 ) ≤ log k. Thus, the dependence between X1 and X3 is limited by the size of the bottleneck. That is I(X1 ; X3 ) ≤ log k . (b) For k = 1 , I(X1 ; X3 ) ≤ log 1 = 0 and since I(X1 , X3 ) ≥ 0 , I(X1 , X3 ) = 0 . Thus, for k = 1 , X1 and X3 are independent. 17. Pure randomness and bent coins. Let X 1 , X2 , . . . , Xn denote the outcomes of independent flips of a bent coin. Thus Pr {X i = 1} = p, Pr {Xi = 0} = 1 − p , where p is unknown. We wish to obtain a sequence Z 1 , Z2 , . . . , ZK of fair coin flips from X1 , X2 , . . . , Xn . Toward this end let f : X n → {0, 1}∗ , (where {0, 1}∗ = {Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences), be a mapping f (X1 , X2 , . . . , Xn ) = (Z1 , Z2 , . . . , ZK ) , where Zi ∼ Bernoulli ( 12 ) , and K may depend on (X1 , . . . , Xn ) . In order that the sequence Z1 , Z2 , . . . appear to be fair coin flips, the map f from bent coin flips to fair flips must have the property that all 2 k sequences (Z1 , Z2 , . . . 
, Zk ) of a given length k have equal probability (possibly 0), for k = 1, 2, . . . . For example, for n = 2 , the map f (01) = 0 , f (10) = 1 , f (00) = f (11) = Λ (the null string), has the property that Pr{Z1 = 1|K = 1} = Pr{Z1 = 0|K = 1} = 12 . Give reasons for the following inequalities: nH(p) (a) = (b) H(X1 , . . . , Xn ) ≥ H(Z1 , Z2 , . . . , ZK , K) (d) H(K) + E(K) (c) = = (e) ≥ H(K) + H(Z1 , . . . , ZK |K) EK. Thus no more than nH(p) fair coin tosses can be derived from (X 1 , . . . , Xn ) , on the average. Exhibit a good map f on sequences of length 4. Solution: Pure randomness and bent coins. nH(p) (a) = (b) ≥ H(X1 , . . . , Xn ) H(Z1 , Z2 , . . . , ZK ) Entropy, Relative Entropy and Mutual Information (c) = H(Z1 , Z2 , . . . , ZK , K) (d) H(K) + H(Z1 , . . . , ZK |K) = (e) = (f ) ≥ 21 H(K) + E(K) EK . (a) Since X1 , X2 , . . . , Xn are i.i.d. with probability of Xi = 1 being p , the entropy H(X1 , X2 , . . . , Xn ) is nH(p) . (b) Z1 , . . . , ZK is a function of X1 , X2 , . . . , Xn , and since the entropy of a function of a random variable is less than the entropy of the random variable, H(Z 1 , . . . , ZK ) ≤ H(X1 , X2 , . . . , Xn ) . (c) K is a function of Z1 , Z2 , . . . , ZK , so its conditional entropy given Z1 , Z2 , . . . , ZK is 0. Hence H(Z1 , Z2 , . . . , ZK , K) = H(Z1 , . . . , ZK ) + H(K|Z1 , Z2 , . . . , ZK ) = H(Z1 , Z2 , . . . , ZK ). (d) Follows from the chain rule for entropy. (e) By assumption, Z1 , Z2 , . . . , ZK are pure random bits (given K ), with entropy 1 bit per symbol. Hence H(Z1 , Z2 , . . . , ZK |K) = = ! k ! p(K = k)H(Z1 , Z2 , . . . , Zk |K = k) (2.22) p(k)k (2.23) k = EK. (2.24) (f) Follows from the non-negativity of discrete entropy. (g) Since we do not know p , the only way to generate pure random bits is to use the fact that all sequences with the same number of ones are equally likely. For example, the sequences 0001,0010,0100 and 1000 are equally likely and can be used to generate 2 pure random bits. An example of a mapping to generate random bits is 0000 → Λ 0001 → 00 0010 → 01 0100 → 10 1000 → 11 0011 → 00 0110 → 01 1100 → 10 1001 → 11 1010 → 0 0101 → 1 1110 → 11 1101 → 10 1011 → 01 0111 → 00 1111 → Λ (2.25) EK = 4pq 3 × 2 + 4p2 q 2 × 2 + 2p2 q 2 × 1 + 4p3 q × 2 (2.26) The resulting expected number of bits is = 8pq + 10p q + 8p q. 3 2 2 3 (2.27) 22 Entropy, Relative Entropy and Mutual Information For example, for p ≈ 12 , the expected number of pure random bits is close to 1.625. This is substantially less then the 4 pure random bits that could be generated if p were exactly 12 . We will now analyze the efficiency of this scheme of generating random bits for long sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm that we will use is the obvious extension of the above method of generating pure bits using the fact that all sequences with the same number of ones are equally likely. , – Consider all sequences with k ones. There are nk such sequences, which are ,n,nall equally likely. If k were a power of 2, then we could generate log pure k ,nrandom bits from such a set. However, in the general case, is not a power of k ,n2 and the best we can to is the divide the set of k elements into subset of sizes n which are powers of 2. The largest set would have a size 2 $log (k )% and could be , – used to generate *log nk + random bits. We could divide the remaining elements into the largest set which is a power of 2, etc. 
The worst case would occur when ,n= 2l+1 − 1 , in which case the subsets would be of sizes 2 l , 2l−1 , 2l−2 , . . . , 1 . k Instead of analyzing the scheme exactly, we will just find a lower bound on number ,n,nof random bits generated from a set of size k . Let l = *log k + . Then at least half of the elements belong to a set of size 2 l and would generate l random bits, at least 14 th belong to a set of size 2l−1 and generate l − 1 random bits, etc. On the average, the number of bits generated is 1 1 1 l + (l − 1) + · · · + l 1 2 4* 2 + 1 1 2 3 l−1 = l− 1 + + + + · · · + l−2 4 2 4 8 2 ≥ l − 1, E[K|k 1’s in sequence] ≥ (2.28) (2.29) (2.30) since the infinite series sums to 1. , – Hence the fact that nk is not a power of 2 will cost at most 1 bit on the average in the number of random bits that are produced. Hence, the expected number of pure random bits produced by this algorithm is EK ≥ ≥ = n ! k=0 . n ! k=0 . n ! k=0 ≥ . / . / n k n−k n p q *log − 1+ k k / . . / (2.31) / n n k n−k p q log −2 k k (2.32) n k n−k n p q log −2 k k (2.33) / ! n(p−!)≤k≤n(p+!) . / . / . / n k n−k n p q log − 2. k k (2.34) 23 Entropy, Relative Entropy and Mutual Information Now for sufficiently large n , the probability that the number of 1’s in the sequence is close to np is near 1 (by the weak law of large numbers). For such sequences, k n is close to p and hence there exists a δ such that . / k n ≥ 2n(H( n )−δ) ≥ 2n(H(p)−2δ) k (2.35) using Stirling’s approximation for the binomial coefficients and the continuity of the entropy function. If we assume that n is large enough so that the probability that n(p − %) ≤ k ≤ n(p + %) is greater than 1 − % , then we see that EK ≥ (1 − %)n(H(p) − 2δ) − 2 , which is very good since nH(p) is an upper bound on the number of pure random bits that can be produced from the bent coin sequence. 18. World Series. The World Series is a seven-game series that terminates as soon as either team wins four games. Let X be the random variable that represents the outcome of a World Series between teams A and B; possible values of X are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. Assuming that A and B are equally matched and that the games are independent, calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) . Solution: World Series. Two teams play until one of them has won 4 games. There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)4 . There are 8 = 2 ,4- There are 20 = 2 There are 40 = 2 3 ,5- World Series with 5 games. Each happens with probability (1/2) 5 . 3 World Series with 6 games. Each happens with probability (1/2) 6 . 3 World Series with 7 games. Each happens with probability (1/2) 7 . ,6- The probability of a 4 game series ( Y = 4 ) is 2(1/2) 4 = 1/8 . The probability of a 5 game series ( Y = 5 ) is 8(1/2) 5 = 1/4 . The probability of a 6 game series ( Y = 6 ) is 20(1/2) 6 = 5/16 . The probability of a 7 game series ( Y = 7 ) is 40(1/2) 7 = 5/16 . 1 p(x) = 2(1/16) log 16 + 8(1/32) log 32 + 20(1/64) log 64 + 40(1/128) log 128 H(X) = ! p(x)log = 5.8125 1 p(y) = 1/8 log 8 + 1/4 log 4 + 5/16 log(16/5) + 5/16 log(16/5) H(Y ) = ! p(y)log = 1.924 24 Entropy, Relative Entropy and Mutual Information Y is a deterministic function of X, so if you know X there is no randomness in Y. Or, H(Y |X) = 0 . Since H(X) + H(Y |X) = H(X, Y ) = H(Y ) + H(X|Y ) , it is easy to determine H(X|Y ) = H(X) + H(Y |X) − H(Y ) = 3.889 19. Infinite entropy. This problem shows that the entropy of a discrete random variable $ 2 −1 . 
(It is easy to show that A is finite by can be infinite. Let A = ∞ n=2 (n log n) bounding the infinite sum by the integral of (x log 2 x)−1 .) Show that the integervalued random variable X defined by Pr(X = n) = (An log 2 n)−1 for n = 2, 3, . . . , has H(X) = +∞ . Solution: Infinite entropy. By definition, p n = Pr(X = n) = 1/An log 2 n for n ≥ 2 . Therefore H(X) = − = − = = ∞ ! p(n) log p(n) n=2 ∞ 0 ! 1 0 1/An log 2 n log 1/An log 2 n n=2 ∞ ! log(An log 2 n) 1 An log2 n n=2 ∞ ! log A + log n + 2 log log n An log2 n n=2 = log A + ∞ ! ∞ ! 1 2 log log n + . An log n n=2 An log2 n n=2 The first term is finite. For base 2 logarithms, all the elements in the sum in the last term are nonnegative. (For any other base, the terms of the last sum eventually all become positive.) So all we have to do is bound the middle sum, which we do by comparing with an integral. ∞ ! 1 > An log n n=2 2 We conclude that H(X) = +∞ . 2 ∞ 3∞ 1 3 dx = K ln ln x 3 = +∞ . 2 Ax log x 20. Run length coding. Let X1 , X2 , . . . , Xn be (possibly dependent) binary random variables. Suppose one calculates the run lengths R = (R 1 , R2 , . . .) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2) . Compare H(X1 , X2 , . . . , Xn ) , H(R) and H(Xn , R) . Show all equalities and inequalities, and bound all the differences. Solution: Run length coding. Since the run lengths are a function of X 1 , X2 , . . . , Xn , H(R) ≤ H(X) . Any Xi together with the run lengths determine the entire sequence 25 Entropy, Relative Entropy and Mutual Information X1 , X2 , . . . , Xn . Hence H(X1 , X2 , . . . , Xn ) = H(Xi , R) (2.36) = H(R) + H(Xi |R) (2.37) ≤ H(R) + 1. (2.39) ≤ H(R) + H(Xi ) (2.38) 21. Markov’s inequality for probabilities. Let p(x) be a probability mass function. Prove, for all d ≥ 0 , * + 1 Pr {p(X) ≤ d} log ≤ H(X). (2.40) d Solution: Markov inequality applied to entropy. P (p(X) < d) log 1 d ! = p(x) log 1 d (2.41) p(x) log 1 p(x) (2.42) x:p(x) 0 , and therefore f (x) is strictly convex. Therefore a local minimum of the function is also a global minimum. The function has a local minimum at the point where f ‘ (x) = 0 , i.e., when x = 1 . Therefore f (x) ≥ f (1) , i.e., x − 1 − ln x ≥ 1 − 1 − ln 1 = 0 (2.49) which gives us the desired inequality. Equality occurs only if x = 1 . (b) We let A be the set of x such that p(x) > 0 . −De (p||q) = ! x∈A ! p(x)ln * q(x) p(x) (2.50) + q(x) ≤ p(x) −1 p(x) x∈A (2.51) = (2.52) ! x∈A ≤ 0 q(x) − ! p(x) x∈A (2.53) The first step follows from the definition of D , the second step follows from the inequality ln t ≤ t − 1 , the third step from expanding the sum, and the last step from the fact that the q(A) ≤ 1 and p(A) = 1 . 29 Entropy, Relative Entropy and Mutual Information (c) What are the conditions for equality? We have equality in the inequality ln t ≤ t − 1 if and only if t = 1 . Therefore we have equality in step 2 of the chain iff q(x)/p(x) = 1 for all x ∈ A . This implies that p(x) = q(x) for all x , and we have equality in the last step as well. Thus the condition for equality is that p(x) = q(x) for all x . 27. Grouping rule for entropy: Let p = (p 1 , p2 , . . . , pm ) be a probability distribution $ on m elements, i.e, pi ≥ 0 , and m i=1 pi = 1 . Define a new distribution q on m − 1 elements as q1 = p1 , q2 = p2 ,. . . , qm−2 = pm−2 , and qm−1 = pm−1 + pm , i.e., the distribution q is the same as p on {1, 2, . . . 
, m − 2} , and the probability of the last element in q is the sum of the last two probabilities of p . Show that H(p) = H(q) + (pm−1 + pm )H * + pm−1 pm , . pm−1 + pm pm−1 + pm (2.54) Solution: H(p) = − = − = − m ! pi log pi i=1 m−2 ! i=1 m−2 ! i=1 (2.55) pi log pi − pm−1 log pm−1 − pm log pm pi log pi − pm−1 log pm pm−1 − pm log pm−1 + pm pm−1 + pm (2.56) (2.57) −(pm−1 + pm ) log(pm−1 + pm ) (2.58) pm−1 pm = H(q) − pm−1 log − pm log (2.59) pm−1 + pm pm−1 + pm * + pm−1 pm−1 pm pm = H(q) − (pm−1 + pm ) log − log (2.60) pm−1 + pm pm−1 + pm pm−1 + pm pm−1 + pm * + pm−1 pm = H(q) + (pm−1 + pm )H2 , , (2.61) pm−1 + pm pm−1 + pm where H2 (a, b) = −a log a − b log b . 28. Mixing increases entropy. Show that the entropy of the probability distribution, (p1 , . . . , pi , . . . , pj , . . . , pm ) , is less than the entropy of the distribution p +p p +p (p1 , . . . , i 2 j , . . . , i 2 j , . . . , pm ) . Show that in general any transfer of probability that makes the distribution more uniform increases the entropy. Solution: Mixing increases entropy. This problem depends on the convexity of the log function. Let P1 = (p1 , . . . , pi , . . . , pj , . . . , pm ) pi + p j pj + p i P2 = (p1 , . . . , ,…, , . . . , pm ) 2 2 30 Entropy, Relative Entropy and Mutual Information Then, by the log sum inequality, pi + p j pi + p j ) log( ) + pi log pi + pj log pj 2 2 pi + p j = −(pi + pj ) log( ) + pi log pi + pj log pj 2 ≥ 0. H(P2 ) − H(P1 ) = −2( Thus, H(P2 ) ≥ H(P1 ). 29. Inequalities. Let X , Y and Z be joint random variables. Prove the following inequalities and find conditions for equality. (a) H(X, Y |Z) ≥ H(X|Z) . (b) I(X, Y ; Z) ≥ I(X; Z) . (c) H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X) . (d) I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z) . Solution: Inequalities. (a) Using the chain rule for conditional entropy, H(X, Y |Z) = H(X|Z) + H(Y |X, Z) ≥ H(X|Z), with equality iff H(Y |X, Z) = 0 , that is, when Y is a function of X and Z . (b) Using the chain rule for mutual information, I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X) ≥ I(X; Z), with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally independent given X . (c) Using first the chain rule for entropy and then the definition of conditional mutual information, H(X, Y, Z) − H(X, Y ) = H(Z|X, Y ) = H(Z|X) − I(Y ; Z|X) ≤ H(Z|X) = H(X, Z) − H(X) , with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally independent given X . (d) Using the chain rule for mutual information, I(X; Z|Y ) + I(Z; Y ) = I(X, Y ; Z) = I(Z; Y |X) + I(X; Z) , and therefore I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z) . We see that this inequality is actually an equality in all cases. 31 Entropy, Relative Entropy and Mutual Information 30. Maximum entropy. Find the probability mass function p(x) that maximizes the entropy H(X) of a non-negative integer-valued random variable X subject to the constraint EX = ∞ ! np(n) = A n=0 for a fixed value A > 0 . Evaluate this maximum H(X) . Solution: Maximum entropy Recall that, − ∞ ! i=0 pi log pi ≤ − ∞ ! pi log qi . i=0 Let qi = α(β)i . Then we have that, − ∞ ! i=0 pi log pi ≤ − ∞ ! pi log qi i=0 . = − log(α) ∞ ! pi + log(β) i=0 ∞ ! i=0 = − log α − A log β ipi / Notice that the final right hand side expression is independent of {p i } , and that the inequality, − holds for all α, β such that, ∞ ! i=0 pi log pi ≤ − log α − A log β ∞ ! αβ i = 1 = α i=0 1 . 1−β The constraint on the expected value also requires that, ∞ ! iαβ i = A = α i=0 β . 
(1 − β)2 Combining the two constraints we have, α β (1 − β)2 * α 1−β β = 1−β = A, = +* β 1−β + 32 Entropy, Relative Entropy and Mutual Information which implies that, A A+1 1 . A+1 β = α = So the entropy maximizing distribution is, 1 pi = A+1 * A A+1 +i . Plugging these values into the expression for the maximum entropy, − log α − A log β = (A + 1) log(A + 1) − A log A. The general form of the distribution, pi = αβ i can be obtained either by guessing or by Lagrange multipliers where, F (pi , λ1 , λ2 ) = − ∞ ! ∞ ! pi log pi + λ1 ( i=0 i=0 is the function whose gradient we set to 0. ∞ ! pi − 1) + λ2 ( i=0 ipi − A) To complete the argument with Lagrange multipliers, it is necessary to show that the local maximum is the global maximum. One possible argument is based on the fact that −H(p) is convex, it has only one local minima, no local maxima and therefore Lagrange multiplier actually gives the global maximum for H(p) . 31. Conditional entropy. Under what conditions does H(X | g(Y )) = H(X | Y ) ? Solution: (Conditional Entropy). If H(X|g(Y )) = H(X|Y ) , then H(X)−H(X|g(Y )) = H(X) − H(X|Y ) , i.e., I(X; g(Y )) = I(X; Y ) . This is the condition for equality in the data processing inequality. From the derivation of the inequality, we have equality iff X → g(Y ) → Y forms a Markov chain. Hence H(X|g(Y )) = H(X|Y ) iff X → g(Y ) → Y . This condition includes many special cases, such as g being oneto-one, and X and Y being independent. However, these two special cases do not exhaust all the possibilities. 32. Fano. We are given the following joint distribution on (X, Y ) Y X 1 2 3 a b c 1 6 1 12 1 12 1 12 1 6 1 12 1 12 1 12 1 6 Let X̂(Y ) be an estimator for X (based on Y) and let P e = Pr{X̂(Y ) %= X}. Entropy, Relative Entropy and Mutual Information 33 (a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe . (b) Evaluate Fano’s inequality for this problem and compare. Solution: (a) From inspection we see that X̂(y) =    1 y=a 2 y=b y=c   3 Hence the associated Pe is the sum of P (1, b), P (1, c), P (2, a), P (2, c), P (3, a) and P (3, b). Therefore, Pe = 1/2. (b) From Fano’s inequality we know H(X|Y ) − 1 . log |X | Pe ≥ Here, H(X|Y ) = H(X|Y = a) Pr{y = a} + H(X|Y = b) Pr{y = b} + H(X|Y = c) Pr{y = c} * + * + * + 1 1 1 1 1 1 1 1 1 = H , , , , , , Pr{y = a} + H Pr{y = b} + H Pr{y = c} 2 4 4 2 4 4 2 4 4 + * 1 1 1 , , (Pr{y = a} + Pr{y = b} + Pr{y = c}) = H 2 4 4 * + 1 1 1 = H , , 2 4 4 = 1.5 bits. Hence Pe ≥ 1.5 − 1 = .316. log 3 Hence our estimator X̂(Y ) is not very close to Fano’s bound in this form. If X̂ ∈ X , as it does here, we can use the stronger form of Fano’s inequality to get Pe ≥ H(X|Y ) − 1 . log(|X |-1) Pe ≥ 1.5 − 1 1 = . log 2 2 and Therefore our estimator X̂(Y ) is actually quite good. 33. Fano’s inequality. Let Pr(X = i) = p i , i = 1, 2, . . . , m and let p1 ≥ p2 ≥ p3 ≥ · · · ≥ pm . The minimal probability of error predictor of X is X̂ = 1 , with resulting probability of error Pe = 1 − p1 . Maximize H(p) subject to the constraint 1 − p 1 = Pe 34 Entropy, Relative Entropy and Mutual Information to find a bound on Pe in terms of H . This is Fano’s inequality in the absence of conditioning. Solution: (Fano’s Inequality.) The minimal probability of error predictor when there is no information is X̂ = 1 , the most probable value of X . The probability of error in this case is Pe = 1 − p1 . Hence if we fix Pe , we fix p1 . We maximize the entropy of X for a given Pe to obtain an upper bound on the entropy for a given P e . 
The entropy, H(p) = −p1 log p1 − = −p1 log p1 − m ! i=2 m ! i=2 * pi log pi Pe pi pi log − Pe log Pe Pe Pe p2 p3 pm , ,…, Pe Pe Pe ≤ H(Pe ) + Pe log(m − 1), = H(Pe ) + Pe H (2.62) + (2.63) (2.64) (2.65) 1 0 since the maximum of H Pp2e , Pp3e , . . . , pPme is attained by an uniform distribution. Hence any X that can be predicted with a probability of error P e must satisfy H(X) ≤ H(Pe ) + Pe log(m − 1), (2.66) which is the unconditional form of Fano’s inequality. We can weaken this inequality to obtain an explicit lower bound for Pe , Pe ≥ H(X) − 1 . log(m − 1) (2.67) 34. Entropy of initial conditions. Prove that H(X 0 |Xn ) is non-decreasing with n for any Markov chain. Solution: Entropy of initial conditions. For a Markov chain, by the data processing theorem, we have I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (2.68) Therefore H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (2.69) or H(X0 |Xn ) increases with n . 35. Relative entropy is not symmetric: Let the random variable X have three possible outcomes {a, b, c} . Consider two distributions on this random variable Symbol a b c p(x) 1/2 1/4 1/4 q(x) 1/3 1/3 1/3 Calculate H(p) , H(q) , D(p||q) and D(q||p) . Verify that in this case D(p||q) %= D(q||p) . 35 Entropy, Relative Entropy and Mutual Information Solution: 1 1 1 log 2 + log 4 + log 4 = 1.5 bits. 2 4 4 1 1 1 H(q) = log 3 + log 3 + log 3 = log 3 = 1.58496 bits. 3 3 3 1 3 1 3 1 3 D(p||q) = log + log + log = log(3) − 1.5 = 1.58496 − 1.5 = 0.08496 2 2 4 4 4 4 1 2 1 4 1 4 5 D(q||p) = log + log + log = −log(3) = 1.66666−1.58496 = 0.08170 3 3 3 3 3 3 3 H(p) = (2.70) (2.71) (2.72) (2.73) 36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) %= D(q||p) in general, there could be distributions for which equality holds. Give an example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p) (other than the trivial case p = q ). Solution: A simple case for D((p, 1 − p)||(q, 1 − q)) = D((q, 1 − q)||(p, 1 − p)) , i.e., for p log is when q = 1 − p . 1−p q 1−q p + (1 − p) log = q log + (1 − q) log q 1−q p 1−p (2.74) 37. Relative entropy: Let X, Y, Z be three random variables with a joint probability mass function p(x, y, z) . The relative entropy between the joint distribution and the product of the marginals is 4 p(x, y, z) D(p(x, y, z)||p(x)p(y)p(z)) = E log p(x)p(y)p(z) 5 (2.75) Expand this in terms of entropies. When is this quantity zero? Solution: 4 5 p(x, y, z) (2.76) p(x)p(y)p(z) = E[log p(x, y, z)] − E[log p(x)] − E[log p(y)] − E[log(2.77) p(z)] D(p(x, y, z)||p(x)p(y)p(z)) = E log = −H(X, Y, Z) + H(X) + H(Y ) + H(Z) (2.78) We have D(p(x, y, z)||p(x)p(y)p(z)) = 0 if and only p(x, y, z) = p(x)p(y)p(z) for all (x, y, z) , i.e., if X and Y and Z are independent. 38. The value of a question Let X ∼ p(x) , x = 1, 2, . . . , m . We are given a set S ⊆ {1, 2, . . . , m} . We ask whether X ∈ S and receive the answer Y = ) 1, 0, if X ∈ S if X ∈ % S. Suppose Pr{X ∈ S} = α . Find the decrease in uncertainty H(X) − H(X|Y ) . 36 Entropy, Relative Entropy and Mutual Information Apparently any set S with a given α is as good as any other. Solution: The value of a question. H(X) − H(X|Y ) = I(X; Y ) = H(Y ) − H(Y |X) = H(α) − H(Y |X) = H(α) since H(Y |X) = 0 . 39. Entropy and pairwise independence. Let X, Y, Z be three binary Bernoulli ( 21 ) random variables that are pairwise independent, that is, I(X; Y ) = I(X; Z) = I(Y ; Z) = 0 . (a) Under this constraint, what is the minimum value for H(X, Y, Z) ? (b) Give an example achieving this minimum. 
Solution: (a) H(X, Y, Z) = H(X, Y ) + H(Z|X, Y ) (2.79) ≥ H(X, Y ) (2.80) = 2. (2.81) So the minimum value for H(X, Y, Z) is at least 2. To show that is is actually equal to 2, we show in part (b) that this bound is attainable. (b) Let X and Y be iid Bernoulli( 12 ) and let Z = X ⊕ Y , where ⊕ denotes addition mod 2 (xor). 40. Discrete entropies Let X and Y be two independent integer-valued random variables. Let X be uniformly distributed over {1, 2, . . . , 8} , and let Pr{Y = k} = 2 −k , k = 1, 2, 3, . . . (a) Find H(X) (b) Find H(Y ) (c) Find H(X + Y, X − Y ) . Solution: (a) For a uniform distribution, H(X) = log m = log 8 = 3 . (b) For a geometric distribution, H(Y ) = $ k k2−k = 2 . (See solution to problem 2.1 Entropy, Relative Entropy and Mutual Information 37 (c) Since (X, Y ) → (X +Y, X −Y ) is a one to one transformation, H(X +Y, X −Y ) = H(X, Y ) = H(X) + H(Y ) = 3 + 2 = 5 . 41. Random questions One wishes to identify a random object X ∼ p(x) . A question Q ∼ r(q) is asked at random according to r(q) . This results in a deterministic answer A = A(x, q) ∈ {a1 , a2 , . . .} . Suppose X and Q are independent. Then I(X; Q, A) is the uncertainty in X removed by the question-answer (Q, A) . (a) Show I(X; Q, A) = H(A|Q) . Interpret. (b) Now suppose that two i.i.d. questions Q 1 , Q2 , ∼ r(q) are asked, eliciting answers A1 and A2 . Show that two questions are less valuable than twice a single question in the sense that I(X; Q1 , A1 , Q2 , A2 ) ≤ 2I(X; Q1 , A1 ) . Solution: Random questions. (a) I(X; Q, A) = H(Q, A) − H(Q, A, |X) = H(Q) + H(A|Q) − H(Q|X) − H(A|Q, X) = H(Q) + H(A|Q) − H(Q) = H(A|Q) The interpretation is as follows. The uncertainty removed in X by (Q, A) is the same as the uncertainty in the answer given the question. (b) Using the result from part a and the fact that questions are independent, we can easily obtain the desired relationship. I(X; Q1 , A1 , Q2 , A2 ) (a) = (b) = (c) = = (d) = (e) ≤ (f ) = I(X; Q1 ) + I(X; A1 |Q1 ) + I(X; Q2 |A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 ) I(X; A1 |Q1 ) + H(Q2 |A1 , Q1 ) − H(Q2 |X, A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 ) I(X; A1 |Q1 ) + I(X; A2 |A1 , Q1 , Q2 ) I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 ) − H(A2 |X, A1 , Q1 , Q2 ) I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 ) I(X; A1 |Q1 ) + H(A2 |Q2 ) 2I(X; A1 |Q1 ) (a) Chain rule. (b) X and Q1 are independent. 38 Entropy, Relative Entropy and Mutual Information (c) Q2 are independent of X , Q1 , and A1 . (d) A2 is completely determined given Q2 and X . (e) Conditioning decreases entropy. (f) Result from part a. 42. Inequalities. Which of the following inequalities are generally ≥, =, ≤ ? Label each with ≥, =, or ≤ . (a) (b) (c) (d) H(5X) vs. H(X) I(g(X); Y ) vs. I(X; Y ) H(X0 |X−1 ) vs. H(X0 |X−1 , X1 ) H(X1 , X2 , . . . , Xn ) vs. H(c(X1 , X2 , . . . , Xn )) , where c(x1 , x2 , . . . , xn ) is the Huffman codeword assigned to (x1 , x2 , . . . , xn ) . (e) H(X, Y )/(H(X) + H(Y )) vs. 1 Solution: (a) (b) (c) (d) X → 5X is a one to one mapping, and hence H(X) = H(5X) . By data processing inequality, I(g(X); Y ) ≤ I(X; Y ) . Because conditioning reduces entropy, H(X 0 |X−1 ) ≥ H(X0 |X−1 , X1 ) . H(X, Y ) ≤ H(X) + H(Y ) , so H(X, Y )/(H(X) + H(Y )) ≤ 1 . 43. Mutual information of heads and tails. (a) Consider a fair coin flip. What is the mutual information between the top side and the bottom side of the coin? (b) A 6-sided fair die is rolled. What is the mutual information between the top side and the front face (the side most facing you)? Solution: Mutual information of heads and tails. 
To prove (a) observe that I(T ; B) = H(B) − H(B|T ) = log 2 = 1 since B ∼ Ber(1/2) , and B = f (T ) . Here B, T stand for Bottom and Top respectively. To prove (b) note that having observed a side of the cube facing us F , there are four possibilities for the top T , which are equally probable. Thus, I(T ; F ) = H(T ) − H(T |F ) = log 6 − log 4 = log 3 − 1 since T has uniform distribution on {1, 2, . . . , 6} . 39 Entropy, Relative Entropy and Mutual Information 44. Pure randomness We wish to use a 3-sided coin to generate a probability mass function    A, X= B,   C, fair coin toss. Let the coin X have pA pB pC where pA , pB , pC are unknown. (a) How would you use 2 independent flips X 1 , X2 to generate (if possible) a Bernoulli( 12 ) random variable Z ? (b) What is the resulting maximum expected number of fair bits generated? Solution: (a) The trick here is to notice that for any two letters Y and Z produced by two independent tosses of our bent three-sided coin, Y Z has the same probability as ZY . So we can produce B ∼ Bernoulli( 21 ) coin flips by letting B = 0 when we get AB , BC or AC , and B = 1 when we get BA , CB or CA (if we get AA , BB or CC we don’t assign a value to B .) (b) The expected number of bits generated by the above scheme is as follows. We get one bit, except when the two flips of the 3-sided coin produce the same symbol. So the expected number of fair bits generated is 0 ∗ [P (AA) + P (BB) + P (CC)] + 1 ∗ [1 − P (AA) − P (BB) − P (CC)], or, 1 − p2A − p2B − p2C . (2.82) (2.83) 45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X < ∞ , then H(X) < ∞ . Solution: Let the distribution on the integers be p 1 , p2 , . . . . Then H(p) = − $ and E log X = pi logi = c < ∞ . $ pi logpi We will now find the maximum entropy distribution subject to the constraint on the expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have the following functional to optimize J(p) = − ! pi log pi − λ1 ! p i − λ2 ! pi log i (2.84) Differentiating with respect to p i and setting to zero, we find that the p i that maximizes $ the entropy set pi = aiλ , where a = 1/( iλ ) and λ chosed to meet the expected log constraint, i.e. ! ! iλ log i = c iλ (2.85) Using this value of pi , we can see that the entropy is finite. 40 Entropy, Relative Entropy and Mutual Information 46. Axiomatic definition of entropy. If we assume certain axioms for our measure of information, then we will be forced to use a logarithmic measure like entropy. Shannon used this to justify his initial definition of entropy. In this book, we will rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section. If a sequence of symmetric functions H m (p1 , p2 , . . . , pm ) satisfies the following properties, • Normalization: H2 0 1 1 2, 2 1 = 1, • Continuity: H2 (p, 1 − p) is a continuous function of p , • Grouping: Hm (p1 , p2 , . . . , pm ) = Hm−1 (p1 +p2 , p3 , . . . , pm )+(p1 +p2 )H2 prove that Hm must be of the form Hm (p1 , p2 , . . . , pm ) = − m ! pi log pi , m = 2, 3, . . . . 0 p2 p1 p1 +p2 , p1 +p2 (2.86) i=1 There are various other axiomatic formulations which also result in the same definition of entropy. See, for example, the book by Csiszár and Körner[4]. Solution: Axiomatic definition of entropy. This is a long solution, so we will first outline what we plan to do. 
First we will extend the grouping axiom by induction and prove that Hm (p1 , p2 , . . . , pm ) = Hm−k (p1 + p2 + · · · + pk , pk+1 , . . . , pm ) + * pk p1 ,..., (. 2.87) +(p1 + p2 + · · · + pk )Hk p1 + p 2 + · · · + p k p1 + p 2 + · · · + p k Let f (m) be the entropy of a uniform distribution on m symbols, i.e., f (m) = Hm * + 1 1 1 , ,..., . m m m (2.88) We will then show that for any two integers r and s , that f (rs) = f (r) + f (s) . We use this to show that f (m) = log m . We then show for rational p = r/s , that H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p) . By continuity, we will extend it to irrational p and finally by induction and grouping, we will extend the result to H m for m ≥ 2 . To begin, we extend the grouping axiom. For convenience in notation, we will let Sk = k ! (2.89) pi i=1 and we will denote H2 (q, 1 − q) as h(q) . Then we can write the grouping axiom as * + p2 Hm (p1 , . . . , pm ) = Hm−1 (S2 , p3 , . . . , pm ) + S2 h . S2 (2.90) 1 , 41 Entropy, Relative Entropy and Mutual Information Applying the grouping axiom again, we have * + p2 S2 * + * + p3 p2 + S2 h = Hm−2 (S3 , p4 , . . . , pm ) + S3 h S3 S2 .. . Hm (p1 , . . . , pm ) = Hm−1 (S2 , p3 , . . . , pm ) + S2 h = Hm−(k−1) (Sk , pk+1 , . . . , pm ) + k ! Si h i=2 * + pi . Si (2.91) (2.92) (2.93) (2.94) Now, we apply the same grouping axiom repeatedly to H k (p1 /Sk , . . . , pk /Sk ) , to obtain Hk * pk p1 ,..., Sk Sk + = H2 = * Sk−1 pk , Sk Sk + + * + k 1 ! pi Si h . Sk i=2 Si k−1 ! i=2 * Si pi /Sk h Sk Si /Sk + (2.95) (2.96) From (2.94) and (2.96), it follows that Hm (p1 , . . . , pm ) = Hm−k (Sk , pk+1 , . . . , pm ) + Sk Hk * pk p1 ,..., Sk Sk + , (2.97) which is the extended grouping axiom. Now we need to use an axiom that is not explicitly stated in the text, namely that the function Hm is symmetric with respect to its arguments. Using this, we can combine any set of arguments of Hm using the extended grouping axiom. 1 1 1 Let f (m) denote Hm ( m , m, . . . , m ). Consider 1 1 1 , ,..., ). mn mn mn By repeatedly applying the extended grouping axiom, we have f (mn) = Hmn ( 1 1 1 , ,..., ) mn mn mn 1 1 1 1 1 1 = Hmn−n ( , ,..., ) + Hn ( , . . . , ) m mn mn m n n 1 1 1 1 2 1 1 = Hmn−2n ( , , ,..., ) + Hn ( , . . . , ) m m mn mn m n n .. . f (mn) = Hmn ( 1 1 1 1 = Hm ( , . . . . ) + H( , . . . , ) m m n n = f (m) + f (n). (2.98) (2.99) (2.100) (2.101) (2.102) (2.103) (2.104) 42 Entropy, Relative Entropy and Mutual Information We can immediately use this to conclude that f (m k ) = kf (m) . Now, we will argue that H2 (1, 0) = h(1) = 0 . We do this by expanding H 3 (p1 , p2 , 0) ( p1 + p2 = 1 ) in two different ways using the grouping axiom H3 (p1 , p2 , 0) = H2 (p1 , p2 ) + p2 H2 (1, 0) (2.105) = H2 (1, 0) + (p1 + p2 )H2 (p1 , p2 ) (2.106) Thus p2 H2 (1, 0) = H2 (1, 0) for all p2 , and therefore H(1, 0) = 0 . We will also need to show that f (m + 1) − f (m) → 0 as m → ∞ . To prove this, we use the extended grouping axiom and write 1 1 ,..., ) m+1 m+1 m 1 1 1 )+ Hm ( , . . . , ) = h( m+1 m+1 m m 1 m = h( )+ f (m) m+1 m+1 f (m + 1) = Hm+1 ( and therefore f (m + 1) − m 1 f (m) = h( ). m+1 m+1 (2.107) (2.108) (2.109) (2.110) m 1 Thus lim f (m + 1) − m+1 f (m) = lim h( m+1 ). But by the continuity of H2 , it follows 1 ) = 0. that the limit on the right is h(0) = 0 . Thus lim h( m+1 Let us define an+1 = f (n + 1) − f (n) (2.111) 1 bn = h( ). n (2.112) and Then 1 f (n) + bn+1 n+1 n 1 ! = − ai + bn+1 n + 1 i=2 an+1 = − and therefore (n + 1)bn+1 = (n + 1)an+1 + n ! (2.113) (2.114) (2.115) ai . 
i=2 Therefore summing over n , we have N ! n=2 nbn = N ! (nan + an−1 + . . . + a2 ) = N n=2 N ! n=2 ai . (2.116) 43 Entropy, Relative Entropy and Mutual Information Dividing both sides by $N n=1 n = N (N + 1)/2 , we obtain $ N N 2 ! nbn an = $n=2 N N + 1 n=2 n=2 n (2.117) Now by continuity of H2 and the definition of bn , it follows that bn → 0 as n → ∞ . Since the right hand side is essentially an average of the b n ’s, it also goes to 0 (This can be proved more precisely using % ’s and δ ’s). Thus the left hand side goes to 0. We can then see that N 1 ! aN +1 = bN +1 − an (2.118) N + 1 n=2 also goes to 0 as N → ∞ . Thus f (n + 1) − f (n) → 0 asn → ∞. (2.119) We will now prove the following lemma Lemma 2.0.1 Let the function f (m) satisfy the following assumptions: • f (mn) = f (m) + f (n) for all integers m , n . • limn→∞ (f (n + 1) − f (n)) = 0 • f (2) = 1 , then the function f (m) = log 2 m . Proof of the lemma: Let P be an arbitrary prime number and let g(n) = f (n) − f (P ) log2 n log2 P (2.120) Then g(n) satisfies the first assumption of the lemma. Also g(P ) = 0 . Also if we let αn = g(n + 1) − g(n) = f (n + 1) − f (n) + f (P ) n log2 log2 P n+1 (2.121) then the second assumption in the lemma implies that lim α n = 0 . For an integer n , define 6 7 n = . P (2.122) n = n(1) P + l (2.123) n (1) Then it follows that n(1) < n/P , and 44 Entropy, Relative Entropy and Mutual Information where 0 ≤ l < P . From the fact that g(P ) = 0 , it follows that g(P n (1) ) = g(n(1) ) , and g(n) = g(n(1) ) + g(n) − g(P n(1) ) = g(n(1) ) + n−1 ! αi (2.124) i=P n(1) Just as we have defined n(1) from n , we can define n(2) from n(1) . Continuing this process, we can then write g(n) = g(n(k) ) + k ! j=1 Since n(k) ≤ n/P k , after k= 6   (i−1) n! i=P n(i)  αi  . 7 log n +1 log P (2.125) (2.126) terms, we have n(k) = 0 , and g(0) = 0 (this follows directly from the additive property of g ). Thus we can write g(n) = tn ! (2.127) αi i=1 the sum of bn terms, where bn ≤ P Since αn → 0 , it follows that Thus it follows that g(n) log2 n * + log n +1 . log P (2.128) → 0 , since g(n) has at most o(log 2 n) terms αi . f (n) f (P ) = n→∞ log n log2 P 2 lim (2.129) Since P was arbitrary, it follows that f (P )/ log 2 P = c for every prime number P . Applying the third axiom in the lemma, it follows that the constant is 1, and f (P ) = log2 P . For composite numbers N = P1 P2 . . . Pl , we can apply the first property of f and the prime number factorization of N to show that f (N ) = ! f (Pi ) = ! log2 Pi = log2 N. (2.130) Thus the lemma is proved. The lemma can be simplified considerably, if instead of the second assumption, we replace it by the assumption that f (n) is monotone in n . We will now argue that the only function f (m) such that f (mn) = f (m) + f (n) for all integers m, n is of the form f (m) = log a m for some base a . Let c = f (2) . Now f (4) = f (2 × 2) = f (2) + f (2) = 2c . Similarly, it is easy to see that f (2k ) = kc = c log 2 2k . We will extend this to integers that are not powers of 2. 45 Entropy, Relative Entropy and Mutual Information For any integer m , let r > 0 , be another integer and let 2 k ≤ mr < 2k+1 . 
Then by the monotonicity assumption on f , we have kc ≤ rf (m) < (k + 1)c (2.131) or k k+1 ≤ f (m) < c r r Now by the monotonicity of log , we have c (2.132) k k+1 ≤ log2 m < r r (2.133) Combining these two equations, we obtain 3 3 3 3 3f (m) − log 2 m 3 < 1 3 c 3 r (2.134) Since r was arbitrary, we must have f (m) = log2 m c (2.135) and we can identify c = 1 from the last assumption of the lemma. Now we are almost done. We have shown that for any uniform distribution on m outcomes, f (m) = Hm (1/m, . . . , 1/m) = log 2 m . We will now show that H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p). (2.136) To begin, let p be a rational number, r/s , say. Consider the extended grouping axiom for Hs 1 1 1 1 s−r s−r f (s) = Hs ( , . . . , ) = H( , . . . , , )+ f (s − r) s s s s s? (2.137) r r s−r s s−r = H2 ( , ) + f (s) + f (s − r) s s r s (2.138) Substituting f (s) = log 2 s , etc, we obtain * + * + r s−r r r s−r s−r H2 ( , ) = − log2 − 1 − log2 1 − . s s s s s s (2.139) Thus (2.136) is true for rational p . By the continuity assumption, (2.136) is also true at irrational p . To complete the proof, we have to extend the definition from H 2 to Hm , i.e., we have to show that ! Hm (p1 , . . . , pm ) = − pi log pi (2.140) 46 Entropy, Relative Entropy and Mutual Information for all m . This is a straightforward induction. We have just shown that this is true for m = 2 . Now assume that it is true for m = n − 1 . By the grouping axiom, Hn (p1 , . . . , pn ) = Hn−1 (p1 + p2 , p3 , . . . , pn ) * + p1 p2 +(p1 + p2 )H2 , p1 + p 2 p1 + p 2 = −(p1 + p2 ) log(p1 + p2 ) − − = − n ! pi log pi (2.142) (2.143) i=3 p1 p2 p2 p1 log − log p1 + p 2 p1 + p 2 p1 + p 2 p1 + p 2 n ! (2.141) pi log pi . (2.144) (2.145) i=1 Thus the statement is true for m = n , and by induction, it is true for all m . Thus we have finally proved that the only symmetric function that satisfies the axioms is Hm (p1 , . . . , pm ) = − m ! pi log pi . (2.146) i=1 The proof above is due to Rényi[11] 47. The entropy of a missorted file. A deck of n cards in order 1, 2, . . . , n is provided. One card is removed at random then replaced at random. What is the entropy of the resulting deck? Solution: The entropy of a missorted file. The heart of this problem is simply carefully counting the possible outcome states. There are n ways to choose which card gets mis-sorted, and, once the card is chosen, there are again n ways to choose where the card is replaced in the deck. Each of these shuffling actions has probability 1/n 2 . Unfortunately, not all of these n 2 actions results in a unique mis-sorted file. So we need to carefully count the number of distinguishable outcome states. The resulting deck can only take on one of the following three cases. • The selected card is at its original location after a replacement. • The selected card is at most one location away from its original location after a replacement. • The selected card is at least two locations away from its original location after a replacement. To compute the entropy of the resulting deck, we need to know the probability of each case. Case 1 (resulting deck is the same as the original): There are n ways to achieve this outcome state, one for each of the n cards in the deck. Thus, the probability associated with case 1 is n/n2 = 1/n . 
Entropy, Relative Entropy and Mutual Information 47 Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will have a probability of 2/n2 , since for each pair, there are two ways to achieve the swap, either by selecting the left-hand card and moving it one to the right, or by selecting the right-hand card and moving it one to the left. Case 3 (typical situation): None of the remaining actions “collapses”. They all result in unique outcome states, each with probability 1/n 2 . Of the n2 possible shuffling actions, n2 − n − 2(n − 1) of them result in this third case (we’ve simply subtracted the case 1 and case 2 situations above). The entropy of the resulting deck can be computed as follows. H(X) = = 1 2 n2 1 log(n) + (n − 1) 2 log( ) + (n2 − 3n + 2) 2 log(n2 ) n n 2 n 2n − 1 2(n − 1) log(n) − n n2 48. Sequence length. How much information does the length of a sequence give about the content of a sequence? Suppose we consider a Bernoulli (1/2) process {X i }. Stop the process when the first 1 appears. Let N designate this stopping time. Thus X N is an element of the set of all finite length binary sequences {0, 1} ∗ = {0, 1, 00, 01, 10, 11, 000, . . .}. (a) Find I(N ; X N ). (b) Find H(X N |N ). (c) Find H(X N ). Let’s now consider a different stopping time. For this part, again assume X i ∼ Bernoulli (1/2) but stop at time N = 6 , with probability 1/3 and stop at time N = 12 with probability 2/3. Let this stopping time be independent of the sequence X 1 X2 . . . X12 . (d) Find I(N ; X N ). (e) Find H(X N |N ). (f) Find H(X N ). Solution: (a) I(X N ; N ) = = H(N ) − H(N |X N ) H(N ) − 0 48 Entropy, Relative Entropy and Mutual Information I(X ; N ) N (a) = E(N ) = 2 where (a) comes from the fact that the entropy of a geometric random variable is just the mean. (b) Since given N we know that Xi = 0 for all i < N and XN = 1, H(X N |N ) = 0. (c) H(X N ) = I(X N ; N ) + H(X N |N ) = I(X N ; N ) + 0 H(X N ) = 2. (d) I(X N ; N ) = H(N ) − H(N |X N ) = H(N ) − 0 I(X ; N ) = HB (1/3) N (e) 2 1 H(X 6 |N = 6) + H(X 12 |N = 12) 3 3 1 2 6 12 = H(X ) + H(X ) 3 3 1 2 = 6 + 12 3 3 H(X N |N ) = 10. H(X N |N ) = (f) H(X N ) = I(X N ; N ) + H(X N |N ) = I(X N ; N ) + 10 H(X N ) = H(1/3) + 10. Chapter 3 The Asymptotic Equipartition Property 1. Markov’s inequality and Chebyshev’s inequality. (a) (Markov’s inequality.) For any non-negative random variable X and any t > 0 , show that EX Pr {X ≥ t} ≤ . (3.1) t Exhibit a random variable that achieves this inequality with equality. (b) (Chebyshev’s inequality.) Let Y be a random variable with mean µ and variance σ 2 . By letting X = (Y − µ)2 , show that for any % > 0 , Pr {|Y − µ| > %} ≤ σ2 . %2 (3.2) (c) (The weak law of large numbers.) Let Z 1 , Z2 , . . . , Zn be a sequence of i.i.d. random $ variables with mean µ and variance σ 2 . Let Z n = n1 ni=1 Zi be the sample mean. Show that 3 @3 A σ2 3 3 Pr 3Z n − µ3 > % ≤ 2 . (3.3) n% @3 3 3 3 A Thus Pr 3Z n − µ3 > % → 0 as n → ∞ . This is known as the weak law of large numbers. Solution: Markov’s inequality and Chebyshev’s inequality. (a) If X has distribution F (x) , EX = 2 ∞ xdF 0 = 2 49 0 δ xdF + 2 δ ∞ xdF 50 The Asymptotic Equipartition Property ≥ ≥ 2 ∞ 2δ ∞ xdF δdF δ = δ Pr{X ≥ δ}. Rearranging sides and dividing by δ we get, Pr{X ≥ δ} ≤ EX . δ (3.4) One student gave a proof based on conditional expectations. It goes like EX = E(X|X ≤ δ) Pr{X ≥ δ} + E(X|X < δ) Pr{X < δ} ≥ E(X|X ≤ δ) Pr{X ≥ δ} ≥ δ Pr{X ≥ δ}, which leads to (3.4) as well. 
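As a quick numerical sanity check of the bound just derived (an added sketch, not part of the original solution), the following Python snippet estimates Pr{X ≥ t} by simulation for an exponential random variable and compares it with the Markov bound EX/t; the choice of distribution, sample size, and thresholds here are illustrative assumptions, not anything specified in the problem.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(scale=1.0, size=n)   # non-negative samples with EX = 1

for t in [0.5, 1.0, 2.0, 5.0]:
    empirical = np.mean(x >= t)          # estimate of Pr{X >= t}
    markov = np.mean(x) / t              # Markov bound EX / t
    print(f"t={t:3.1f}  Pr(X>=t) ~ {empirical:.4f}   EX/t = {markov:.4f}")

For every threshold the empirical tail probability stays below EX/t; the bound is loose in general, and the two-point distribution that achieves it with equality is described next.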
Given δ , the distribution achieving Pr{X ≥ δ} = is X= ) EX , δ δ with probability µδ 0 with probability 1 − µδ , where µ ≤ δ . (b) Letting X = (Y − µ)2 in Markov’s inequality, Pr{(Y − µ)2 > %2 } ≤ Pr{(Y − µ)2 ≥ %2 } E(Y − µ)2 ≤ %2 2 σ = , %2 and noticing that Pr{(Y − µ)2 > %2 } = Pr{|Y − µ| > %} , we get, Pr{|Y − µ| > %} ≤ σ2 . %2 (c) Letting Y in Chebyshev’s inequality from part (b) equal Z¯n , and noticing that 2 E Z¯n = µ and Var(Z¯n ) = σn (ie. Z¯n is the sum of n iid r.v.’s, Zni , each with 2 variance nσ2 ), we have, σ2 Pr{|Z¯n − µ| > %} ≤ 2 . n% 51 The Asymptotic Equipartition Property 2. AEP and mutual information. Let (Xi , Yi ) be i.i.d. ∼ p(x, y) . We form the log likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of 1 p(X n )p(Y n ) log ? n p(X n , Y n ) Solution: 1 p(X n )p(Y n ) log n p(X n , Y n ) = = n B 1 p(Xi )p(Yi ) log n p(Xi , Yi ) i=1 n 1! p(Xi )p(Yi ) log n i=i p(Xi , Yi ) p(Xi )p(Yi ) ) p(Xi , Yi ) −I(X; Y ) → E(log = n n )p(Y ) −nI(X;Y ) , which will converge to 1 if X and Y are indeed Thus, p(X p(X n ,Y n ) → 2 independent. 3. Piece of cake A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions: P = ) ( 32 , 13 ) w.p. ( 25 , 35 ) w.p. 3 4 1 4 Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 35 . Cutting and choosing from this piece might reduce it to size ( 35 )( 23 ) at time 2, and so on. How large, to first order in the exponent, is the piece of cake after n cuts? Solution: Let Ci be the fraction of the piece of cake that is cut at the i th cut, and let C Tn be the fraction of cake left after n cuts. Then we have T n = C1 C2 . . . Cn = ni=1 Ci . Hence, as in Question 2 of Homework Set #3, lim n 1 1! log Tn = lim log Ci n n i=1 = E[log C1 ] 3 2 1 3 = log + log . 4 3 4 5 52 The Asymptotic Equipartition Property 4. AEP $ Let Xi be iid ∼ p(x), x ∈ {1, 2, . . . , m} . Let µ = EX, and H = − p(x) log p(x). Let $ An = {xn ∈ X n : | − n1 log p(xn ) − H| ≤ %} . Let B n = {xn ∈ X n : | n1 ni=1 Xi − µ| ≤ %} . (a) Does Pr{X n ∈ An } −→ 1 ? (b) Does Pr{X n ∈ An ∩ B n } −→ 1 ? (c) Show |An ∩ B n | ≤ 2n(H+!) , for all n . (d) Show |An ∩ B n | ≥ ( 21 )2n(H−!) , for n sufficiently large. Solution: (a) Yes, by the AEP for discrete random variables the probability X n is typical goes to 1. (b) Yes, by the Strong Law of Large Numbers P r(X n ∈ B n ) → 1 . So there exists % > 0 and N1 such that P r(X n ∈ An ) > 1 − 2! for all n > N1 , and there exists N2 such that P r(X n ∈ B n ) > 1 − 2! for all n > N2 . So for all n > max(N1 , N2 ) : P r(X n ∈ An ∩ B n ) = P r(X n ∈ An ) + P r(X n ∈ B n ) − P r(X n ∈ An ∪ B n ) % % > 1− +1− −1 2 2 = 1−% So for any % > 0 there exists N = max(N1 , N2 ) such that P r(X n ∈ An ∩ B n ) > 1 − % for all n > N , therefore P r(X n ∈ An ∩ B n ) → 1 . $ n n n (c) By the law of total probability xn ∈An ∩B n p(x ) ≤ 1 . Also, for x ∈ A , from Theorem 3.1.2 in the text, p(xn ) ≥ 2−n(H+!) . Combining these two equations gives $ $ 1 ≥ xn ∈An ∩B n p(xn ) ≥ xn ∈An ∩B n 2−n(H+!) = |An ∩ B n |2−n(H+!) . Multiplying through by 2n(H+!) gives the result |An ∩ B n | ≤ 2n(H+!) . (d) Since from (b) P r{X n ∈ An ∩ B n } → 1 , there exists N such that P r{X n ∈ An ∩ B n } ≥ 21 for all n > N . From Theorem 3.1.2 in the text, for x n ∈ An , $ n p(xn ) ≤ 2−n(H−!) . So combining these two gives 21 ≤ xn ∈An ∩B n p(x ) ≤ $ −n(H−!) = |An ∩ B n |2−n(H−!) 
. Multiplying through by 2n(H−!) gives xn ∈An ∩B n 2 n the result |A ∩ B n | ≥ ( 21 )2n(H−!) for n sufficiently large. 5. Sets defined by probabilities. Let X1 , X2 , . . . be an i.i.d. sequence of discrete random variables with entropy H(X). Let Cn (t) = {xn ∈ X n : p(xn ) ≥ 2−nt } denote the subset of n -sequences with probabilities ≥ 2 −nt . (a) Show |Cn (t)| ≤ 2nt . (b) For what values of t does P ({X n ∈ Cn (t)}) → 1? Solution: 53 The Asymptotic Equipartition Property (a) Since the total probability of all sequences is less than 1, |C n (t)| minxn ∈Cn (t) p(xn ) ≤ 1 , and hence |Cn (t)| ≤ 2nt . (b) Since − n1 log p(xn ) → H , if t < H , the probability that p(x n ) > 2−nt goes to 0, and if t > H , the probability goes to 1. 6. An AEP-like limit. Let X1 , X2 , . . . be i.i.d. drawn according to probability mass function p(x). Find 1 lim [p(X1 , X2 , . . . , Xn )] n . n→∞ Solution: An AEP-like limit. X1 , X2 , . . . , i.i.d. ∼ p(x) . Hence log(Xi ) are also i.i.d. and 1 lim(p(X1 , X2 , . . . , Xn )) n 1 = lim 2log(p(X1 ,X2 ,…,Xn )) n 1 = 2lim n = 2 $ log p(Xi ) E(log(p(X))) −H(X) = 2 a.e. a.e. a.e. by the strong law of large numbers (assuming of course that H(X) exists). 7. The AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995 . The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer ones. (a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones. (b) Calculate the probability of observing a source sequence for which no codeword has been assigned. (c) Use Chebyshev’s inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b). Solution: The AEP and source coding. (a) The number of 100-bit binary sequences with three or fewer ones is . / . / . / . 100 100 100 100 + + + 0 1 2 3 / = 1 + 100 + 4950 + 161700 = 166751 . The required codeword length is 2log 2 1667513 = 18 . (Note that H(0.005) = 0.0454 , so 18 is quite a bit larger than the 4.5 bits of entropy.) (b) The probability that a 100-bit sequence has three or fewer ones is . / 3 ! 100 i=0 i (0.005)i (0.995)100−i = 0.60577 + 0.30441 + 0.7572 + 0.01243 = 0.99833 54 The Asymptotic Equipartition Property Thus the probability that the sequence that is generated cannot be encoded is 1 − 0.99833 = 0.00167 . (c) In the case of a random variable S n that is the sum of n i.i.d. random variables X1 , X2 , . . . , Xn , Chebyshev’s inequality states that Pr(|Sn − nµ| ≥ %) ≤ nσ 2 , %2 where µ and σ 2 are the mean and variance of Xi . (Therefore nµ and nσ 2 are the mean and variance of Sn .) In this problem, n = 100 , µ = 0.005 , and σ 2 = (0.005)(0.995) . Note that S100 ≥ 4 if and only if |S100 − 100(0.005)| ≥ 3.5 , so we should choose % = 3.5 . Then Pr(S100 ≥ 4) ≤ 100(0.005)(0.995) ≈ 0.04061 . (3.5)2 This bound is much larger than the actual probability 0.00167. 8. Products. Let X=    1, 2,   3, 1 2 1 4 1 4 Let X1 , X2 , . . . be drawn i.i.d. according to this distribution. Find the limiting behavior of the product 1 (X1 X2 · · · Xn ) n . Solution: Products. Let 1 Pn = (X1 X2 . . . Xn ) n . Then log Pn = n 1! log Xi → E log X, n i=1 (3.5) (3.6) with probability 1, by the strong law of large numbers. Thus P n → 2E log X with prob. 1. 
We can easily calculate E log X = 12 log 1 + 14 log 2 + 41 log 3 = 14 log 6 , and therefore 1 Pn → 2 4 log 6 = 1.565 . 9. AEP. Let X1 , X2 , . . . be independent identically distributed random variables drawn according to the probability mass function p(x), x ∈ {1, 2, . . . , m} . Thus p(x 1 , x2 , . . . , xn ) = Cn that − n1 log p(X1 , X2 , . . . , Xn ) → H(X) in probability. Let i=1 p(xi ) . We know Cn q(x1 , x2 , . . . , xn ) = i=1 q(xi ), where q is another probability mass function on {1, 2, . . . , m} . (a) Evaluate lim − n1 log q(X1 , X2 , . . . , Xn ) , where X1 , X2 , . . . are i.i.d. ∼ p(x) . q(X1 ,…,Xn ) (b) Now evaluate the limit of the log likelihood ratio n1 log p(X when X1 , X2 , . . . 1 ,…,Xn ) are i.i.d. ∼ p(x) . Thus the odds favoring q are exponentially small when p is true. 55 The Asymptotic Equipartition Property Solution: (AEP). (a) Since the X1 , X2 , . . . , Xn are i.i.d., so are q(X1 ), q(X2 ), . . . , q(Xn ) , and hence we can apply the strong law of large numbers to obtain 1 1! lim − log q(X1 , X2 , . . . , Xn ) = lim − log q(Xi ) n n = −E(log q(X)) w.p. 1 = − ! ! p(x) log q(x) (3.7) (3.8) (3.9) p(x) ! − p(x) log p(x) (3.10) q(x) = D(p||q) + H(p). (3.11) = p(x) log (b) Again, by the strong law of large numbers, 1 q(X1 , X2 , . . . , Xn ) lim − log n p(X1 , X2 , . . . , Xn ) 1! q(Xi ) log n p(Xi ) q(X) −E(log ) w.p. 1 p(X) ! q(x) − p(x) log p(x) ! p(x) p(x) log q(x) D(p||q). = lim − (3.12) = (3.13) = = = (3.14) (3.15) (3.16) 10. Random box size. An n -dimensional rectangular box with sides X 1 , X2 , X3 , . . . , Xn C is to be constructed. The volume is Vn = ni=1 Xi . The edge length l of a n -cube 1/n with the same volume as the random box is l = V n . Let X1 , X2 , . . . be i.i.d. uniform 1/n random variables over the unit interval [0, 1]. Find lim n→∞ Vn , and compare to 1 (EVn ) n . Clearly the expected edge length does not capture the idea of the volume of the box. The geometric mean, rather than the arithmetic mean, characterizes the behavior of products. C Solution: Random box size. The volume V n = ni=1 Xi is a random variable, since the Xi are random variables uniformly distributed on [0, 1] . V n tends to 0 as n → ∞ . However 1 1 1! loge Vnn = loge Vn = loge Xi → E(log e (X)) a.e. n n by the Strong Law of Large Numbers, since X i and loge (Xi ) are i.i.d. and E(log e (X)) < ∞ . Now 2 E(loge (Xi )) = 1 0 loge (x) dx = −1 Hence, since ex is a continuous function, 1 1 lim Vnn = elimn→∞ n loge Vn = n→∞ 1 1 < . e 2 56 The Asymptotic Equipartition Property Thus the “effective” edge length of this solid is e −1 . Note that since the Xi ’s are C independent, E(Vn ) = E(Xi ) = ( 21 )n . Also 12 is the arithmetic mean of the random variable, and 1e is the geometric mean. 11. Proof of Theorem 3.3.1. This problem shows that the size of the smallest “probable” (n) set is about 2nH . Let X1 , X2 , . . . , Xn be i.i.d. ∼ p(x) . Let Bδ ⊂ X n such that (n) Pr(Bδ ) > 1 − δ . Fix % < 21 . (a) Given any two sets A , B such that Pr(A) > 1 − % 1 and Pr(B) > 1 − %2 , show (n) (n) that Pr(A ∩ B) > 1 − %1 − %2 . Hence Pr(A! ∩ Bδ ) ≥ 1 − % − δ. (b) Justify the steps in the chain of inequalities (n) (3.17) n p(x ) (3.18) 2−n(H−!) (3.19) (n) (3.20) 1 − % − δ ≤ Pr(A(n) ! ∩ Bδ ) = ! (n) (n) A! ∩Bδ ≤ ! (n) (n) A! ∩Bδ −n(H−!) = |A(n) ! ∩ Bδ |2 ≤ (c) Complete the proof of the theorem. (n) |Bδ |2−n(H−!) . (3.21) Solution: Proof of Theorem 3.3.1. (a) Let Ac denote the complement of A . Then P (Ac ∪ B c ) ≤ P (Ac ) + P (B c ). (3.22) Since P (A) ≥ 1 − %1 , P (Ac ) ≤ %1 . 
Similarly, P (B c ) ≤ %2 . Hence P (A ∩ B) = 1 − P (Ac ∪ B c ) ≥ 1 − P (A ) − P (B ) c c ≥ 1 − % 1 − %2 . (3.23) (3.24) (3.25) (b) To complete the proof, we have the following chain of inequalities 1−%−δ (a) ≤ (b) = (n) (3.26) p(xn ) (3.27) 2−n(H−!) (3.28) (n) (3.29) Pr(A(n) ! ∩ Bδ ) ! (n) (n) A! ∩Bδ (c) ≤ (d) = (e) ≤ ! (n) (n) A! ∩Bδ −n(H−!) |A(n) ! ∩ Bδ |2 (n) |Bδ |2−n(H−!) . (3.30) The Asymptotic Equipartition Property 57 where (a) follows from the previous part, (b) follows by definition of probability of a set, (c) follows from the fact that the probability of elements of the typical set are (n) (n) bounded by 2−n(H−!) , (d) from the definition of |A! ∩ Bδ | as the cardinality (n) (n) (n) (n) (n) of the set A! ∩ Bδ , and (e) from the fact that A! ∩ Bδ ⊆ Bδ . 12. Monotonic convergence of the empirical distribution. Let p̂ n denote the empirical probability mass function corresponding to X 1 , X2 , . . . , Xn i.i.d. ∼ p(x), x ∈ X . Specifically, n 1! I(Xi = x) p̂n (x) = n i=1 is the proportion of times that Xi = x in the first n samples, where I is the indicator function. (a) Show for X binary that ED(p̂2n 5 p) ≤ ED(p̂n 5 p). Thus the expected relative entropy “distance” from the empirical distribution to the true distribution decreases with sample size. Hint: Write p̂2n = 21 p̂n + 12 p̂’n and use the convexity of D . (b) Show for an arbitrary discrete X that ED(p̂n 5 p) ≤ ED(p̂n−1 5 p). Hint: Write p̂n as the average of n empirical mass functions with each of the n samples deleted in turn. Solution: Monotonic convergence of the empirical distribution. (a) Note that, p̂2n (x) = = = 2n 1 ! I(Xi = x) 2n i=1 n 2n 11! 11 ! I(Xi = x) + I(Xi = x) 2 n i=1 2 n i=n+1 1 1 p̂n (x) + p̂’n (x). 2 2 Using convexity of D(p||q) we have that, 1 1 1 1 D(p̂2n ||p) = D( p̂n + p̂’n || p + p) 2 2 2 2 1 1 ≤ D(p̂n ||p) + D(p̂’n ||p). 2 2 Taking expectations and using the fact the X i ’s are identically distributed we get, ED(p̂2n ||p) ≤ ED(p̂n ||p). 58 The Asymptotic Equipartition Property (b) The trick to this part is similar to part a) and involves rewriting p̂ n in terms of p̂n−1 . We see that, p̂n = ! 1 n−1 I(Xn = x) I(Xi = x) + n i=0 n p̂n = 1! I(Xj = x) , I(Xi = x) + n i+=j n or in general, where j ranges from 1 to n . Summing over j we get, n n−1! np̂n = p̂j + p̂n , n j=1 n−1 or, n 1! p̂j p̂n = n j=1 n−1 where, n ! j=1 p̂jn−1 = 1 ! I(Xi = x). n − 1 i+=j Again using the convexity of D(p||q) and the fact that the D(p̂ jn−1 ||p) are identically distributed for all j and hence have the same expected value, we obtain the final result. (n) 13. Calculation of typical set To clarify the notion of a typical set A ! and the smallest (n) set of high probability Bδ , we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, X 1 , X2 , . . . , Xn , where the probability that Xi = 1 is 0.6 (and therefore the probability that X i = 0 is 0.4). (a) Calculate H(X) . (n) (b) With n = 25 and % = 0.1 , which sequences fall in the typical set A ! ? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with k 1’s, 0 ≤ k ≤ 25 , and finding those sequences that are in the typical set.) 
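The table of binomial probabilities referred to in the parenthetical above, and reproduced below, can be generated with a short script along the following lines. This is a sketch added for convenience, not part of the original solution; it uses only the parameters given in the problem (n = 25, p = 0.6, ε = 0.1) and flags the rows belonging to the typical set.

from math import comb, log2

n, p, eps = 25, 0.6, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)              # H(X) ~ 0.97095 bits

cum = 0.0
for k in range(n + 1):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)       # Pr{sequence has k ones}
    cum += prob                                       # cumulative probability F(k)
    per_symbol = -(k * log2(p) + (n - k) * log2(1 - p)) / n   # -(1/n) log p(x^n)
    typical = abs(per_symbol - H) <= eps              # membership in the typical set
    print(f"{k:2d} {comb(n, k):10d} {prob:.6f} {cum:.6f} {per_symbol:.6f} {'*' if typical else ''}")

Summing the probabilities and counts of the starred rows reproduces the typical-set probability and size computed in the solution below.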
59 The Asymptotic Equipartition Property k 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ,nk 1 25 300 2300 12650 53130 177100 480700 1081575 2042975 3268760 4457400 5200300 5200300 4457400 3268760 2042975 1081575 480700 177100 53130 12650 2300 300 25 1 ,n k pk (1 − p)n−k 0.000000 0.000000 0.000000 0.000001 0.000007 0.000054 0.000227 0.001205 0.003121 0.013169 0.021222 0.077801 0.075967 0.267718 0.146507 0.575383 0.151086 0.846448 0.079986 0.970638 0.019891 0.997633 0.001937 0.999950 0.000047 0.000003 − n1 log p(xn ) 1.321928 1.298530 1.275131 1.251733 1.228334 1.204936 1.181537 1.158139 1.134740 1.111342 1.087943 1.064545 1.041146 1.017748 0.994349 0.970951 0.947552 0.924154 0.900755 0.877357 0.853958 0.830560 0.807161 0.783763 0.760364 0.736966 (c) How many elements are there in the smallest set that has probability 0.9? (d) How many elements are there in the intersection of the sets in part (b) and (c)? What is the probability of this intersection? Solution: (a) H(X) = −0.6 log 0.6 − 0.4 log 0.4 = 0.97095 bits. (n) (b) By definition, A! for % = 0.1 is the set of sequences such that − n1 log p(xn ) lies in the range (H(X)−%, H(X)+%) , i.e., in the range (0.87095, 1.07095). Examining the last column of the table, it is easy to see that the typical set is the set of all sequences with the number of ones lying between 11 and 19. The probability of the typical set can be calculated from cumulative probability column. The probability that the number of 1’s lies between 11 and 19 is equal to F (19) − F (10) = 0.970638 − 0.034392 = 0.936246 . Note that this is greater than 1 − % , i.e., the n is large enough for the probability of the typical set to be greater than 1 − % . 60 The Asymptotic Equipartition Property The number of elements in the typical set can be found using the third column. 19 ! . / . / . / 19 10 ! ! n n n = = − = 33486026 − 7119516 = 26366510. k k k k=11 k=0 k=0 (3.31) (n) Note that the upper and lower bounds for the size of the A ! can be calculated as 2n(H+!) = 225(0.97095+0.1) = 226.77 = 1.147365 × 108 , and (1 − %)2n(H−!) = 0.9 × 225(0.97095−0.1) = 0.9 × 221.9875 = 3742308 . Both bounds are very loose! |A(n) ! | (n) (c) To find the smallest set Bδ of probability 0.9, we can imagine that we are filling a bag with pieces such that we want to reach a certain weight with the minimum number of pieces. To minimize the number of pieces that we use, we should use the largest possible pieces. In this case, it corresponds to using the sequences with the highest probability. Thus we keep putting the high probability sequences into this set until we reach a total probability of 0.9. Looking at the fourth column of the table, it is clear that the probability of a sequence increases monotonically with k . Thus the set consists of sequences of k = 25, 24, . . . , until we have a total probability 0.9. (n) Using the cumulative probability column, it follows that the set B δ consist of sequences with k ≥ 13 and some sequences with k = 12 . The sequences with (n) k ≥ 13 provide a total probability of 1−0.153768 = 0.846232 to the set B δ . The remaining probability of 0.9 − 0.846232 = 0.053768 should come from sequences with k = 12 . The number of such sequences needed to fill this probability is at least 0.053768/p(xn ) = 0.053768/1.460813×10−8 = 3680690.1 , which we round up to 3680691. Thus the smallest set with probability 0.9 has 33554432 − 16777216 + (n) 3680691 = 20457907 sequences. 
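This greedy construction is easy to carry out numerically. The following Python sketch (an addition for illustration, not part of the original solution) accumulates whole classes of sequences from the largest k downward and then takes just enough sequences from the boundary class to reach the target probability of 0.9:

from math import comb, ceil

n, p, target = 25, 0.6, 0.9

total_prob, total_count = 0.0, 0
for k in range(n, -1, -1):                        # most probable sequences first (more 1s)
    seq_prob = p**k * (1 - p)**(n - k)            # probability of one sequence with k ones
    class_prob = comb(n, k) * seq_prob
    if total_prob + class_prob < target:
        total_prob += class_prob                  # take the whole class of k-one sequences
        total_count += comb(n, k)
    else:
        extra = ceil((target - total_prob) / seq_prob)   # sequences needed from this class
        total_count += extra
        total_prob += extra * seq_prob
        break

print(total_count, total_prob)    # size of the smallest set and its probability (>= 0.9)

Running it reproduces the counts worked out by hand next.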
Note that the set B δ is not uniquely defined – it could include any 3680691 sequences with k = 12 . However, the size of the smallest set is well defined. (n) (n) (d) The intersection of the sets A! and Bδ in parts (b) and (c) consists of all sequences with k between 13 and 19, and 3680691 sequences with k = 12 . The probability of this intersection = 0.970638 − 0.153768 + 0.053768 = 0.870638 , and the size of this intersection = 33486026 − 16777216 + 3680691 = 20389501 . Chapter 4 Entropy Rates of a Stochastic Process 1. Doubly stochastic matrices. An n × n matrix P = [P ij ] is said to be doubly $ $ stochastic if Pij ≥ 0 and j Pij = 1 for all i and i Pij = 1 for all j . An n × n matrix P is said to be a permutation matrix if it is doubly stochastic and there is precisely one Pij = 1 in each row and each column. It can be shown that every doubly stochastic matrix can be written as the convex combination of permutation matrices. $ (a) Let at = (a1 , a2 , . . . , an ) , ai ≥ 0 , ai = 1 , be a probability vector. Let b = aP , where P is doubly stochastic. Show that b is a probability vector and that H(b1 , b2 , . . . , bn ) ≥ H(a1 , a2 , . . . , an ) . Thus stochastic mixing increases entropy. (b) Show that a stationary distribution µ for a doubly stochastic matrix P is the uniform distribution. (c) Conversely, prove that if the uniform distribution is a stationary distribution for a Markov transition matrix P , then P is doubly stochastic. Solution: Doubly Stochastic Matrices. (a) H(b) − H(a) = − = ! j !! j = i !! i ≥ bj log bj + j ! ai log ai i ! ai Pij log( ak Pkj ) + k ai k ak Pkj ai Pij log $   $ ! i,j ai   ai Pij log $ i,j 61 i,j bj (4.1) ! ai log ai (4.2) i (4.3) (4.4) 62 Entropy Rates of a Stochastic Process = 1 log = 0, m m (4.5) (4.6) where the inequality follows from the log sum inequality. (b) If the matrix is doubly stochastic, the substituting µ i = that it satisfies µ = µP . 1 m , we can easily check (c) If the uniform is a stationary distribution, then ! 1 1 ! = µi = µj Pji = Pji , m m j j or $ j (4.7) Pji = 1 or that the matrix is doubly stochastic. 2. Time’s arrow. Let {Xi }∞ i=−∞ be a stationary stochastic process. Prove that H(X0 |X−1 , X−2 , . . . , X−n ) = H(X0 |X1 , X2 , . . . , Xn ). In other words, the present has a conditional entropy given the past equal to the conditional entropy given the future. This is true even though it is quite easy to concoct stationary random processes for which the flow into the future looks quite different from the flow into the past. That is to say, one can determine the direction of time by looking at a sample function of the process. Nonetheless, given the present state, the conditional uncertainty of the next symbol in the future is equal to the conditional uncertainty of the previous symbol in the past. Solution: Time’s arrow. By the chain rule for entropy, H(X0 |X−1 , . . . , X−n ) = H(X0 , X−1 , . . . , X−n ) − H(X−1 , . . . , X−n ) = H(X0 , X1 , X2 , . . . , Xn ) − H(X1 , X2 , . . . , Xn ) = H(X0 |X1 , X2 , . . . , Xn ), (4.8) (4.9) (4.10) where (4.9) follows from stationarity. 3. Shuffles increase entropy. Argue that for any distribution on shuffles T and any distribution on card positions X that H(T X) ≥ H(T X|T ) = H(T if X and T are independent. −1 T X|T ) (4.11) (4.12) = H(X|T ) (4.13) = H(X), (4.14) 63 Entropy Rates of a Stochastic Process Solution: Shuffles increase entropy. H(T X) ≥ H(T X|T ) = H(T −1 T X|T ) (4.15) (4.16) = H(X|T ) (4.17) = H(X). 
(4.18) The inequality follows from the fact that conditioning reduces entropy and the first equality follows from the fact that given T , we can reverse the shuffle. 4. Second law of thermodynamics. Let X1 , X2 , X3 . . . be a stationary first-order Markov chain. In Section 4.4, it was shown that H(X n | X1 ) ≥ H(Xn−1 | X1 ) for n = 2, 3 . . . . Thus conditional uncertainty about the future grows with time. This is true although the unconditional uncertainty H(X n ) remains constant. However, show by example that H(Xn |X1 = x1 ) does not necessarily grow with n for every x 1 . Solution: Second law of thermodynamics. H(Xn |X1 ) ≤ H(Xn |X1 , X2 ) (Conditioning reduces entropy) = H(Xn |X2 ) (by Markovity) = H(Xn−1 |X1 ) (by stationarity) (4.19) (4.20) (4.21) Alternatively, by an application of the data processing inequality to the Markov chain X1 → Xn−1 → Xn , we have I(X1 ; Xn−1 ) ≥ I(X1 ; Xn ). (4.22) Expanding the mutual informations in terms of entropies, we have H(Xn−1 ) − H(Xn−1 |X1 ) ≥ H(Xn ) − H(Xn |X1 ). (4.23) By stationarity, H(Xn−1 ) = H(Xn ) and hence we have H(Xn−1 |X1 ) ≤ H(Xn |X1 ). (4.24) 5. Entropy of a random tree. Consider the following method of generating a random tree with n nodes. First expand the root node: “! ” ! ” ! Then expand one of the two terminal nodes at random: “! ” ! ” ! “! ” ! ” ! “! ” ! ” ! “! ! “” ! 64 Entropy Rates of a Stochastic Process At time k , choose one of the k − 1 terminal nodes according to a uniform distribution and expand it. Continue until n terminal nodes have been generated. Thus a sequence leading to a five node tree might look like this: “! “! “! “! ” ! ” ! ” ! ” ! ” ! ” ! ” ! ” ! “! “! “! ” ! ” ! ” !! ” ! ” ! ” “! “! ” ! ” ! ” ! ” ! “! ! “” ! Surprisingly, the following method of generating random trees yields the same probability distribution on trees with n terminal nodes. First choose an integer N 1 uniformly distributed on {1, 2, . . . , n − 1} . We then have the picture. “! ” ! ” ! N1 n − N1 Then choose an integer N2 uniformly distributed over {1, 2, . . . , N 1 − 1} , and independently choose another integer N3 uniformly over {1, 2, . . . , (n − N1 ) − 1} . The picture is now: #$ $$ ## # $$ # “! “! ” ! ” ! ” ! ” ! N2 N1 − N2 N3 n − N 1 − N3 Continue the process until no further subdivision can be made. (The equivalence of these two tree generation schemes follows, for example, from Polya’s urn model.) Now let Tn denote a random n -node tree generated as described. The probability distribution on such trees seems difficult to describe, but we can find the entropy of this distribution in recursive form. First some examples. For n = 2 , we have only one tree. Thus H(T 2 ) = 0 . For n = 3 , we have two equally probable trees: “! ” ! ” ! “! ” ! ” ! “! ” ! ” ! “! ! “” ! 65 Entropy Rates of a Stochastic Process Thus H(T3 ) = log 2 . For n = 4 , we have five possible trees, with probabilities 1/3, 1/6, 1/6, 1/6, 1/6. Now for the recurrence relation. Let N 1 (Tn ) denote the number of terminal nodes of Tn in the right half of the tree. Justify each of the steps in the following: H(Tn ) (a) = H(N1 , Tn ) (4.25) (b) H(N1 ) + H(Tn |N1 ) (4.26) = (c) = log(n − 1) + H(Tn |N1 ) (4.27) (d) = log(n − 1) + (4.28) (e) = log(n − 1) + = log(n − 1) + n−1 ! 1 [H(Tk ) + H(Tn−k )] n − 1 k=1 ! 2 n−1 H(Tk ). n − 1 k=1 ! 2 n−1 Hk . n − 1 k=1 (4.29) (4.30) (f) Use this to show that or (n − 1)Hn = nHn−1 + (n − 1) log(n − 1) − (n − 2) log(n − 2), (4.31) Hn Hn−1 = + cn , n n−1 (4.32) $ for appropriately defined cn . 
Since cn = c < ∞ , you have proved that n1 H(Tn ) converges to a constant. Thus the expected number of bits necessary to describe the random tree Tn grows linearly with n . Solution: Entropy of a random tree. (a) H(Tn , N1 ) = H(Tn ) + H(N1 |Tn ) = H(Tn ) + 0 by the chain rule for entropies and since N1 is a function of Tn . (b) H(Tn , N1 ) = H(N1 ) + H(Tn |N1 ) by the chain rule for entropies. (c) H(N1 ) = log(n − 1) since N1 is uniform on {1, 2, . . . , n − 1} . (d) H(Tn |N1 ) = = n−1 ! k=1 P (N1 = k)H(Tn |N1 = k) ! 1 n−1 H(Tn |N1 = k) n − 1 k=1 (4.33) (4.34) by the definition of conditional entropy. Since conditional on N 1 , the left subtree and the right subtree are chosen independently, H(T n |N1 = k) = H(Tk , Tn−k |N1 = 66 Entropy Rates of a Stochastic Process k) = H(Tk ) + H(Tn−k ) , so ! 1 n−1 (H(Tk ) + H(Tn−k )) . n − 1 k=1 H(Tn |N1 ) = (4.35) (e) By a simple change of variables, n−1 ! H(Tn−k ) = k=1 n−1 ! H(Tk ). (4.36) k=1 (f) Hence if we let Hn = H(Tn ) , (n − 1)Hn = (n − 1) log(n − 1) + 2 (n − 2)Hn−1 = (n − 2) log(n − 2) + 2 n−1 ! k=1 n−2 ! Hk (4.37) Hk (4.38) k=1 (4.39) Subtracting the second equation from the first, we get (n − 1)Hn − (n − 2)Hn−1 = (n − 1) log(n − 1) − (n − 2) log(n − 2) + 2H n−1 (4.40) or Hn n Hn−1 log(n − 1) (n − 2) log(n − 2) + − n−1 n n(n − 1) Hn−1 + Cn n−1 = = (4.41) (4.42) where Cn = = log(n − 1) (n − 2) log(n − 2) − n n(n − 1) log(n − 1) log(n − 2) 2 log(n − 2) − + n (n − 1) n(n − 1) (4.43) (4.44) Substituting the equation for H n−1 in the equation for Hn and proceeding recursively, we obtain a telescoping sum Hn n = = n ! j=3 n ! Cj + H2 2 2 log(j − 2) 1 + log(n − 1). j(j − 1) n j=3 (4.45) (4.46) 67 Entropy Rates of a Stochastic Process Since limn→∞ 1 n log(n − 1) = 0 Hn n→∞ n lim = ≤ = ∞ ! 2 log(j − 2) j(j − 1) j=3 ∞ ! 2 log(j − 1) (j − 1)2 j=3 ∞ ! 2 j=2 j2 log j (4.47) (4.48) (4.49) √ For sufficiently large j , log j ≤ j and hence the sum in (4.49) is dominated by the $ −3 sum j j 2 which converges. Hence the above sum converges. In fact, computer evaluation shows that lim ∞ Hn ! 2 = log(j − 2) = 1.736 bits. n j(j − 1) j=3 (4.50) Thus the number of bits required to describe a random n -node tree grows linearly with n. 6. Monotonicity of entropy per element. For a stationary stochastic process X 1 , X2 , . . . , Xn , show that (a) H(X1 , X2 , . . . , Xn ) H(X1 , X2 , . . . , Xn−1 ) ≤ . n n−1 (4.51) H(X1 , X2 , . . . , Xn ) ≥ H(Xn |Xn−1 , . . . , X1 ). n (4.52) (b) Solution: Monotonicity of entropy per element. (a) By the chain rule for entropy, H(X1 , X2 , . . . , Xn ) n = = = $n i=1 H(Xi |X i−1 ) n $n−1 H(Xn |X n−1 ) + i=1 H(Xi |X i−1 ) n H(Xn |X n−1 ) + H(X1 , X2 , . . . , Xn−1 ) . n From stationarity it follows that for all 1 ≤ i ≤ n , H(Xn |X n−1 ) ≤ H(Xi |X i−1 ), (4.53) (4.54) (4.55) 68 Entropy Rates of a Stochastic Process which further implies, by averaging both sides, that, H(Xn |X n−1 $n−1 H(Xi |X i−1 ) n−1 H(X1 , X2 , . . . , Xn−1 ) . n−1 ) ≤ i=1 = Combining (4.55) and (4.57) yields, H(X1 , X2 , . . . , Xn ) n ≤ = (4.56) (4.57) 5 4 1 H(X1 , X2 , . . . , Xn−1 ) + H(X1 , X2 , . . . , Xn−1 ) n n−1 H(X1 , X2 , . . . , Xn−1 ) . (4.58) n−1 (b) By stationarity we have for all 1 ≤ i ≤ n , which implies that H(Xn |X n−1 ) ≤ H(Xi |X i−1 ), $n i=1 H(Xn |X H(Xn |X n−1 ) = n−1 ) n $n i−1 ) i=1 H(Xi |X n H(X1 , X2 , . . . , Xn ) . n ≤ = (4.59) (4.60) (4.61) 7. Entropy rates of Markov chains. (a) Find the entropy rate of the two-state Markov chain with transition matrix P = " 1 − p01 p01 p10 1 − p10 # . (b) What v
