# Shannon's entropy

# Definition (Discrete)

Let $X$ be a discrete random variable on a finite set $\mathcal{X}=\{x_{1},\ldots,x_{n}\}$, with probability distribution function $p(x)=\Pr(X=x)$. The entropy $H(X)$ of $X$ is defined as

$H(X)=-\sum_{x\in\mathcal{X}}p(x)\log_{b}p(x).$ (1)

The convention $0\log 0=0$ is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in "bits," or to the base $e$, in which case $H(X)$ is measured in "nats."
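For instance, (1) can be evaluated directly. The following Python sketch (the helper name `entropy` is our own, not part of the original entry) computes the entropy of a fair coin under both the base-2 and base-$e$ conventions:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p(x) log_b p(x), using 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin: maximal uncertainty for two outcomes.
print(entropy([0.5, 0.5]))               # 1.0 bit
print(entropy([0.5, 0.5], base=math.e))  # ln 2 ≈ 0.693 nats

# A certain outcome carries no uncertainty; the p = 0 term is
# dropped by the 0 log 0 = 0 convention.
print(entropy([1.0, 0.0]))
```

Changing the base only rescales the entropy by a constant factor, as the two coin values above illustrate.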

If $X$ and $Y$ are random variables on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the joint entropy of $X$ and $Y$ is

$H(X,Y)=-\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}p(x,y)\log_{b}p(x,y),$

where $p(x,y)$ denotes the joint distribution of $X$ and $Y$.
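As a quick sanity check on this definition (not part of the original entry), for independent $X$ and $Y$ the joint entropy is additive: $H(X,Y)=H(X)+H(Y)$. A minimal Python sketch, with a hypothetical `joint_entropy` helper:

```python
import math

def joint_entropy(pxy, base=2):
    """H(X,Y) = -sum over (x,y) of p(x,y) log_b p(x,y), with 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for row in pxy for p in row if p > 0)

# Joint table of two independent fair coins: p(x,y) = 1/4 everywhere,
# so H(X,Y) = H(X) + H(Y) = 1 + 1 = 2 bits.
pxy = [[0.25, 0.25],
       [0.25, 0.25]]
print(joint_entropy(pxy))
```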

# Discussion

The Shannon entropy was first introduced by Shannon in 1948 in his landmark paper "A Mathematical Theory of Communication." The entropy is a functional of the probability distribution function $p(x)$, and is sometimes written as

$H(p(x_{1}),p(x_{2}),\ldots,p(x_{n})).$

Note that the entropy of $X$ does not depend on the actual values taken by $X$; it depends only on $p(x)$. The definition of Shannon's entropy can be written as an expectation:

$H(X)=-E[\log_{b}p(X)].$

The quantity $-\log_{b}p(x)$ is interpreted as the information content of the outcome $x\in\mathcal{X}$, and is also called the Hartley information of $x$. Hence Shannon's entropy is the average amount of information contained in the random variable $X$; equivalently, it is the uncertainty removed once the actual outcome of $X$ is revealed.
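The expectation form can also be checked empirically: averaging the Hartley information $-\log_{b}p(x)$ over a large sample of outcomes converges to $H(X)$. A Python sketch (the three-symbol distribution is a made-up example):

```python
import math
import random

# A hypothetical distribution over three symbols.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}

# Exact entropy from the definition: 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits.
H = -sum(p * math.log2(p) for p in probs.values())

# Monte Carlo estimate of E[-log2 p(X)]: average the information
# content of sampled outcomes.
random.seed(0)
samples = random.choices(list(probs), weights=list(probs.values()), k=100_000)
H_est = sum(-math.log2(probs[x]) for x in samples) / len(samples)

print(H)      # 1.5
print(H_est)  # close to 1.5
```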

# Characterization

We write $H(X)$ as $H_{n}(p_{1},\ldots,p_{n})$. The Shannon entropy satisfies the following properties.

1. For any $n$, $H_{n}(p_{1},\ldots,p_{n})$ is a continuous and symmetric function of the variables $p_{1}$, $p_{2},\ldots,p_{n}$.

2. An event of probability zero does not contribute to the entropy: for all $n$, $H_{n+1}(p_{1},\ldots,p_{n},0)=H_{n}(p_{1},\ldots,p_{n})$.

3. Entropy is maximized when the probability distribution is uniform. For all $n$,

$H_{n}(p_{1},\ldots,p_{n})\leq H_{n}\Big(\frac{1}{n},\ldots,\frac{1}{n}\Big).$

This follows from Jensen's inequality:

$H(X)=E\Big[\log_{b}\Big(\frac{1}{p(X)}\Big)\Big]\leq\log_{b}\Big(E\Big[\frac{1}{p(X)}\Big]\Big)=\log_{b}(n).$

4. If $p_{ij}$, $1\leq i\leq m$, $1\leq j\leq n$, are non-negative real numbers summing to one, and $q_{i}=\sum_{j=1}^{n}p_{ij}$, then

$H_{mn}(p_{11},\ldots,p_{mn})=H_{m}(q_{1},\ldots,q_{m})+\sum_{i=1}^{m}q_{i}H_{n}\Big(\frac{p_{i1}}{q_{i}},\ldots,\frac{p_{in}}{q_{i}}\Big).$

If we partition the $mn$ outcomes of the random experiment into $m$ groups of $n$ elements each, we can carry out the experiment in two steps: first determine the group to which the actual outcome belongs, and then find the outcome within that group. The probability of observing group $i$ is $q_{i}$, and the conditional probability distribution given group $i$ is $(p_{i1}/q_{i},\ldots,p_{in}/q_{i})$. The entropy

$H_{n}\Big(\frac{p_{i1}}{q_{i}},\ldots,\frac{p_{in}}{q_{i}}\Big)$ is the entropy of the probability distribution conditioned on group $i$. Property 4 says that the total information is the sum of the information gained in the first step, $H_{m}(q_{1},\ldots,q_{m})$, and a weighted sum of the entropies conditioned on each group.
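Property 4 is easy to verify numerically. The following Python sketch (the $2\times 3$ table of $p_{ij}$ is an arbitrary made-up example) checks the grouping identity:

```python
import math

def H(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary 2x3 table p_ij summing to one (m = 2 groups of n = 3).
p = [[0.1, 0.2, 0.1],
     [0.3, 0.2, 0.1]]
q = [sum(row) for row in p]  # group probabilities q_i

# Left side: entropy of all mn outcomes at once.
lhs = H([pij for row in p for pij in row])
# Right side: entropy of the group choice plus the weighted
# entropies of the within-group conditional distributions.
rhs = H(q) + sum(qi * H([pij / qi for pij in row])
                 for row, qi in zip(p, q))

print(abs(lhs - rhs) < 1e-12)  # True: both sides agree
```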

Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form:

$H_{n}(p_{1},\ldots,p_{n})=-k\sum_{i=1}^{n}p_{i}\log p_{i},$

where $k$ is a positive constant, essentially a choice of unit of measure.

# Definition (Continuous)

Entropy in the continuous case is called *differential entropy*.

# Discussion: Continuous Entropy

Despite its seductively analogous form, continuous entropy cannot be obtained as a limiting case of discrete entropy.

We wish to obtain a generally finite measure as the "bin size" goes to zero. In the discrete case, the bin size is the (implicit) width of each of the $n$ (finite or infinite) bins/buckets/states whose probabilities are the $p_{n}$. As we generalize to the continuous domain, we must make this width explicit.

To do this, start with a continuous function $f$ discretized into bins of width $\Delta$, as shown in the figure:

As the figure indicates, by the mean-value theorem there exists a value $x_{i}$ in each bin such that

$f(x_{i})\Delta=\int_{i\Delta}^{(i+1)\Delta}f(x)\,dx$ (2)

and thus the integral of the function $f$ can be approximated (in the Riemann sense) by

$\int_{-\infty}^{\infty}f(x)\,dx=\lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty}f(x_{i})\Delta,$ (3)

where this limit and "bin size goes to zero" are equivalent.

We will denote

$H^{\Delta}:=-\sum_{i=-\infty}^{\infty}\Delta f(x_{i})\log\big(\Delta f(x_{i})\big)$ (4)

and expanding the $\log$ we have

$H^{\Delta}=-\sum_{i=-\infty}^{\infty}\Delta f(x_{i})\log\big(\Delta f(x_{i})\big)$ (5)

$=-\sum_{i=-\infty}^{\infty}\Delta f(x_{i})\log f(x_{i})-\sum_{i=-\infty}^{\infty}f(x_{i})\Delta\log\Delta.$ (6)

As $\Delta\to 0$, we have

$\sum_{i=-\infty}^{\infty}f(x_{i})\Delta\to\int f(x)\,dx=1\qquad\text{and}$ (7)

$\sum_{i=-\infty}^{\infty}\Delta f(x_{i})\log f(x_{i})\to\int f(x)\log f(x)\,dx.$ (8)

This leads us to our definition of the differential entropy (continuous entropy):

$h[f]=\lim_{\Delta\to 0}\left[H^{\Delta}+\log\Delta\right]=-\int_{-\infty}^{\infty}f(x)\log f(x)\,dx.$ (9)
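The limit in (9) can be watched numerically. In the following Python sketch we pick the density $f(x)=2x$ on $[0,1]$ (an arbitrary example with the closed form $h[f]=1/2-\ln 2\approx-0.193$ nats, a reminder that differential entropy can be negative); the helper `h_delta`, our own name, computes $H^{\Delta}$ of (4):

```python
import math

# Density f(x) = 2x on [0, 1]; closed form: h[f] = 1/2 - ln 2 nats.
h_exact = 0.5 - math.log(2)

def h_delta(delta):
    """H^Delta of equation (4) for f(x) = 2x binned at width delta, in nats."""
    n = round(1 / delta)
    total = 0.0
    for i in range(n):
        # By the mean-value theorem the mass of bin i is f(x_i)*delta,
        # i.e. the exact integral of 2x over [i*delta, (i+1)*delta].
        mass = ((i + 1) * delta) ** 2 - (i * delta) ** 2
        total -= mass * math.log(mass)
    return total

for delta in [0.1, 0.01, 0.001]:
    # H^Delta alone grows like -log(delta); the corrected value
    # H^Delta + log(delta) approaches h[f].
    print(delta, h_delta(delta), h_delta(delta) + math.log(delta))
```

As $\Delta$ shrinks, the middle column diverges while the last column settles near $-0.193$, exactly as (9) predicts.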

## Mathematics Subject Classification

94A17




## Proof - external link

The proof can eventually be found at

http://www.cs.brandeis.edu/~cs170a/Handouts/axiomatics.pdf