## You are here

HomeShannon's entropy

## Primary tabs

# Shannonβs entropy

# Definition (Discrete)

Let $X$ be a discrete random variable on a finite set $\mathcal{X}=\{x_{1},\ldots,x_{n}\}$, with probability distribution function $p(x)=\Pr(X=x)$. The entropy $H(X)$ of $X$ is defined as

$H(X)=-\sum_{{x\in\mathcal{X}}}p(x)\log_{b}p(x).$ | (1) |

The convention $0\log 0=0$ is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in βbits,β or to the base e, in which case $H(X)$ is measured in βnats.β

If $X$ and $Y$ are random variables on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the joint entropy of $X$ and $Y$ is

$H(X,Y)=-\sum_{{(x,y)\in\mathcal{X}\times\mathcal{Y}}}p(x,y)\log_{b}p(x,y),$ |

where $p(x,y)$ denote the joint distribution of $X$ and $Y$.

# Discussion

The Shannon entropy was first introduced by Shannon in 1948 in his landmark paper βA Mathematical Theory of Communication.β The entropy is a functional of the probability distribution function $p(x)$, and is sometime written as

$H(p(x_{1}),p(x_{2}),\ldots,p(x_{n})).$ |

It is noted that the entropy of $X$ does not depend on the actual values of $X$, it only depends on $p(x)$. The definition of Shannonβs entropy can be written as an expectation

$H(X)=-E[\log_{b}p(X)].$ |

The quantity $-\log_{b}p(x)$ is interpreted as the information content of the outcome $x\in\mathcal{X}$, and is also called the Hartley information of $x$. Hence the Shannonβs entropy is the average amount of information contained in random variable $X$, it is also the uncertainty removed after the actual outcome of $X$ is revealed.

# Characterization

We write $H(X)$ as $H_{n}(p_{1},\ldots,p_{n})$. The Shannon entropy satisfies the following properties.

1. For any $n$, $H_{n}(p_{1},\ldots,p_{n})$ is a continuous and symmetric function on variables $p_{1}$, $p_{2},\ldots,p_{n}$.

2. 3. Entropy is maximized when the probability distribution is uniform. For all $n$,

$H_{n}(p_{1},\ldots,p_{n})\leq H_{n}\Big(\frac{1}{n},\ldots,\frac{1}{n}\Big).$ This follows from Jensen inequality,

$H(X)=E\Big[\log_{b}\Big(\frac{1}{p(X)}\Big)\Big]\leq\log_{b}\Big(E\Big[\frac{1% }{p(X)}\Big]\Big)=\log_{b}(n).$ 4. If $p_{{ij}}$, $1\leq i\leq m$, $1\leq j\leq n$ are non-negative real numbers summing up to one, and $q_{i}=\sum_{{j=1}}^{n}p_{{ij}}$, then

$H_{{mn}}(p_{{11}},\ldots,p_{{mn}})=H_{m}(q_{1},\ldots,q_{m})+\sum_{{i=1}}^{m}q% _{i}H_{n}\Big(\frac{p_{{i1}}}{q_{i}},\ldots,\frac{p_{{in}}}{q_{i}}\Big).$ If we partition the $mn$ outcomes of the random experiment into $m$ groups, each group contains $n$ elements, we can do the experiment in two steps: first determine the group to which the actual outcome belongs to, and second find the outcome in this group. The probability that you will observe group $i$ is $q_{i}$. The conditional probability distribution function given group $i$ is $(p_{{i1}}/q_{i},\ldots,p_{{in}}/q_{i})$. The entropy

$H_{n}\Big(\frac{p_{{i1}}}{q_{i}},\ldots,\frac{p_{{in}}}{q_{i}}\Big)$ is the entropy of the probability distribution conditioned on group $i$. Property 4 says that the total information is the sum of the information you gain in the first step, $H_{m}(q_{1},\ldots,q_{m})$, and a weighted sum of the entropies conditioned on each group.

Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form:

$H_{n}(p_{1},\ldots,p_{n})=-k\sum_{{i=1}}^{n}p_{i}\log p_{i}$ |

where $k$ is a positive constant, essentially a choice of unit of measure.

# Definition (Continuous)

Entropy in the continuous case is called *differential entropy*.

# DiscussionβContinuous Entropy

Despite its seductively analogous form, continuous entropy cannot be obtained as a limiting case of discrete entropy.

We wish to obtain a generally finite measure as the βbin sizeβ goes to zero. In the discrete case, the bin size is the (implicit) width of each of the $n$ (finite or infinite) bins/buckets/states whose probabilities are the $p_{n}$. As we generalize to the continuous domain, we must make this width explicit.

To do this, start with a continuous function $f$ discretized as shown in the figure:

As the figure indicates, by the mean-value theorem there exists a value $x_{i}$ in each bin such that

$f(x_{i})\Delta=\int_{{i\Delta}}^{{(i+1)\Delta}}f(x)dx$ | (2) |

and thus the integral of the function $f$ can be approximated (in the Riemannian sense) by

$\int_{{-\infty}}^{{\infty}}f(x)dx=\lim_{{\Delta\to 0}}\sum_{{i=-\infty}}^{{% \infty}}f(x_{i})\Delta$ | (3) |

where this limit and βbin size goes to zeroβ are equivalent.

We will denote

$H^{{\Delta}}:=-\sum_{{i=-\infty}}^{{\infty}}\Delta f(x_{i})\log\Delta f(x_{i})$ | (4) |

and expanding the $\log$ we have

$\displaystyle H^{{\Delta}}$ | $\displaystyle=-\sum_{{i=-\infty}}^{{\infty}}\Delta f(x_{i})\log\Delta f(x_{i})$ | (5) | ||

$\displaystyle=-\sum_{{i=-\infty}}^{{\infty}}\Delta f(x_{i})\log f(x_{i})-\sum_% {{i=-\infty}}^{{\infty}}f(x_{i})\Delta\log\Delta.$ | (6) |

As $\Delta\to 0$, we have

$\displaystyle\sum_{{i=-\infty}}^{{\infty}}f(x_{i})\Delta$ | $\displaystyle\to\int f(x)dx=1\qquad\text{and}$ | (7) | ||

$\displaystyle\sum_{{i=-\infty}}^{{\infty}}\Delta f(x_{i})\log f(x_{i})$ | $\displaystyle\to\int f(x)\log f(x)dx$ | (8) |

This leads us to our definition of the differential entropy (continuous entropy):

$h[f]=\lim_{{\Delta\to 0}}\left[H^{{\Delta}}+\log\Delta\right]=-\int_{{-\infty}% }^{{\infty}}f(x)\log f(x)dx.$ | (9) |

## Mathematics Subject Classification

94A17*no label found*

- Forums
- Planetary Bugs
- HS/Secondary
- University/Tertiary
- Graduate/Advanced
- Industry/Practice
- Research Topics
- LaTeX help
- Math Comptetitions
- Math History
- Math Humor
- PlanetMath Comments
- PlanetMath System Updates and News
- PlanetMath help
- PlanetMath.ORG
- Strategic Communications Development
- The Math Pub
- Testing messages (ignore)

- Other useful stuff

## Recent Activity

new problem: Geometry by parag

Aug 24

new question: Scheduling Algorithm by ncovella

new question: Scheduling Algorithm by ncovella

## Corrections

suggestion by akrowne β

Explicit definition in the continuous case. by Dr_Absentius β

classification by yark β

missing word by Mathprof β

## Comments

## proof

I suppose I should post the proof as well...later.

## Re: proof

before someone else does =]

-apk

## Proof - external link

The proof can eventually be found at

http://www.cs.brandeis.edu/~cs170a/Handouts/axiomatics.pdf