Login
Shannon's entropy
Definition (Discrete)
Let $X$ be a discrete random variable on a finite set $\mathcal{X}=\{x_1,\ldots,x_n\}$ , with probability distribution function $p(x) = \Pr(X=x)$ . The entropy $H(X)$ of $X$ is defined as \begin{equation} H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x). \end{equation}The convention $0 \log 0 =0$ is adopted in the definition. The logarithm is usually taken to the base 2, in which case the entropy is measured in ``bits,'' or to the base e, in which case $H(X)$ is measured in ``nats.''
If $X$ and $Y$ are random variables on $\mathcal{X}$ and $\mathcal{Y}$ respectively, the joint entropy of $X$ and $Y$ is$$ H(X,Y) = -\sum_{(x,y)\in \mathcal{X}\times \mathcal{Y}} p(x,y) \log_b p(x,y),$$ where $p(x,y)$ denote the joint distribution of $X$ and $Y$ .
Discussion
The Shannon entropy was first introduced by Shannon in 1948 in his landmark paper ``A Mathematical Theory of Communication.'' The entropy is a functional of the probability distribution function $p(x)$ , and is sometime written as$$ H(p(x_1), p(x_2),\ldots, p(x_n)).$$ It is noted that the entropy of $X$ does not depend on the actual values of $X$ , it only depends on $p(x)$ . The definition of Shannon's entropy can be written as an expectation$$ H(X) = -E[\log_b p(X)].$$ The quantity $-\log_b p(x)$ is interpreted as the information content of the outcome $x\in\mathcal{X}$ , and is also called the Hartley information of $x$ . Hence the Shannon's entropy is the average amount of information contained in random variable $X$ , it is also the uncertainty removed after the actual outcome of $X$ is revealed.Characterization
We write $H(X)$ as $H_n(p_1,\ldots,p_n)$ . The Shannon entropy satisfies the following properties.
- For any $n$ , $H_n(p_1,\ldots,p_n)$ is a continuous and symmetric function on variables $p_1$ , $p_2,\ldots, p_n$ .
- Event of probability zero does not contribute to the entropy, i.e. for any $n$ ,$$ H_{n+1}(p_1,\ldots,p_n,0) = H_n(p_1,\ldots,p_n).$$
- Entropy is maximized when the probability distribution is uniform. For all $n$ ,$$ H_n(p_1,\ldots,p_n) \leq H_n\Big(\frac{1}{n},\ldots,\frac{1}{n} \Big).$$ This follows from Jensen inequality,$$ H(X) = E\Big[\log_b \Big( \frac{1}{p(X)}\Big) \Big] \leq \log_b \Big( E\Big[ \frac{1}{p(X)} \Big] \Big) = \log_b(n).$$
- If $p_{ij}$ , $1\leq i \leq m$ , $1\leq j \leq n$ are non-negative real numbers summing up to one, and $q_i = \sum_{j=1}^n p_{ij}$ , then$$ H_{mn}(p_{11},\ldots, p_{mn}) = H_m(q_1,\ldots,q_m) + \sum_{i=1}^m q_i H_n\Big(\frac{p_{i1}}{q_i},\ldots, \frac{p_{in}}{q_i} \Big).$$ If we partition the $mn$ outcomes of the random experiment into $m$ groups, each group contains $n$ elements, we can do the experiment in two steps: first determine the group to which the actual outcome belongs to, and second find the outcome in this group. The probability that you will observe group $i$ is $q_i$ . The conditional probability distribution function given group $i$ is $(p_{i1}/q_i,\ldots,p_{in}/q_i)$ . The entropy$$ H_n\Big(\frac{p_{i1}}{q_i},\ldots, \frac{p_{in}}{q_i} \Big)$$ is the entropy of the probability distribution conditioned on group $i$ . Property 4 says that the total information is the sum of the information you gain in the first step, $H_m(q_1,\ldots, q_m)$ , and a weighted sum of the entropies conditioned on each group.
Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form:
where $k$ is a positive constant, essentially a choice of unit of measure.Definition (Continuous)
Entropy in the continuous case is called differential entropy.Discussion--Continuous Entropy
Despite its seductively analogous form, continuous entropy cannot be obtained as a limiting case of discrete entropy.We wish to obtain a generally finite measure as the ``bin size'' goes to zero. In the discrete case, the bin size is the (implicit) width of each of the $n$ (finite or infinite) bins/buckets/states whose probabilities are the $p_n$ . As we generalize to the continuous domain, we must make this width explicit.
To do this, start with a continuous function $f$ discretized as shown in the figure:
We will denote \begin{equation} H^{\Delta} \defined - \sum_{i=-\infty}^{\infty} \Delta f(x_i) \log \Delta f(x_i) \end{equation}and expanding the $\log$ we have
| (1) | ||
| (2) |
As $\Delta \to 0$ , we have
| and | (3) | |
| (4) |
This leads us to our definition of the differential entropy (continuous entropy): \begin{equation} h[f] = \lim_{\Delta \to 0} \left[H^{\Delta} + \log \Delta\right] = -\int_{-\infty}^{\infty} f(x) \log f(x) dx. \end{equation}
