# 25. Etymology of Entropy#

This lecture describes and compares several notions of entropy.

Among the senses of entropy, we’ll encounter these

A measure of

**uncertainty**of a random variable advanced by Claude Shannon [SW49]A key object governing thermodynamics

Kullback and Leibler’s measure of the statistical divergence between two probability distributions

A measure of the volatility of stochastic discount factors that appear in asset pricing theory

Measures of unpredictability that occur in classical Wiener-Kolmogorov linear prediction theory

A frequency domain criterion for constructing robust decision rules

The concept of entropy plays an important role in robust control formulations described in this lecture Risk and Model Uncertainty and in this lecture Robustness.

## 25.1. Information Theory#

In information theory [SW49], entropy is a measure of the unpredictability of a random variable.

To illustrate things, let \(X\) be a discrete random variable taking values \(x_1, \ldots, x_n\) with probabilities \(p_i = \textrm{Prob}(X = x_i) \geq 0, \sum_i p_i =1\).

Claude Shannon’s [SW49] definition of entropy is

where \(\log_b\) denotes the log function with base \(b\).

Inspired by the limit

we set \(p \log p = 0\) in equation (25.1).

Typical bases for the logarithm are \(2\), \(e\), and \(10\).

In the information theory literature, logarithms of base \(2\), \(e\), and \(10\) are associated with units of information called bits, nats, and dits, respectively.

Shannon typically used base \(2\).

## 25.2. A Measure of Unpredictability#

For a discrete random variable \(X\) with probability density \(p = \{p_i\}_{i=1}^n\), the **surprisal**
for state \(i\) is \( s_i = \log\left(\frac{1}{p_i}\right) \).

The quantity \( \log\left(\frac{1}{p_i}\right) \) is called the **surprisal** because it is inversely related to the likelihood that state
\(i\) will occur.

Note that entropy \(H(p)\) equals the **expected surprisal**

### 25.2.1. Example#

Take a possibly unfair coin, so \(X = \{0,1\}\) with \(p = {\rm Prob}(X=1) = p \in [0,1]\).

Then

Evidently,

at \(p=.5\) and \(H''(p) = -\frac{1}{1-p} -\frac{1}{p} < 0\) for \(p\in (0,1)\).

So \(p=.5\) maximizes entropy, while entropy is minimized at \(p=0\) and \(p=1\).

Thus, among all coins, a fair coin is the most unpredictable.

See Fig. 25.1

### 25.2.2. Example#

Take an \(n\)-sided possibly unfair die with a probability distribution \(\{p_i\}_{i=1}^n\). The die is fair if \(p_i = \frac{1}{n} \forall i\).

Among all dies, a fair die maximizes entropy.

For a fair die, entropy equals \(H(p) = - n^{-1} \sum_i \log \left( \frac{1}{n} \right) = \log(n)\).

To specify the expected number of bits needed to isolate the outcome of one roll of a fair \(n\)-sided die requires \(\log_2 (n)\) bits of information.

For example, if \(n=2\), \(\log_2(2) =1\).

For \(n=3\), \(\log_2(3) = 1.585\).

## 25.3. Mathematical Properties of Entropy#

For a discrete random variable with probability vector \(p\), entropy \(H(p)\) is a function that satisfies

\(H\) is

*continuous*.\(H\) is

*symmetric*: \(H(p_1, p_2, \ldots, p_n) = H(p_{r_1}, \ldots, p_{r_n})\) for any permutation \(r_1, \ldots, r_n\) of \(1,\ldots, n\).A uniform distribution maximizes \(H(p)\): \( H(p_1, \ldots, p_n) \leq H(\frac{1}{n}, \ldots, \frac{1}{n}) .\)

Maximum entropy increases with the number of states: \( H(\frac{1}{n}, \ldots, \frac{1}{n} ) \leq H(\frac{1}{n+1} , \ldots, \frac{1}{n+1})\).

Entropy is not affected by events zero probability.

## 25.4. Conditional Entropy#

Let \((X,Y)\) be a bivariate discrete random vector with outcomes \(x_1, \ldots, x_n\) and \(y_1, \ldots, y_m\), respectively, occurring with probability density \(p(x_i, y_i)\).

Conditional entropy \(H(X| Y)\) is defined as

Here \(\frac{p(y_j)}{p(x_i,y_j)}\), the reciprocal of the conditional probability of \(x_i\) given \(y_j\), can be defined as the **conditional surprisal**.

## 25.5. Independence as Maximum Conditional Entropy#

Let \(m=n\) and \([x_1, \ldots, x_n ] = [y_1, \ldots, y_n]\).

Let \(\sum_j p(x_i,y_j) = \sum_j p(x_j, y_i) \) for all \(i\), so that the marginal distributions of \(x\) and \(y\) are identical.

Thus, \(x\) and \(y\) are identically distributed, but they are not necessarily independent.

Consider the following problem: choose a joint distribution \(p(x_i,y_j)\) to maximize conditional entropy (25.2) subject to the restriction that \(x\) and \(y\) are identically distributed.

The conditional-entropy-maximizing \(p(x_i,y_j)\) sets

Thus, among all joint distributions with identical marginal distributions, the conditional entropy maximizing joint distribution makes \(x\) and \(y\) be independent.

## 25.6. Thermodynamics#

Josiah Willard Gibbs (see https://en.wikipedia.org/wiki/Josiah_Willard_Gibbs) defined entropy as

where \(p_i\) is the probability of a micro state and \(k_B\) is Boltzmann’s constant.

The Boltzmann constant \(k_b\) relates energy at the micro particle level with the temperature observed at the macro level. It equals what is called a gas constant divided by an Avogadro constant.

The second law of thermodynamics states that the entropy of a closed physical system increases until \(S\) defined in (25.3) attains a maximum.

## 25.7. Statistical Divergence#

Let \(X\) be a discrete state space \(x_1, \ldots, x_n\) and let \(p\) and \(q\) be two discrete probability distributions on \(X\).

Assume that \(\frac{p_i}{q_t} \in (0,\infty)\) for all \(i\) for which \(p_i >0\).

Then the Kullback-Leibler statistical divergence, also called **relative entropy**,
is defined as

Evidently,

where \(H(p,q) = \sum_i p_i \log q_i\) is the cross-entropy.

It is easy to verify, as we have done above, that \( D(p|q) \geq 0\) and that \(D(p|q) = 0\) implies that \(p_i = q_i\) when \(q_i >0\).

## 25.8. Continuous distributions#

For a continuous random variable, Kullback-Leibler divergence between two densities \(p\) and \(q\) is defined as

## 25.9. Relative entropy and Gaussian distributions#

We want to compute relative entropy for two continuous densities \(\phi\) and \(\hat \phi\) when \(\phi\) is \({\cal N}(0,I)\) and \({\hat \phi}\) is \({\cal N}(w, \Sigma)\), where the covariance matrix \(\Sigma\) is nonsingular.

We seek a formula for

**Claim**

**Proof**

The log likelihood ratio is

Observe that

Applying the identity \(\varepsilon = w + (\varepsilon - w)\) gives

Taking mathematical expectations

Combining terms gives

which agrees with equation (25.5). Notice the separate appearances of the mean distortion \(w\) and the covariance distortion \(\Sigma - I\) in equation (25.7).

**Extension**

Let \(N_0 = {\mathcal N}(\mu_0,\Sigma_0)\) and \(N_1={\mathcal N}(\mu_1, \Sigma_1)\) be two multivariate Gaussian distributions.

Then

## 25.10. Von Neumann Entropy#

Let \(P\) and \(Q\) be two positive-definite symmetric matrices.

A measure of the divergence between two \(P\) and \(Q\) is

where the log of a matrix is defined here (https://en.wikipedia.org/wiki/Logarithm_of_a_matrix).

A density matrix \(P\) from quantum mechanics is a positive definite matrix with trace \(1\).

The von Neumann entropy of a density matrix \(P\) is

## 25.11. Backus-Chernov-Zin Entropy#

After flipping signs, [BCZ14] use Kullback-Leibler relative entropy as a measure of volatility of stochastic discount factors that they assert is useful for characterizing features of both the data and various theoretical models of stochastic discount factors.

Where \(p_{t+1}\) is the physical or true measure, \(p_{t+1}^*\) is the risk-neutral measure, and \(E_t\) denotes conditional expectation under the \(p_{t+1}\) measure, [BCZ14] define entropy as

Evidently, by virtue of the minus sign in equation (25.9),

where \(D_{KL,t}\) denotes conditional relative entropy.

Let \(m_{t+1}\) be a stochastic discount factor, \(r_{t+1}\) a gross one-period return on a risky security, and \((r_{t+1}^1)^{-1}\equiv q_t^1 = E_t m_{t+1}\) be the reciprocal of a risk-free one-period gross rate of return. Then

[BCZ14] note that a stochastic discount factor satisfies

They derive the following **entropy bound**

which they propose as a complement to a Hansen-Jagannathan [HJ91] bound.

## 25.12. Wiener-Kolmogorov Prediction Error Formula as Entropy#

Let \(\{x_t\}_{t=-\infty}^\infty\) be a covariance stationary stochastic process with mean zero and spectral density \(S_x(\omega)\).

The variance of \(x\) is

As described in chapter XIV of [Sar87], the Wiener-Kolmogorov formula for the one-period ahead prediction error is

Occasionally the logarithm of the one-step-ahead prediction error \(\sigma_\epsilon^2\) is called entropy because it measures unpredictability.

Consider the following problem reminiscent of one described earlier.

**Problem:**

Among all covariance stationary univariate processes with unconditional variance \(\sigma_x^2\), find a process with maximal one-step-ahead prediction error.

The maximizer is a process with spectral density

Thus, among all univariate covariance stationary processes with variance \(\sigma_x^2\), a process with a flat spectral density is the most uncertain, in the sense of one-step-ahead prediction error variance.

This no-patterns-across-time outcome for a temporally dependent process resembles the no-pattern-across-states outcome for the static entropy maximizing coin or die in the classic information theoretic analysis described above.

## 25.13. Multivariate Processes#

Let \(y_t\) be an \(n \times 1\) covariance stationary stochastic process with mean \(0\) with matrix covariogram \(C_y(j) = E y_t y_{t-j}' \) and spectral density matrix

Let

be a Wold representation for \(y\), where \(D(0)\epsilon_t\) is a vector of one-step-ahead errors in predicting \(y_t\) conditional on the infinite history \(y^{t-1} = [y_{t-1}, y_{t-2}, \ldots ]\) and \(\epsilon_t\) is an \(n\times 1\) vector of serially uncorrelated random disturbances with mean zero and identity contemporaneous covariance matrix \(E \epsilon_t \epsilon_t' = I\).

Linear-least-squares predictors have one-step-ahead prediction error \(D(0) D(0)'\) that satisfies

Being a measure of the unpredictability of an \(n \times 1\) vector covariance stationary stochastic process, the left side of (25.12) is sometimes called entropy.

## 25.14. Frequency Domain Robust Control#

Chapter 8 of [HS08b] adapts work in the control theory literature to define a
**frequency domain entropy** criterion for robust control as

where \(\theta \in (\underline \theta, +\infty)\) is a positive robustness parameter and \(G_F(\zeta)\) is a \(\zeta\)-transform of the objective function.

Hansen and Sargent [HS08b] show that criterion (25.13) can be represented as

for an appropriate covariance stationary stochastic process derived from \(\theta, G_F(\zeta)\).

This explains the
moniker **maximum entropy** robust control for decision rules \(F\) designed to maximize criterion (25.13).

## 25.15. Relative Entropy for a Continuous Random Variable#

Let \(x\) be a continuous random variable with density \(\phi(x)\), and let \(g(x) \) be a nonnegative random variable satisfying \(\int g(x) \phi(x) dx =1\).

The relative entropy of the distorted density \(\hat \phi(x) = g(x) \phi(x)\) is defined as

Fig. 25.2 plots the functions \(g \log g\) and \(g -1\) over the interval \(g \geq 0\).

That relative entropy \(\textrm{ent}(g) \geq 0\) can be established by noting (a) that \(g \log g \geq g-1\) (see Fig. 25.2) and (b) that under \(\phi\), \(E g =1\).

Fig. 25.3 and Fig. 25.4 display aspects of relative entropy visually for a continuous random variable \(x\) for two densities with likelihood ratio \(g \geq 0\).

Where the numerator density is \({\mathcal N}(0,1)\), for two denominator Gaussian densities \({\mathcal N}(0,1.5)\) and \({\mathcal N}(0,.95)\), respectively, Fig. 25.3 and Fig. 25.4 display the functions \(g \log g\) and \(g -1\) as functions of \(x\).