Information theory for machine learning

Why is it that courses on machine learning tend to start by covering some basic concepts from information theory? This post is a short explainer of why information theory is relevant to the formal study of machine learning.

Machine learning as model-fitting

If we want to get anywhere, we need to start with a solid concept of what “machine learning” actually means. Basically, we want a computer to be able to process and utilize complicated patterns in data. This is, of course, what we do whenever we fit a statistical model to data, and it turns out that the sorts of algorithms used for machine learning are really just complicated statistical models: neural networks, for example, are in their simplest form a stack of linear regressions composed with simple nonlinear functions.
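To make the “stacked linear regressions” picture concrete, here is a minimal sketch with made-up weights and dimensions: a tiny two-layer network is just two linear-regression-style maps with a nonlinearity applied in between.

```python
import numpy as np

# A minimal sketch with made-up dimensions: two linear maps with a
# nonlinearity in between form a tiny two-layer neural network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # first "linear regression" (3 inputs -> 4 hidden units)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # second "linear regression" (4 hidden units -> 1 output)

def predict(x):
    h = np.maximum(0, W1 @ x + b1)   # linear map, then a ReLU nonlinearity
    return W2 @ h + b2               # another linear map stacked on top

print(predict(np.array([1.0, 2.0, 3.0])))
```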

Let’s make this more explicit: a machine learning model is a highly flexible statistical model – it has many parameters, which allow it to fit a variety of statistical distributions. When we train a machine learning model, we are finding the parameters that make the model’s distribution approximate the distribution of the training data as closely as possible. (Note that by statistical model, we mean a description of a probability distribution, e.g., a specification of a probability density function.)

In a typical case, we have data $X$ with labels $Y$, related by a conditional distribution $P_{Y|X}$. If we want to create a model that predicts labels from data, we introduce a model $\hat P_{Y|X;\theta}$, and choose the parameters $\theta$ to make $\hat P_{Y|X;\theta}$ as close as possible to $P_{Y|X}$.

Entropy

The natural question, of course, is what it means for one distribution to be “as close as possible” to another distribution. This is where information theory starts showing its usefulness. Intuitively, two distributions are “close” if they convey similar information, i.e., if sampling from one allows us to learn more or less the same thing as sampling from the other. This suggests that the metric we can use to measure distributional “closeness” should be something related to information content.

And now the concept of entropy enters the stage. The entropy of a distribution measures the average amount of information conveyed by a sample from it – equivalently, how unpredictable the distribution is. Formally, the entropy of a random variable $X$ with p.d.f. $p_X$ is $H(X) = -\mathbb{E}[\log p_X(X)]$. We don’t typically talk about the information content of statistical distributions, so this concept can seem a bit mysterious at first, and it’s worth thinking about what entropy really means in a statistical context. It may be helpful to think in terms of how “predictable” a distribution is: highly “predictable” distributions have low entropy, while “unpredictable” distributions have high entropy. A distribution concentrated on just a few values is “predictable” and therefore has low entropy, while a distribution that can take many values with non-negligible probabilities is “unpredictable” and has relatively high entropy.
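As a quick illustration, here is a small sketch computing the entropy (in nats) of two hypothetical discrete distributions, one concentrated on a single value and one uniform; the `entropy` helper is made up for this example and simply applies the definition above.

```python
import numpy as np

def entropy(probs):
    """Entropy (in nats) of a discrete distribution, straight from H(X) = -E[log p_X(X)]."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]             # outcomes with zero probability contribute nothing
    return -np.sum(probs * np.log(probs))

# A "predictable" distribution concentrated on one value: low entropy.
print(entropy([0.98, 0.01, 0.01]))       # roughly 0.11 nats

# An "unpredictable" uniform distribution over six values: high entropy.
print(entropy([1 / 6] * 6))              # log(6), roughly 1.79 nats
```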

Conditional entropy

What if we want to describe how much information one distribution gives us about another? Here there are a few different concepts that we can make use of. First, we have conditional entropy, formally defined as

$$H(Y|X) = -\mathbb{E}[\log p_{Y|X}(Y|X)].$$

Since

$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)},$$

we can decompose the conditional entropy of $Y|X$ into the difference between the entropy of the joint distribution of $X$ and $Y$ and the entropy of $X$ alone: $H(Y|X) = H([X, Y]) - H(X)$. Intuitively, conditional entropy tells us how predictable $Y$ is, given that we already know $X$. As an example, suppose that $X$ describes the outcome of the roll of a die, and $Y$ describes whether the same die roll is even or odd. If we know $X$, then we also know $Y$, i.e., $Y$ is completely predictable given $X$; therefore, $H(Y|X)$ is as low as possible, namely zero. On the other hand, if $X$ gives us no information about $Y$, then $H(Y|X) = H(Y)$.
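Here is a small numerical check of the die-roll example under the same assumptions (a fair six-sided die, with parity fully determined by the roll); the `entropy` helper is again just the definition applied to a probability vector.

```python
import numpy as np

def entropy(probs):
    """Entropy (in nats) of a discrete distribution given as a probability vector."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

# Joint distribution p(x, y): each face x of a fair die has probability 1/6,
# and its parity y is determined by x, so each (x, y) pair has probability 1/6 or 0.
p_joint = np.zeros((6, 2))
for x in range(6):
    y = (x + 1) % 2                      # parity of the face value x + 1
    p_joint[x, y] = 1 / 6

p_x = p_joint.sum(axis=1)                # marginal distribution of the die roll

H_cond = entropy(p_joint.ravel()) - entropy(p_x)   # H(Y|X) = H([X, Y]) - H(X)
print(H_cond)                            # 0: Y is completely predictable given X
print(entropy(p_joint.sum(axis=0)))      # H(Y) = log(2): on its own, Y is a fair coin flip
```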

Conditional entropy is useful for thinking about the relationship between distributions, but only in situations where we know the joint distribution of the random variables involved. That is, if we have some model $p_{X, Y}$ for the joint distribution of $X$ and $Y$, then we can use conditional entropy to think about how much information $X$ gives about $Y$ and vice versa. However, if the goal is instead to compare alternative models for the same data, we need a different tool. This is where cross-entropy, a key concept in machine learning, comes in.

Cross-entropy

Here we return to the canonical example presented in the first section: we have data $X$ with labels $Y$, with a conditional distribution described by $P_{Y|X}$, and we want to find parameters $\theta$ so that our model $\hat P_{Y|X;\theta}$ is “as close as possible” to $P_{Y|X}$. The typical way we describe this is with the cross-entropy

$$H(P, \hat P) := -\mathbb{E}_{Y|X \sim P}\left[\log \hat P_{Y|X;\theta}(Y|X)\right].$$

Notice that this is similar to the definition of entropy, except that here we sample from the population distribution while using our model distribution inside the expectation.

(Also note that this is different from joint entropy, which I denoted by $H([X, Y])$ earlier. Sometimes people use the same notation for both, which can be confusing.)

Why is this useful? When we are doing machine learning, we don’t actually know the form of the population distribution $P$ – that’s what we’re trying to figure out – so we can’t plug anything into its p.d.f.; however, we can sample from it by sampling from our training data. In training the model $\hat P$, we want $\hat P$ to assign as much probability as possible to the data we actually observe, so we choose $\theta$ to minimize $H(P, \hat P)$, estimated by averaging $-\log \hat P_{Y|X;\theta}(Y|X)$ over the training samples.
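As a sketch of what this looks like in practice, here is a made-up example: the “population” below is a hypothetical logistic relationship between a scalar feature and a binary label, we only ever touch it through samples, and we pick $\theta$ by gradient descent on the estimated cross-entropy (the average negative log-likelihood over the training set).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: P(Y = 1 | x) = sigmoid(2x - 1). We never evaluate
# this p.d.f. during training; we only draw (x, y) samples from it.
n = 2000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(2 * x - 1))))

def estimated_cross_entropy(theta, x, y):
    """Monte Carlo estimate of H(P, P_hat) for the logistic model P_hat(Y=1|x) = sigmoid(theta[0] + theta[1]*x)."""
    p_hat = 1 / (1 + np.exp(-(theta[0] + theta[1] * x)))
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Plain gradient descent on the estimated cross-entropy.
theta = np.zeros(2)
for _ in range(500):
    p_hat = 1 / (1 + np.exp(-(theta[0] + theta[1] * x)))
    grad = np.array([np.mean(p_hat - y), np.mean((p_hat - y) * x)])
    theta -= 0.5 * grad

print(theta)                                 # heads toward the population parameters (-1, 2)
print(estimated_cross_entropy(theta, x, y))  # approaches its lower bound, the population conditional entropy
```

This is the same pattern, at a much smaller scale, as training a neural network with the usual cross-entropy loss: sample from the data, estimate the cross-entropy, and adjust the parameters to reduce it.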

You can check that $H(P, \hat P)$ is bounded below by $H(P)$ – therefore, we often subtract the entropy of the target distribution from the cross-entropy to get a value called the Kullback-Leibler (KL) divergence:

$$D_{KL}(P, \hat P) := H(P, \hat P) - H(P).$$

A KL divergence of zero indicates that the cross-entropy has reached its minimum value of $H(P)$; intuitively, our model matches the population distribution exactly.
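As a small numerical check, using hypothetical discrete probability vectors (the helper functions here are made up for the example), the KL divergence is zero when the model matches the target exactly and positive otherwise.

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -E_{X ~ P}[log q(X)] for discrete probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D_KL(P, Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [0.5, 0.3, 0.2]                        # "population" distribution
print(kl_divergence(p, [0.5, 0.3, 0.2]))   # 0.0: the model matches exactly
print(kl_divergence(p, [0.1, 0.1, 0.8]))   # roughly 0.86: a poorly fitting model
```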

Conclusion

In machine learning, our goal is typically to find a parameterized distribution that fits the data as well as possible. We don’t know the ideal distribution of the data, but using an information-theoretic quantity like cross-entropy, we can compare candidate models and say which is closest to the ideal distribution (as reflected in the training data). The process of training is the process of exploring the parameter space to find the best model, the one closest to the population distribution.
