Why is it that courses on machine learning tend to start by covering some basic concepts from information theory? This post is a short explainer of why information theory is relevant to the formal study of machine learning.
If we want to get anywhere, we need to start with a solid concept of what “machine learning” actually means. Basically, we want a computer to be able to process and utilize complicated patterns in data. This is, of course, what we do whenever we fit a statistical model to data, and it turns out that the sorts of algorithms used for machine learning are really just complicated statistical models: neural networks, for example, are in their simplest form a stack of linear regressions with simple nonlinearities in between.
Let’s make this more explicit: a machine learning model is a highly flexible statistical model – it has many parameters, which allow it to fit a variety of statistical distributions. When we train a machine learning model, we are finding the parameters of the model that make the model approximate the training data as closely as possible. (Note that by statistical model, we mean a description of a probability distribution, e.g., a specification of a probability density function.)
In a typical case, we have data $x$ with labels $y$, related by a conditional distribution $p(y \mid x)$. If we want to create a model that predicts labels from data, we introduce a model $q_\theta(y \mid x)$, and choose the parameters $\theta$ to make $q_\theta$ as close as possible to $p$.
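To make the setup a bit more concrete, here is a minimal sketch (mine, not from anything above) of what such a parameterized model might look like, assuming binary labels and a logistic-regression-style $q_\theta(y \mid x)$; the weights `w` and bias `b` play the role of $\theta$, and the data point is made up:

```python
import numpy as np

# Hypothetical example: a logistic-regression-style conditional model
# q_theta(y | x) for binary labels y in {0, 1}. The weights w and bias b
# together play the role of the parameters theta.
def q_theta(y, x, w, b):
    p1 = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # model probability of y = 1
    return p1 if y == 1 else 1.0 - p1

# Probability the model assigns to label 1 for one (made-up) data point.
x = np.array([0.5, -1.2])
w, b = np.array([1.0, 0.3]), -0.1
print(q_theta(1, x, w, b))
```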
The natural question, of course, is what it means for one distribution to be “as close as possible” to another distribution. This is where information theory starts showing its usefulness. Intuitively, two distributions are “close” if they convey similar information, i.e., if sampling from one allows us to learn more or less the same thing as sampling from the other. This suggests that the metric we can use to measure distributional “closeness” should be something related to information content.
And now the concept of entropy enters the stage. The entropy of a distribution is a measure of how much information a sample from the distribution conveys – high-entropy distributions are unpredictable, so each sample carries a lot of information, while low-entropy distributions are predictable, so each sample carries very little. Formally, the entropy of a random variable $X$ with p.d.f. $p$ is

$$H(X) = -\mathbb{E}_{x \sim p}[\log p(x)] = -\int p(x) \log p(x)\, dx.$$

We don’t typically talk about the information content of statistical distributions, so this concept can seem a bit mysterious at first, and it’s worth thinking about what entropy really means in a statistical context. It may be helpful to think in terms of how “predictable” a distribution is: highly “predictable” distributions have low entropy, while “unpredictable” distributions have high entropy. A distribution concentrated on only a few values is “predictable” and therefore will have low entropy, while a distribution that spreads non-negligible probability over many values is “unpredictable” and will have relatively high entropy.
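As a quick numerical illustration (a sketch assuming a discrete distribution and entropy measured in bits), compare a nearly deterministic distribution to a uniform one:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# A "predictable" distribution: one outcome is almost certain.
print(entropy([0.98, 0.01, 0.01]))    # ~0.16 bits (low entropy)

# An "unpredictable" distribution: six equally likely outcomes.
print(entropy([1/6] * 6))             # ~2.58 bits (high entropy)
```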
What if we want to describe how much information one random variable gives us about another? Here there are a few different concepts that we can make use of. First, we have conditional entropy. Formally, the conditional entropy of $Y$ given $X$ is defined as

$$H(Y \mid X) = -\mathbb{E}_{(x, y) \sim p(x, y)}[\log p(y \mid x)].$$
Since

$$p(y \mid x) = \frac{p(x, y)}{p(x)},$$

we can decompose the conditional entropy of $Y$ given $X$ into the difference of the entropies of the joint distribution of $X$ and $Y$ and of $X$ alone:

$$H(Y \mid X) = H(X, Y) - H(X).$$

Intuitively, conditional entropy tells us how predictable $Y$ is, given that we already know $X$. As an example, suppose that $X$ describes the outcome of the roll of a die, and $Y$ describes whether the same die roll is even or odd. If we know $X$, then we also know $Y$, i.e., $Y$ is completely predictable given $X$; therefore, $H(Y \mid X)$ is as low as possible, namely zero. On the other hand, if $X$ gives us no information about $Y$, then $H(Y \mid X) = H(Y)$.
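Here is a small sketch of the die example, computing $H(Y \mid X) = H(X, Y) - H(X)$ numerically for a fair die and its parity (again assuming entropy in bits):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of X (die roll, 1..6) and Y (parity: column 0 = even,
# column 1 = odd). Each roll pairs with exactly one parity, probability 1/6.
joint = np.zeros((6, 2))
for roll in range(1, 7):
    joint[roll - 1, roll % 2] = 1 / 6

H_XY = entropy(joint.ravel())      # H(X, Y), entropy of the joint
H_X = entropy(joint.sum(axis=1))   # H(X), marginal over parity
H_Y = entropy(joint.sum(axis=0))   # H(Y), marginal over the roll

print(H_XY - H_X)   # H(Y | X) ~ 0: parity is fully determined by the roll
print(H_Y)          # 1 bit: without the roll, parity is a fair coin flip
```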
Conditional entropy is useful for thinking about the relationship between distributions, but only in situations where we know the joint distribution of the random variables involved. That is, if we have some model for the joint distribution of $X$ and $Y$, then we can use conditional entropy to think about how much information $X$ gives about $Y$ and vice versa. However, if the goal is instead to compare alternative models for the same data, we need a different tool. This is where cross-entropy, a key concept in machine learning, comes in.
Here we return to the canonical example presented in the first section: we have data $x$ with labels $y$, with a conditional distribution described by $p(y \mid x)$, and we want to find parameters $\theta$ so that our model $q_\theta(y \mid x)$ is “as close as possible” to $p(y \mid x)$. The typical way we describe this is with the cross-entropy

$$H(p, q_\theta) = -\mathbb{E}_{(x, y) \sim p}[\log q_\theta(y \mid x)].$$
Notice that this is similar to the definition of entropy, except that here we sample from the population distribution $p$ while using our model distribution $q_\theta$ inside the expectation.
(Also note that this is different from joint entropy, which I denoted by $H(X, Y)$ earlier. Sometimes people use the same notation for both, which can be confusing.)
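To see how cross-entropy behaves, here is a toy sketch with made-up discrete distributions: a “population” distribution $p$ and two candidate models, one close to $p$ and one far from it:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

def cross_entropy(p, q):
    """H(p, q) = -E_{x ~ p}[log2 q(x)] for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.7, 0.2, 0.1])        # "population" distribution
q_good = np.array([0.6, 0.3, 0.1])   # a model close to p
q_bad = np.array([0.1, 0.1, 0.8])    # a model far from p

print(entropy(p))                # ~1.16 bits: H(p)
print(cross_entropy(p, q_good))  # ~1.20 bits: slightly above H(p)
print(cross_entropy(p, q_bad))   # ~3.02 bits: much larger
print(cross_entropy(p, p))       # equals H(p): the best any model can do
```

The last line illustrates the lower bound discussed below: the cross-entropy can never drop beneath the entropy of $p$ itself.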
Why is this useful? When we are doing machine learning, we don’t actually know the form of the population distribution $p$ – that’s what we’re trying to figure out – so we can’t plug anything into its p.d.f.; however, we can sample from it by sampling from our training data. In training the model $q_\theta$, we want it to assign as much probability as possible to the data we actually observe, so we choose $\theta$ to minimize $H(p, q_\theta)$, estimated in practice as the average negative log-likelihood of the training samples under $q_\theta$.
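Here is a minimal sketch of that idea in a toy setting (all of it assumed for illustration): labels drawn from a Bernoulli(0.7) population, a model family $q_\theta = \mathrm{Bernoulli}(\theta)$, the cross-entropy estimated from training samples as an average negative log-likelihood, and “training” done by a simple grid search over $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: the "population" label distribution is Bernoulli(0.7),
# but we only get to see samples from it (our training data).
train_y = rng.random(10_000) < 0.7   # boolean array of sampled labels

# Model family: q_theta(y) = Bernoulli(theta). The cross-entropy H(p, q_theta)
# is estimated as the average negative log-likelihood over the samples.
def estimated_cross_entropy(theta, y):
    return -np.mean(np.where(y, np.log(theta), np.log(1 - theta)))

# "Training" here is just a grid search over theta for the minimum.
thetas = np.linspace(0.01, 0.99, 99)
losses = [estimated_cross_entropy(t, train_y) for t in thetas]
print(thetas[int(np.argmin(losses))])   # close to 0.7: the minimizer recovers p
```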
You can check that $H(p, q_\theta)$ is bounded below by $H(p)$ – therefore, we often subtract the entropy of the target distribution from the cross-entropy to get a value called the Kullback-Leibler (KL) divergence:

$$D_{\mathrm{KL}}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p) \geq 0.$$
A KL divergence of zero means the cross-entropy has reached its lower bound $H(p)$, which happens exactly when $q_\theta$ matches $p$; intuitively, our model fits the population distribution as well as possible.
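For completeness, here is a sketch of the KL divergence computed directly for the same kind of made-up discrete distributions as above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q_good = np.array([0.6, 0.3, 0.1])
q_bad = np.array([0.1, 0.1, 0.8])

print(kl_divergence(p, q_good))  # ~0.04 bits: small but positive
print(kl_divergence(p, q_bad))   # ~1.87 bits: a poor fit
print(kl_divergence(p, p))       # 0: the model matches the target exactly
```

Computing it this way gives the same numbers as taking cross-entropy minus entropy; the two forms are algebraically identical.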
In machine learning, our goal is typically to find a parameterized distribution that fits the data as well as possible. We don’t know the ideal distribution of the data, but using an information-theoretic value like cross-entropy we can compare candidate models and say which is closest to the ideal distribution (as reflected in training data). The process of training is the process of exploring the parameter space to find the best (closest to the population distribution) model.