Information & Entropy

What is information and how is it measured? What is entropy? Cross entropy? Relative entropy (aka KL Divergence)?

Omar Aflak
Published in TDS Archive · 10 min read · Feb 21, 2025


Information

Information is tied to probability theory, and it can be seen as a measure of uncertainty. To avoid extrapolation and misuse of this concept, remember that it only makes sense to talk about information (in the mathematical sense) when you are studying a probabilistic event.

Information relates to probabilities in that the realisation of an event with low probability brings a lot of information, and the realisation of an event with high probability brings little information.

For example: the event “It rains in London” is very likely therefore it brings little information. In contrast, the event “It rains in the Sahara Desert” is so unlikely, it brings a lot more information (e.g. it could more realistically help you pin-point the day of the event).

So information relates to probability, but how exactly?

Let’s explore the properties we would like such a mapping to have:

  1. Low probability ⇒ high information (already established)
  2. High probability ⇒ low information (already established)
  3. Probability = 1 ⇒ Information = 0 (derived from 2 — that’s because if an event is certain to be realised, then knowing about it doesn’t bring about any information)
  4. Probability → 0 ⇒ Information → +inf (the counterpart of 3 at the other extreme: the less likely an event is, the more information its realisation carries)
  5. Information should be additive for independent events, i.e. learning about two independent events should give you the amount of information equal to the sum of the information gained from each event separately:

information(event1 and event2) = information(event1) + information(event2)

Given that the probability of two independent events to be realised together is p(event1) * p(event2) (product of the probabilities of each event to happen individually), then we must have:

information(p(event1) * p(event2)) = information(p(event1)) + information(p(event2))

I’m abusing the previous notation, because here we passed a probability into this information function instead of an event, but that’s just to illustrate the idea…

The realisation of two independent events brings as much information as the sum of the information gained from each event separately.

Lastly, if we need this mapping function to be continuous, then there’s only one family of functions that respects those properties: the logarithms.

More precisely, the negative logarithms:

-log(x)

We define information mathematically as:

I(x) = -log(p(x))

Relates the information content of an event `x` to its probability of realisation `p(x)`
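
As a quick sanity check, here is a minimal Python sketch (the function name `information` is mine, not a standard API) that computes information content and verifies the additivity property for independent events:

import math

def information(p: float, base: float = 2) -> float:
    """Information content of an event with probability p (bits by default)."""
    return -math.log(p, base)

# A certain event carries no information, an unlikely one carries a lot.
print(information(1.0))    # -0.0, i.e. zero bits
print(information(0.5))    # 1.0 bit
print(information(0.001))  # ~9.97 bits

# Additivity for independent events: I(p1 * p2) == I(p1) + I(p2)
p1, p2 = 0.5, 0.25
print(information(p1 * p2), information(p1) + information(p2))  # both 3 bits (up to floating point)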

We said “the family of functions” earlier — indeed, logarithms of all bases respect the properties listed above. You can use any of them, the difference will be in the unit of the information:

  • log2(x) will give “bits”
  • log10(x) will give “dits”
  • ln(x) will give “nats”

All of those are valid ways of expressing information. In practice, we often use the base 2 logarithm.

Bit: represents the amount of information content gained with a binary choice.

Example 1

I flip a fair coin p(heads) = p(tails) = 1/2 and tell you the result. I have just given you: -log2(1/2) = log2(2) = 1 bit of information! In other words, I have given you the information content gained with 1 binary choice.

Recall that log(1/x) = -log(x).

This is what the logarithm in base 2 does: it answers the question “how many times do I have to divide x by 2 to get 1 (or less)?”, or in other words, how many binary choices do I have to make on my input space to be left with 1 element (or less). Each of these binary choices (divisions) represents one bit of information.

Example 2

I have to pick one fruit between 8 different fruits (assume each is equally likely to be picked). I pick one and tell you which: I have just given you -log2(1/8) = log2(8) = 3 bits of information. In other words, I have given you the information content gained with 3 binary choices (divide 8 by two 3 times).

I could have measured information in “dits” instead. It is equally correct to say that I would have given you -log10(1/8) = log10(8) ≈ 0.9 dits of information.

3 bits ≈ 0.9 dits
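
To make the unit change concrete, here is a small Python sketch showing that the same event measured with different logarithm bases gives the same information, just in different units:

import math

p_fruit = 1 / 8  # probability of picking one specific fruit out of 8

bits = -math.log2(p_fruit)   # 3.0 bits
dits = -math.log10(p_fruit)  # ~0.903 dits
nats = -math.log(p_fruit)    # ~2.079 nats

# The units differ only by a constant factor, e.g. 1 dit = log2(10) bits.
print(bits, dits * math.log2(10))  # both ~3.0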

Entropy

In the previous examples, you’ll notice that I used uniform probability distributions. This means the probability of each outcome was equally likely (p=1/2 for the coin toss, and p=1/8 for the fruit pick). Then I asked:

What is the information gained for observing one of those events?

Since the probability was the same for all events, then the answer to that question would be the same regardless of the outcome of the random experiment.

What if each outcome had a different probability of realisation?

What if I had to pick between 3 fruits, each with a different probability according to my preferences:

  • Mango (p=0.7)
  • Apple (p=0.2)
  • Orange (p=0.1)

A natural question is: on average, what is the information gained for observing an event from that random experiment? We are asking the same question as before, but of course since each outcome has a different probability and since the information depends on the probability, the result will change for different outcomes. Therefore we ask about the average outcome.

One way to compute this average is to sum the information gained from each event, weighted by its probability of realisation.

0.7 * (-log2(0.7)) + 0.2 * (-log2(0.2)) + 0.1 * (-log2(0.1)) ≈ 1.16 bits

Expected information gained for observing an event from the random experiment

That is exactly what entropy is!

We call entropy the expected amount of information gained for observing an event from a random variable. In other words, this answers the question: “If I sample an event from a random variable X, what is, on average, the information gained for observing one of x1, x2, ..., or xn, given the probability distribution of those events?”.

We usually denote the entropy of a random variable X as H(X):

H(X) = -Σ p(xi) * log2(p(xi))   (sum over all events xi)

Entropy
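
A minimal Python sketch of this formula (the function name `entropy` is mine), applied to the distributions from the examples so far:

import math

def entropy(probs, base=2):
    """Expected information content of a distribution (bits by default)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# The fruit-picking example: mango 0.7, apple 0.2, orange 0.1
print(entropy([0.7, 0.2, 0.1]))  # ~1.157 bits

# Uniform distributions from the earlier examples
print(entropy([0.5, 0.5]))   # 1.0 bit  (fair coin)
print(entropy([1/8] * 8))    # 3.0 bits (8 equally likely fruits)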

An interesting follow-up question is: when is the entropy minimal/maximal?

Without even doing any calculation, we can intuitively try to answer. Give it a thought!

Minimal Entropy

Since entropy is the expected information gained from observing a random variable, and since information is minimal when an event is certain to be realised, the absolute minimum is reached when the random variable is perfectly predictable, i.e. when one event has probability p=1 and all the others p=0 (in which case H(X)=0). Any other probability distribution would yield some positive amount of information.

Maximal Entropy

Entropy is maximised when the average information is maximal. We know information is highest for the most improbable events (p→0). So if we have multiple events and we want each of them to be as improbable as possible, the best we can do on average is to spread the probability mass equally over all events, that is p=1/n with n the number of events, or in other words: a uniform probability distribution. In that case the entropy is log2(n).

You can see the uniform distribution as the most “unpredictable” — the one for which each event brings the maximum amount of information content.
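
Reusing the `entropy` sketch from above, we can check both extremes numerically:

print(entropy([1.0, 0.0, 0.0]))   # -0.0, i.e. 0 bits: perfectly predictable, minimal entropy
print(entropy([0.7, 0.2, 0.1]))   # ~1.157 bits: somewhere in between
print(entropy([1/3, 1/3, 1/3]))   # ~1.585 bits = log2(3): uniform, maximal entropy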

I highly advise checking out the 3Blue1Brown video on how to solve the game Wordle using the concept of entropy.

There’s also another way to interpret entropy, and it’s going to be useful for the rest of the article, so before going further with Cross Entropy and Relative Entropy, we’re making a little stop at encoders.

Encoders

An encoder is a machine/routine/code that assigns to each event of a probability distribution a code (let’s say in bits, but we could use another base).

An encoder is optimal if, on average, it uses the theoretical minimum number of bits possible to represent an event drawn from the distribution.

Example 1

Say we have three events A,B,C, with p(A)=p(B)=p(C)=1/3.

We could create a coding (a mapping) that uses 2 bits to encode each outcome:

A->00, B->01, C->10

If I then give you a list of bits, e.g. 011000, you are able to decode it (by splitting every 2 bits and using the mapping above): 01 10 00 → BCA. This works out fine, but we are wasting the 11 state of our 2 bits, which accounts for 25% of all possible states! This is not very efficient.

What if we assigned fewer bits to some events?

Example 2

Consider the following encoder:

A->0, B->10, C->11

Here, we use a total of 5 bits to encode the 3 states (instead of 6 bits in the previous coding), that is 5/3 ≈ 1.67 bits on average, which is less than the 2 bits we used previously.

With this new encoder, suppose we read the first 2 bits of a message b1,b2:

  • if b1 == 0 then A
  • if b1 == 1 and b2 == 0 then B
  • if b1 == 1 and b2 == 1 then C

And we can keep reading and decoding a long string of bits that way.
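
Here is a minimal Python sketch of a decoder for this prefix code (the dictionary and function name are just for illustration):

# Prefix code from Example 2: no codeword is the prefix of another,
# so a bit string can be decoded unambiguously from left to right.
DECODE = {"0": "A", "10": "B", "11": "C"}

def decode(bits: str) -> str:
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in DECODE:  # a complete codeword has been read
            symbols.append(DECODE[buffer])
            buffer = ""
    return "".join(symbols)

print(decode("010110"))  # ABCA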

You might ask, why not use even fewer bits? Well, let’s see.

Example 3 ❌

Consider this final encoder:

A->0, B->1, C->00

This uses fewer bits than the previous two, but it is also ambiguous!

The bit string 00 could be either AA or C, and there’s no way around this.

An important feature of an encoder is that it has to be unambiguous, i.e. decodable in a single way.

Encoders & Entropy

How does that relate to entropy?

Think about the optimal encoder: that will be the encoder that assigns, on average, the least amount of bits possible to an event of your distribution.

In Example 2 above, we considered A, B, C to be equally likely; but what if C was more probable than A and B? Wouldn’t it be better then to assign the single bit to C and two bits to A and B?

In general, to achieve optimality, we need to assign fewer bits to more probable outcomes and more bits to less probable outcomes.

A natural question is then: what is the minimum number of bits we can use to encode events drawn from a given probability distribution?

The answer is… the entropy!

To clarify: the entropy is the theoretical minimum, but in practice you may not be able to come up with an encoder that uses exactly `entropy` bits on average.
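
To connect the two ideas, here is a quick Python sketch comparing the average code length of the Example 2 encoder with the entropy of the distribution (the helper names are mine):

import math

def entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

code_lengths = {"A": 1, "B": 2, "C": 2}  # the encoder from Example 2

def average_length(probs):
    """Expected number of bits per symbol under the Example 2 code."""
    return sum(p * code_lengths[symbol] for symbol, p in probs.items())

uniform = {"A": 1/3, "B": 1/3, "C": 1/3}
skewed = {"A": 0.7, "B": 0.2, "C": 0.1}  # A is most likely and gets the shortest code

print(average_length(uniform), entropy(uniform.values()))  # ~1.667 vs ~1.585 bits
print(average_length(skewed), entropy(skewed.values()))    # 1.3   vs ~1.157 bits

# In both cases the average code length stays above the entropy, as expected.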

Now that we’re equipped with this new insight, let’s tackle the next concepts!

Cross Entropy

Let’s say I have a machine that produces random letters (a-z) according to a certain unknown probability distribution P = [p(a), p(b), …, p(z)].

Your task is to create an encoder that is as efficient as possible for data coming from this distribution, i.e. an encoder that uses, on average, the least amount of bits possible to encode events from this distribution.

We know from earlier that the optimal encoder uses, on average, a number of bits equal to the entropy of the distribution, H(P). But for this you need to know the exact distribution, and here you don’t!

Therefore, you will have to guess what the true distribution is and produce an encoder based on your guess. Let’s call your guessed distribution Q = [q(a), q(b), ..., q(z)]. The average number of bits used by your encoder for Q will be greater than or equal to H(P); and the actual amount is called the cross entropy between P and Q.

The cross entropy between P and Q is the average number of bits needed to encode events from P using an optimal encoder for Q. We denote that number H(P, Q).

Said differently, it means that you were expecting data from a probability distribution Q, but in reality the data belonged to a probability distribution P. And the average amount of bits used to encode those events from P (while expecting they were drawn from Q) is what we call the cross entropy.

Can you guess the formula?

H(P, Q) = -Σ p(x) * log2(q(x))   (sum over all events x)

Cross Entropy between P and Q

This looks very much like H(Q), but the information is weighted by the probabilities coming from P. This makes sense:

You will be using Q to encode events coming from the machine, therefore the information content will be calculated using q(x). However, the actual weighting of the information from each event comes from P since that is the true frequency of the events.

Notice that H(P, Q) ≠ H(Q, P). The result changes depending on which is the true distribution and which is the guessed distribution.

Also, notice that if you had guessed P perfectly well (Q=P), then the result should be the theoretical minimum number of bits possible to encode events from P, that is the entropy:

H(P, P) = -Σ p(x) * log2(p(x)) = H(P)

Cross Entropy with a distribution and itself is the Entropy
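
A small Python sketch putting these properties together (the distributions are made up for illustration):

import math

def cross_entropy(p, q, base=2):
    """Average bits needed to encode events from p using a code optimised for q."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]  # true distribution
Q = [0.4, 0.4, 0.2]  # guessed distribution

print(cross_entropy(P, Q))  # ~1.422 bits, above H(P) ~1.157
print(cross_entropy(Q, P))  # ~1.799 bits: H(P, Q) != H(Q, P)
print(cross_entropy(P, P))  # ~1.157 bits: equals the entropy H(P)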

Relative Entropy

Lastly, the relative entropy, also known as the KL divergence.

If you’ve understood the cross entropy, then this should be a piece of cake!

Recall that the cross entropy is the average number of bits used if you encode events drawn from a distribution P while expecting the events to come from a distribution Q. We said this number must be greater than or equal to H(P), since that would be the number of bits used by a perfect encoder for P.

The number of extra bits used relative to H(P) is what we call the relative entropy, and we denote it KL(P||Q)! That is, not the entire number of bits, but just the extra ones you used due to the error in guessing P.

Essentially, that is the difference in bits used between a suboptimal encoder and an optimal one. So this is simply H(P, Q) - H(P).

KL(P||Q) = H(P, Q) - H(P) = Σ p(x) * log2(p(x) / q(x))

Relative Entropy, aka KL Divergence

Like the cross entropy, the relative entropy is not symmetric: KL(P||Q) ≠ KL(Q||P). You can understand it as a measure of relative difference between two probability distributions, the minimum being 0 when Q=P.
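
Building on the sketches above (same made-up distributions):

import math

def entropy(p, base=2):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2):
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2):
    """Extra bits paid for encoding events from p with a code optimised for q."""
    return cross_entropy(p, q, base) - entropy(p, base)

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

print(kl_divergence(P, Q))  # ~0.265 extra bits
print(kl_divergence(Q, P))  # ~0.277 extra bits: not symmetric
print(kl_divergence(P, P))  # 0.0: no penalty when the guess is exact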

Last Note

In machine learning we try to minimise the cross entropy:

H(P, Q) = -Σ p(x) * log2(q(x))

Cross Entropy

Where P is the distribution of the data, and Q is the distribution of the model. Since the data doesn’t change during training, H(P) is a constant, so minimising the cross entropy H(P, Q) is equivalent to minimising the relative entropy KL(P||Q), i.e. the difference between P and Q.
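
As a toy illustration of this last point, here is a sketch of the cross entropy for a single classification example (the labels and predictions are made up; real ML losses typically use the natural logarithm, but the idea is the same):

import math

def cross_entropy(p, q, base=2):
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# One training example with 3 classes: the true label is class 1 (one-hot).
P = [0.0, 1.0, 0.0]         # data distribution (one-hot label)
Q_bad = [0.5, 0.2, 0.3]     # model prediction early in training
Q_good = [0.05, 0.9, 0.05]  # model prediction after training

print(cross_entropy(P, Q_bad))   # ~2.32 bits
print(cross_entropy(P, Q_good))  # ~0.15 bits: the loss drops as Q approaches P

# With a one-hot P, H(P) = 0, so here the cross entropy equals the KL divergence.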
