Quantifying the amount of information in a data set

Doug Kerr

Well-known member
An important concept in the field of information theory is that there is a measure of the amount of information contained in a set of data.

I will try and give an intuitive understanding of that concept.

Suppose we have a situation in which there are a number of data items, and each of them can have only one of 8 possible values. We are tempted to say that we must use 3 bits to encode each item's value.

We have heard that an amount of information can be quantified in bits. Thus perhaps there are 3 bits of information in each item.

True, but only if all values are equally likely to occur. The importance of this condition may not be at first evident. But bear with me - I will in the next part of this series give a discussion of variable length encoding that illuminates why it matters.

Let's for now assume, however, that in fact all 8 values are equally likely to occur; that is, each has a probability, p, of 0.125 (1/8) of occurring as the value of any data item.

It turns out that the amount of information (denominated in bits) in each data item, i, is given by:

i = -log2(p)

where "log2" means the binary, or base 2, logarithm.

In the example, with p = 0.125, i would be 3 bits - no surprise, since we had thought that intuitively. But it's good to check!
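As a quick check of that formula, here is a minimal sketch in Python (used here purely for illustration):

```python
import math

# Information, in bits, conveyed by a data item whose value has probability p.
def information_bits(p):
    return -math.log2(p)

print(information_bits(0.125))  # 3.0 bits, as expected for 8 equally likely values
```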

But now suppose that all values are not equally probable.

Then we go through the list of the possible values. For each, based on its probability, p, we assess the information content, i, of a data item having that value. We then add those up over the entire "alphabet" of possible values, weighting each value of i by the probability, p, of that value occurring within the data set. The result will be the information content, in bits, of the collection of data items. (Yes, p gets into this twice - the infamous -p log p.)

So there is less information in the fact that a data item has a "likely" value than if it has an "unlikely" value. To use an old joke, "dog bites man" is not really news; "man bites dog" is.

It turns out that for a given number of possible values, the more disparate are the probabilities of occurrence of the different possible values in a data set, the lower will be the overall information content of the data set. The highest information content is when all possible values are equally probable.
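As a small sketch of that bookkeeping (again in Python, purely for illustration), here is the weighted sum described above, applied both to eight equally likely values and to a deliberately lopsided set of eight:

```python
import math

# Average information per data item, in bits: the sum over all possible
# values of -p * log2(p). Values with p = 0 contribute nothing.
def bits_per_item(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Eight equally likely values: 3 bits per item, the maximum possible.
print(bits_per_item([1 / 8] * 8))          # 3.0

# Eight values with very disparate probabilities: well under 3 bits per item.
print(bits_per_item([0.93] + [0.01] * 7))  # about 0.56
```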

Thus, for our example where there are 8 possible values, we find that if they are all equally likely to occur in our data set, the total information content is 3 bits per item - just what we suspected at the beginning from thinking of the number of bits required to simplistically encode each item. (Note that this simplistic view only works when the number of values is an integral power of two: 2, 4, 8, 16, etc.)

Now, suppose all values are not equally probable. For example, consider a data set in which the items are the number of system alarms in each one-hour period. Hopefully, it is extremely probable that for any hour there will be no system alarms. Thus the value "0" may have a very high probability of occurring. Each time it does occur, that represents a small amount of information. And since that happens a lot, we run up an overall small amount of information from all those goose eggs.

Each time we get a non-zero value, that represents a larger amount of information. But that doesn't happen very often, so again we run up a small total from all those occurrences. The overall result is a relatively small total amount of information.
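To make that concrete with made-up numbers (the counts below are purely hypothetical, not from any real system), suppose a week of hourly readings came out as follows. Estimating each value's probability as its share of the 168 readings, the same weighted sum gives:

```python
import math

# Hypothetical one-week tally of hourly alarm counts (168 readings); the
# figures are invented purely for illustration. Values 4 through 7 happened
# never to occur, so they contribute nothing to the sum.
counts = {0: 160, 1: 5, 2: 2, 3: 1}
total = sum(counts.values())   # 168 hourly readings

bits_per_item = sum(-(n / total) * math.log2(n / total) for n in counts.values())
print(bits_per_item)           # about 0.34 bits per item, far below 3
print(bits_per_item * total)   # about 57 bits for the whole week's data
```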

Thus, an analysis of this data for some period (a week, perhaps) will suggest that the amount of information is far less than 3 bits per item. But in a simple system, we would use three actual "physical" (well, "logical") bits to encode each item. Is that somehow wasteful? It is indeed. We will look into that in the next part of this series.

Best regards,

Doug
 

Doug Kerr

Well-known member
In part 1 of this series, I discussed the matter of reckoning the amount of information in a data item, and emphasized that it was related to the probability of that particular value occurring. I didn't really give a "justification" for that except by invoking the old story about "dog bites man".

I also pointed out that, if all the possible values are not equally probable, then the total amount of information contained in a set of data would be less than the number of bits we would use to encode the data in the traditional way. I said "yes, this is inefficient", and intimated that we might be able to do better. Here I will discuss how we might do that, and in the process hopefully further illuminate why the probability of occurrence of a value controls the amount of information in a data item that has that value.

Suppose that every day, we must transmit over some channel a report that comprises 1000 data items. Each item can take on only the values 0, 1, 2, and 3.

Suppose that in Monday's report, the distribution of occurrence of these values is as follows:

Value Occurrences

0 925
1 35
2 25
3 15

Yes, I have purposely chosen an example where the probabilities are greatly disparate. That is merely to make the effect I discuss more visible.

One obvious way to encode this data is with a straightforward two-bit code, which we might construct as follows:

Value Code

0 00
1 01
2 10
3 11

When we do that, then to encode the entire report (1000 items) will require 2000 bits.

But we could also consider a variable length code. We might assign its symbols this way:

Value Code

0 0
1 101
2 110
3 111

Variable-length codes of this type are often called Huffman codes. Huffman, however, did not really introduce this concept. His contribution was an orderly system for actually devising such codes, which made them usable in a practical sense.

How would the "receiver" parse the incoming bit stream? If the first bit was a 0, then the symbol is 1 bit long (and would mean "0"). If the first bit was a 1, then the symbol was three bits long, and the next two bits gave the value.

One might ask: suppose the receiver got confused? Well, that could happen with a fixed-length code as well, so in either case we have to make some provision for taking care of that.
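For the curious, here is a minimal sketch of that parsing rule in Python, written specifically for this code table rather than as a general-purpose decoder:

```python
def decode(bits):
    """Decode a string of '0'/'1' characters using the example code:
    0 -> value 0, 101 -> 1, 110 -> 2, 111 -> 3."""
    table = {"0": 0, "101": 1, "110": 2, "111": 3}
    values, i = [], 0
    while i < len(bits):
        # A leading 0 is a complete one-bit symbol; a leading 1 starts a three-bit symbol.
        length = 1 if bits[i] == "0" else 3
        values.append(table[bits[i:i + length]])
        i += length
    return values

print(decode("0101011100"))  # [0, 1, 0, 3, 0, 0]
```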

Now, for Monday's report, how many bits would we have to send for the 1000 data items? We can tot that up here:

Value  Occurrences  Bits each  Total bits for items with that value

0      925          1          925
1      35           3          105
2      25           3          75
3      15           3          45

REPORT TOTAL  1150 bits

We see that fewer bits overall were needed to convey the entire report than with a fixed length code.
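The same tally can be done in a couple of lines of Python (the counts are the made-up Monday figures from above):

```python
# Monday's occurrence counts and the symbol lengths of the two codes.
occurrences   = {0: 925, 1: 35, 2: 25, 3: 15}
fixed_bits    = {0: 2, 1: 2, 2: 2, 3: 2}   # two-bit code: 00, 01, 10, 11
variable_bits = {0: 1, 1: 3, 2: 3, 3: 3}   # prefix code: 0, 101, 110, 111

print(sum(n * fixed_bits[v] for v, n in occurrences.items()))     # 2000 bits
print(sum(n * variable_bits[v] for v, n in occurrences.items()))  # 1150 bits
```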

Now, would that same code table be best for Tuesday's report, if it had a different distribution of occurrences of the values? No. A new code would probably have to be devised. (There are algorithms for doing so conveniently, ones devised by Huffman being among the prominent ones.)
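As a sketch of what such an algorithm looks like, here is the classic Huffman construction in Python: repeatedly merge the two least frequent subtrees until only one tree remains. (This is the standard textbook procedure, not code from any particular library.)

```python
import heapq
from itertools import count

def huffman_code(frequencies):
    """Build a prefix code from a value -> frequency mapping by repeatedly
    merging the two least frequent subtrees (Huffman's construction)."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), {value: ""}) for value, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, code1 = heapq.heappop(heap)
        f2, _, code2 = heapq.heappop(heap)
        # Prefix every codeword in one subtree with 0 and in the other with 1.
        merged = {v: "0" + c for v, c in code1.items()}
        merged.update({v: "1" + c for v, c in code2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({0: 925, 1: 35, 2: 25, 3: 15}))
# {1: '00', 3: '010', 2: '011', 0: '1'}
# (value 0 gets 1 bit, value 1 gets 2 bits, values 2 and 3 get 3 bits each)
```

(For Monday's particular counts this construction happens to find a slightly shorter code than the hand-assigned one above - value 1 gets a two-bit codeword - bringing the report total down to 1115 bits.)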

How would the receiver know which code table was being used? We send a description of the code table in a preamble to the message. Won't that eat into the gain in efficiency? Yes, but in reality the reports would probably be much bigger than in my example, so that load is easily absorbed.

Although I don't demonstrate this in detail, I think we can see intuitively from this example that we can encode the information in a data item with fewer bits if the probability of occurrence of the item's value is larger. (In the example, we can encode items whose value is "0", the probability of which is 0.925, with one bit, whereas to encode items whose value is 1, 2, or 3, the probabilities of which are 0.035, 0.025, and 0.015 respectively, we must use three bits.)

Now, we can intuitively grasp that this means there must be less information in the "high probability" value (that's "dog biting man" - not much news there) than in the lower-probability values. If we can encode an item with value "0" with 1 bit, there must be no more than 1 bit of information in that item. (In fact, the information in that item is 0.112 bit!)

But of course in the encoding, we are stuck with integral numbers of bits in our "symbols" - we cannot use 0.112 bits to encode the value "0".

Further, there is some overhead. In order to allow the receiver to figure out how long each symbol is, there are symbols we cannot use. (For example, we cannot use 001, since the receiver would see the "0" and think that it was all of a one-bit symbol.)

By the way, the actual total information content in the example report is only about 497 bits - just under half a bit per item.
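That figure can be tallied with the formula from part 1, taking each value's probability to be its share of the 1000 items (Python again, purely for illustration):

```python
import math

occurrences = {0: 925, 1: 35, 2: 25, 3: 15}
total_items = sum(occurrences.values())

# Sum, over the whole report, of -log2(p) for each item's value.
info_bits = sum(n * -math.log2(n / total_items) for n in occurrences.values())
print(info_bits)                # about 497 bits for the whole report
print(info_bits / total_items)  # about 0.50 bits per item, versus 1.15 coded bits per item
```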

In fact, the formal name of this type of code is prefix code. That means that no symbol (sequence of bits) representing a value could be the prefix - that is the first part - of another symbol that is used. The name of course sounds backwards, but they didn't ask me.​

So the total number of bits used to encode a report will be greater than the amount of information, in bits, contained. But it is still less than the number of bits used in just a "classical" binary encoding.

Best regards,

Doug
 