Doug Kerr
An important concept in the field of information theory is that there is a measure of the amount of information contained in a set of data.
I will try to give an intuitive understanding of that concept.
Suppose we have a situation in which there are a number of data items, and each of them can have only one of 8 possible values. We are tempted to say that we must use 3 bits to encode each item's value.
We have heard that an amount of information can be quantified in bits. Thus perhaps there are 3 bits of information in each item.
True, but only if all values are equally likely to occur. The importance of this condition may not be evident at first. But bear with me; in the next part of this series I will discuss variable-length encoding in a way that illuminates why it matters.
Let's assume for now, however, that all 8 values are in fact equally likely to occur; that is, each has a probability, p, of 0.125 (1/8) of being the value of any given data item.
It turns out that the amount of information, i (denominated in bits), in a data item whose value occurs with probability p is given by:
i = -log2(p)
where "log2" means the binary, or base 2, logarithm.
In the example, with p = 0.125, i would be 3 bits - no surprise, since we had thought that intuitively. But it's good to check!
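If you'd like to check it by machine as well, here is a tiny Python sketch of that formula (the function name self_information is just my label for it, not anything from a standard library):

```python
import math

# Self-information, in bits, of a value that occurs with probability p.
def self_information(p):
    return -math.log2(p)

print(self_information(0.125))  # 3.0 bits, as expected for p = 1/8
```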
But now suppose that all values are not equally probable.
Then we go through the list of the possible values. For each, based on its probability, p, we assess the information content, i, of a data item having that value. We then add those up over the entire "alphabet" of possible values, weighting each value of i by the probability, p, of that value occurring within the data set. The result is the average information content, in bits per item, over the collection of data items. (Yes, p gets into this twice - the infamous -p log p.)
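Here is a small Python sketch of that weighted sum (the guard against p = 0 just skips values that never occur, whose -p log p contribution is conventionally taken to be zero):

```python
import math

# Average information per item, in bits: the sum of -p * log2(p)
# over all possible values, each weighted by its probability p.
def entropy(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Eight equally likely values give 3 bits per item, as in our example.
print(entropy([0.125] * 8))  # 3.0
```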
So there is less information in the fact that a data item has a "likely" value than if it has an "unlikely" value. To use an old joke, "dog bites man" is not really news; "man bites dog" is.
It turns out that, for a given number of possible values, the more disparate the probabilities of the different values in a data set, the lower the overall information content of the data set. The information content is highest when all possible values are equally probable.
Thus, for our example where there are 8 possible values, we find that if they are all equally probable, the total information content is 3 bits per item - just what we suspected at the beginning from thinking of the number of bits needed to simplistically encode each item. (Note that this simplistic view only works when the number of values is an integral power of two: 2, 4, 8, 16, etc.)
Now, suppose all values are not equally probable. For example, consider a data set in which the items are the number of system alarms in each one-hour period. Hopefully, it is extremely probable that for any hour there will be no system alarms. Thus the value "0" may have a very high probability of occurring. Each time it does occur, that represents a small amount of information. And since that happens a lot, we run up only a small overall amount of information from all those goose eggs.
Each time we get a non-zero value, that represents a larger amount of information. But that doesn't happen very often, so again we run up only a small total from all those occurrences. The overall result is a relatively small total amount of information.
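To make that concrete, here is the same entropy calculation on a made-up distribution for the eight alarm counts (the 0.93 and 0.01 figures are purely illustrative, not from any real system):

```python
import math

# Same weighted sum as before: average bits per item.
def entropy(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical distribution: "0 alarms" dominates an hour's tally,
# and the seven non-zero counts split the rest evenly.
alarm_probs = [0.93] + [0.01] * 7
print(entropy(alarm_probs))  # about 0.56 bits per item - well under 3
```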
Thus, an analysis of this data for some period (a week, perhaps) will suggest that the amount of information is far less than 3 bits per item. But in a simple system, we would use three actual "physical" (well, "logical") bits to encode each item. Is that somehow wasteful? It is indeed. We will look into that in the next part of this series.
Best regards,
Doug