Doug Kerr
I will be speaking here of data compression in the sense of taking a suite of data that describes a thing and replacing it with a smaller suite of data that describes the same thing (perhaps "exactly", perhaps not).
Compression schemes can be divided into two classes, which can meaningfully be called "reversible" and "non-reversible".
In a reversible scheme, one can take the "smaller" suite of data and from it reconstruct exactly the original suite of data.
In a non-reversible scheme, one cannot do that.
Sadly, the term "lossless" has come into widespread use to denote reversible schemes, and the term "lossy" then was back-formed to denote non-reversible schemes.
Although there is an outlook that makes these terms sort of apt (I'll present that later), the usual concepts and explanations surrounding them are often misleading.
The direct connotation is that, in the "compressed" suite of data (under a "lossy" scheme), some of the original data has been lost. (It would hardly seem wise to say that all the original data had been lost, or else it would seem that the "compressed" suite of data would be useless.)
But, if that is the case, we should be able to say (if we have all the details of a specific case) "how much of the original data" has been lost, or just what "pieces" of it have been lost.
Let's consider the application of the concept to a familiar topic - the JPEG scheme for compressing digital image data.
We start with an image comprising some number of pixels, each of which has a color described with three 8-bit values (24 bits altogether). We encode it in JPEG form and save it, the file having far fewer bits than the original representation. Later, the file is "decoded" with a JPEG decoder, giving a "reconstructed" image of the same number of pixels, each again described by three 8-bit numbers.
How many of those 24-bit color descriptions are identical to the original colors of their pixels? I have not seen any studies of this, but my guess is a very small fraction - less than 1% for a random image pattern. (I may be way off.)
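If one wanted to test that guess, a sketch along the following lines would do it (assuming the Pillow and NumPy libraries; the file name and quality setting here are hypothetical placeholders):

# A rough sketch of measuring what fraction of pixels survive a JPEG
# round trip bit-for-bit. Assumes Pillow and NumPy; "original.png" and
# the quality setting are illustrative, not from the original post.
import numpy as np
from PIL import Image

original = Image.open("original.png").convert("RGB")
original.save("roundtrip.jpg", quality=85)            # JPEG encode
reconstructed = Image.open("roundtrip.jpg").convert("RGB")

a = np.asarray(original)
b = np.asarray(reconstructed)

# A pixel counts as "identical" only if all three 8-bit values match.
identical = np.all(a == b, axis=-1)
print(f"pixels identical after round trip: {identical.mean():.2%}")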
So why is this image useful to us? Because, generally, it "looks very much like" the original image.
Now, have we lost data? In a sense yes - probably almost all the original color values are "gone", replaced by values that differ perhaps in one bit, or perhaps in all 24 (and of course how far the color represented diverges from the original color depends on which bits are different).
So in that sense, the moniker "lossy" is apt.
But where we get into trouble is with descriptions that say, "in lossy compression, the data suite is made smaller by losing some of the original data." That just gives no insight at all into what is going on.
In reality, in "lossy" compression, we make the data suite smaller, not by "discarding" data, but by exploiting the redundancy in the data (in particular, in a way that exploits our perceptual tolerance of certain kinds of differences between the original image and the reconstructed one). And as a result, with a "lossy" algorithm, probably almost all the original data is gone from the reconstructed data suite.
Trying to keep consistent with that explanation in the case of a reversible ("lossless") scheme gets us quickly into trouble. "In lossless compression, the data suite is made smaller by not losing some of the original data". That should be a hint that we are in trouble with this whole notion.
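For contrast, the reversible case is easy to demonstrate concretely. Here is a minimal sketch using Python's standard zlib module (a reversible scheme); the reconstruction is bit-for-bit identical to the original:

# A minimal demonstration of a reversible ("lossless") scheme using
# Python's standard zlib module: the decoded suite of data is exactly
# the original, every bit of it.
import zlib

original = b"72.33 74.61 70.01 72.23 75.17 " * 100
compressed = zlib.compress(original)
reconstructed = zlib.decompress(compressed)

assert reconstructed == original        # exact reconstruction
print(len(original), "->", len(compressed), "bytes")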
I can further illuminate my point by a quite different example. Suppose we have a temperature sensor that delivers a high-precision digital representation of the air temperature. We compress this data to facilitate its transmission over a low-bit-rate channel to an indicator station. At the indicator station, the compressed data is decoded and the result (which we would hope would accurately mimic the original data) is displayed.
In this example, for each "reading", the first number is the temperature as digitized at the sensor end, and the second is the value displayed by the indicator (assume degrees Fahrenheit):
Sensor   Indicator
72.33    72.22
74.61    74.61
70.01    69.99
72.23    72.92
75.17    75.21
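To make the setup concrete, here is a toy sketch of one possible non-reversible scheme for such a channel (the 60-90 degree operating range and the one-byte code are my own illustrative assumptions, and this sketch does not produce the exact values in the table above):

# A toy non-reversible scheme: each reading is quantized to an 8-bit
# code covering an assumed 60-90 degree F operating range, so only one
# byte per reading crosses the low-bit-rate channel. The range and
# step size are illustrative assumptions only.
LO, HI, LEVELS = 60.0, 90.0, 256

def encode(temp_f):
    # map the reading onto one of 256 codes (one byte)
    return round((temp_f - LO) / (HI - LO) * (LEVELS - 1))

def decode(code):
    return LO + code / (LEVELS - 1) * (HI - LO)

for t in [72.33, 74.61, 70.01, 72.23, 75.17]:
    print(f"{t:6.2f} -> {decode(encode(t)):6.2f}")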
Now, as a result of this non-reversible compression, have we lost data? And if so, how much?
Possible answers:
• No, we have a credible display for every reading - no data has been lost. Some of it seems to have been "corrupted".
• Yes, some of it - 4 of the original 5 readings were not delivered (that is, as they were) in the reconstructed readings. We have lost 80% of them.
• Yes, some of it - 10 of the original 20 decimal digits were not delivered (that is, as they were) in the reconstructed readings. We have lost 50% of them.
In fact, it is not the data that has been lost - it is its correctness. And we might use various measures of that. We might for example decide that a good measure of the overall amount of incorrectness for this suite of temperature readings is the root-mean-square (RMS) error. In this case, that would be about 0.31 degree Fahrenheit.
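That figure is easy to check from the table (a few lines of Python):

# Verify the RMS error of the five readings shown above.
import math

pairs = [(72.33, 72.22), (74.61, 74.61), (70.01, 69.99),
         (72.23, 72.92), (75.17, 75.21)]
errors = [shown - sent for sent, shown in pairs]
rms = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"RMS error: {rms:.2f} degrees F")   # about 0.31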
My point here is not to suggest that authors (other than myself) give up use of the ill-chosen monikers "lossless" and "lossy" to describe reversible and non-reversible data compression. I just hope we will recognize that these are metaphorical terms, and do not really relate to any property that we could quantify ("Well, how lossy is it? What data did we lose?").
Best regards,
Doug