Doug Kerr
In the JPEG system for encoding an image, a key role is played by the discrete cosine transform (DCT). Here I will give some insight into what this is, how it works, and some implications of the specific way it is employed in JPEG encoding.
Background
The color space of JPEG
Most often the image presented to a JPEG encoder is in the sRGB color space. The color of each pixel is described by the values of three coordinates, R, G, and B. These are nonlinear presentations of the quantities of three "primaries" which, combined, would make the color of interest.
JPEG encoding works with a different color space, directly relatable to sRGB. Here again, the color of each pixel is described by the values of three coordinates, here Y, Cb, and Cr.
Y is a fixed linear combination of R, G, and B. It is something like luminance (in fact, its symbol, Y, is the symbol for luminance), but it is not luminance. (For it to truly correspond to luminance, it would have to be a fixed linear combination of r, g, and b, which directly represent the quantities of the three primaries in linear form).
Cb is derived from B-Y, and Cr from R-Y (in practice each difference is scaled and offset so that it fits the same 8-bit range as Y). These two together describe something like the chrominance of the pixel color (although since they are derived from the nonlinear variables B, R, and, through Y, G, they do not form any standard definition of chrominance).
Y, Cb, and Cr ordinarily are represented by 8-bit integers. Thus we see there is the opportunity for rounding error as we convert from the RGB color space to the YCbCr color space during JPEG encoding and then as we convert back after decoding.
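To make the rounding issue concrete, here is a minimal Python sketch of the round trip. The weights (0.299, 0.587, 0.114), the scaling constants (1.772, 1.402) and the offset of 128 are the ones commonly used in JFIF files; they are not stated above and are an assumption here, as are the function names.

```python
def clamp_round(x):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit nonlinear R, G, B to 8-bit Y, Cb, Cr (assumed JFIF constants)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # fixed linear combination of R, G, B
    cb = 128 + (b - y) / 1.772              # B-Y, scaled and offset
    cr = 128 + (r - y) / 1.402              # R-Y, scaled and offset
    return clamp_round(y), clamp_round(cb), clamp_round(cr)


def ycbcr_to_rgb(y, cb, cr):
    """Convert 8-bit Y, Cb, Cr back to 8-bit R, G, B."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp_round(r), clamp_round(g), clamp_round(b)


def round_trip_error(r, g, b):
    """Largest per-channel difference after RGB -> YCbCr -> RGB."""
    r2, g2, b2 = ycbcr_to_rgb(*rgb_to_ycbcr(r, g, b))
    return max(abs(r - r2), abs(g - g2), abs(b - b2))
```

Under these constants, pure red (255, 0, 0) encodes to (76, 85, 255) and decodes to (254, 0, 0): one unit of R is lost to rounding alone, before any of the actual compression has taken place.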
The pixel block
In JPEG encoding, the image is divided into square blocks of 8 × 8 pixels. Each block is worked independently, leading to a set of data which approximately describes the color of all the pixels in the block.
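The division into blocks can be sketched as follows. This is only an illustration: it assumes one plane of values (say, the Y values) held as a list of rows, with width and height that are exact multiples of 8. Real encoders have to pad the edges when that is not so. The function name is hypothetical.

```python
def split_into_blocks(plane, size=8):
    """Split a 2-D plane (list of rows) into size x size blocks.

    Blocks are returned left to right, then top to bottom.
    Assumes both dimensions are exact multiples of `size`.
    """
    height = len(plane)
    width = len(plane[0])
    blocks = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            # Take `size` rows, and from each row `size` consecutive values.
            block = [row[left:left + size] for row in plane[top:top + size]]
            blocks.append(block)
    return blocks
```

Each block returned is then handled with no reference to its neighbors.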
The Fourier series
If we have, for example, a recurrent electrical waveform (a variation of voltage with time), we can consider it to be composed of a collection of sine and cosine functions of various frequencies, all integer multiples of the frequency of recurrence of the waveform (said to have "harmonically-related" frequencies). This is called the Fourier series representation of a recurrent waveform.
By stating the amplitudes of all these sine and cosine components, we can completely describe the waveform. Thus, for example, instead of actually transmitting the waveform to a distant point, we could transmit (perhaps in digital form) the amplitudes of all the sine and cosine functions that can be considered to constitute it. At the distant end, from this set of "coefficients", we can reconstruct the waveform.
If the waveform has a certain time symmetry, the amplitudes of all the sine functions it might contain are in fact zero. Thus in such a case we can completely represent the waveform by stating the amplitudes of all the "harmonically-related" cosine functions it "contains".
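In symbols (the notation here is mine, chosen only for illustration): if the waveform f(t) recurs with frequency f_0, its Fourier series can be written with cosine amplitudes a_n and sine amplitudes b_n. The symmetry in question is even symmetry about t = 0, that is, f(-t) = f(t), and in that case every b_n is zero.

```latex
% General Fourier series of a waveform recurring at frequency f_0
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}
       \left[ a_n \cos(2\pi n f_0 t) + b_n \sin(2\pi n f_0 t) \right]

% With even symmetry, f(-t) = f(t), all b_n = 0 and only cosines remain
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos(2\pi n f_0 t)
```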
In some cases we do not have the waveform as a "continuous" variable but only have its value at repetitive instants - usually the case today with digital audio or video. In that case, the process used to determine the coefficients is called the "discrete Fourier transform" (DFT).
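A small Python sketch of the sampled case, using only the standard library. It computes the cosine and sine sums of the DFT directly (without normalization) and is meant to show the principle rather than an efficient method; the function name and the sample values are made up for the illustration.

```python
import math


def dft_cos_sin(samples):
    """Return (cosine_sums, sine_sums) of the DFT of a list of samples.

    Component k has k cycles across the whole set of samples.
    No normalization is applied.
    """
    n = len(samples)
    cos_sums = []
    sin_sums = []
    for k in range(n):
        cos_sums.append(sum(x * math.cos(2 * math.pi * k * i / n)
                            for i, x in enumerate(samples)))
        sin_sums.append(sum(x * math.sin(2 * math.pi * k * i / n)
                            for i, x in enumerate(samples)))
    return cos_sums, sin_sums


# Eight samples of one recurrence with even symmetry: sample i equals sample 8 - i.
even_samples = [5, 3, 1, 0, -1, 0, 1, 3]
cos_sums, sin_sums = dft_cos_sin(even_samples)
```

For these symmetric samples every entry of sin_sums comes out as zero (to within floating-point error), so the cosine sums alone describe the samples.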
Hold those thoughts.
The role of the discrete cosine transform (DCT)
Introduction
In the JPEG encoding system, the set of Y values for the pixels in an 8 × 8 block, the set of Cb values, and the set of Cr values are handled in separate operations, following essentially the identical procedure. For convenience here, I will speak of the Y values.
I will also, for the next little while, ask you to consider a "block" of 8 × 1 pixels, as this will make for a clearer presentation of the principles.
This block of 8 pixels is just like a small segment of an electrical waveform described as discrete samples. Accordingly, we can describe the variation in Y along this little block in terms of the amplitudes of several cosine functions, with "harmonically-related" spatial frequencies (that is, frequencies that are all integer multiples of some basic frequency), which together would make up the pattern of variation in Y.
Eight cosine functions
In the JPEG case, we actually use eight cosine functions, with frequencies from 0 to 7 times the base frequency, which together reproduce the values of Y at the 8 pixels of the line (exactly so, until the coefficients are rounded or quantized). The amplitudes of those 8 functions are the 8 "coefficients" which now describe that variation in Y, taking the place of the 8 Y values themselves. This representation is called the "discrete cosine transform" (DCT) of the variation of the value of Y itself.
What is the significance of a cosine function with frequency 0 (the first one of the collection)? This represents the average of all the Y values.
If this were an electrical waveform, the electrical engineer would call this the "DC component", and in fact in the JPEG literature that same term is used.
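Here is a minimal Python sketch of the transform for one line of 8 values. It assumes the DCT-II form with orthonormal scaling, which as far as I know matches what JPEG uses along each direction of the block; the subtraction of 128 that JPEG applies to the values first is left out. The function names and sample values are only for illustration.

```python
import math

N = 8  # pixels along the line


def _scale(k):
    """Orthonormal scale factor for component k (an assumed convention)."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct_1d(values):
    """Forward DCT-II: 8 values in, 8 coefficients out."""
    return [
        _scale(k) * sum(v * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                        for n, v in enumerate(values))
        for k in range(N)
    ]


def idct_1d(coeffs):
    """Inverse transform: rebuild the 8 values from the 8 coefficients."""
    return [
        sum(_scale(k) * c * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k, c in enumerate(coeffs))
        for n in range(N)
    ]


y_values = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct_1d(y_values)
rebuilt = idct_1d(coeffs)
```

With this scaling, coeffs[0] (the DC component) is the average of the 8 values multiplied by the square root of 8, and rebuilt matches y_values to within floating-point error. A line of 8 identical values gives a DC component only; all seven of the other coefficients are zero.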
The base spatial frequency
Now, what is the base spatial frequency here (the non-DC cosine functions have from 1 to 7 times its frequency)?
Recall that we can only represent, by the array of pixel values, spatial frequencies in the image path along the pixels that are less than the Nyquist frequency, which is 1/(2p), where p is the pixel spacing.
Now, in the JPEG DCT, the base frequency (that of cosine component "1") has frequency 1/(16p). Thus the highest frequency cosine component (number "7") has frequency 7/(16p).
This is 7/8 of the Nyquist frequency. We know that for various reasons (including the matter of the Kell factor) the highest resolution we can typically get in a digital sensor system is about 3/4 times the Nyquist frequency. Thus the highest available cosine component is more than adequate in frequency to represent the highest frequency component that might usefully appear in the image track.
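The arithmetic can be checked with a few lines of Python. The sketch assumes the DCT-II form, in which component k varies as cos((2n+1)kπ/16) at pixel n: moving one pixel advances the phase by kπ/8, so a full cycle takes 16/k pixels and the frequency is k/(16p). Frequencies below are expressed in cycles per pixel spacing, so Fraction(1, 16) stands for 1/(16p). The names are hypothetical.

```python
from fractions import Fraction

BLOCK = 8  # pixels along the line

# Nyquist frequency, 1/(2p), in cycles per pixel spacing
nyquist = Fraction(1, 2)


def component_frequency(k):
    """Spatial frequency of cosine component k, in cycles per pixel spacing."""
    return Fraction(k, 2 * BLOCK)


# Each component's frequency as a fraction of the Nyquist frequency
ratios = [component_frequency(k) / nyquist for k in range(BLOCK)]

for k, ratio in enumerate(ratios):
    print(f"component {k}: {component_frequency(k)} cycles/pixel, "
          f"{ratio} of Nyquist")
```

Component 7 comes out at 7/8 of the Nyquist frequency, comfortably above the 3/4 figure mentioned above.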
[Continued in part 2]