
The discrete cosine transform (DCT) in JPEG encoding

Doug Kerr

Well-known member
In the JPEG system for encoding an image, a key role is played by the discrete cosine transform (DCT). Here I will give some insight into what this is, how it works, and some implications of the specific way it is employed in JPEG encoding.

Background

The color space of JPEG

Most often the image presented to a JPEG encoder is in the sRGB color space. The color of each pixel is described by the values of three coordinates, R, G, and B. These are nonlinear presentations of the quantities of three "primaries" which, combined, would make the color of interest.

JPEG encoding works with a different color space, directly relatable to sRGB. Here again, the color of each pixel is described by the values of three coordinates, here Y, Cb, and Cr.

Y is a fixed linear combination of R, G, and B. It is something like luminance (in fact, its symbol, Y, is the symbol for luminance), but it is not luminance. (For it to truly correspond to luminance, it would have to be a fixed linear combination of r, g, and b, which directly represent the quantities of the three primaries in linear form).

Cb is derived from B-Y, and Cr from R-Y (each scaled to fit the coding range). These two together describe something like the chrominance of the pixel color (although since they are derived from the nonlinear variables B, R, and, through Y, G, they do not form any standard definition of chrominance).

Y, Cb, and Cr ordinarily are represented by 8-bit integers. Thus we see there is the opportunity for rounding error as we convert from the RGB color space to the YCbCr color space during JPEG encoding and then as we convert back after decoding.
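That conversion, and the rounding it entails, can be sketched in code. A minimal sketch in Python using the JFIF-convention coefficients (an actual encoder's constants may differ slightly); the round trip shows the rounding error just mentioned:

```python
def rgb_to_ycbcr(r, g, b):
    """Forward conversion per the JFIF convention (8-bit, Cb/Cr offset by 128)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return round(y), round(cb), round(cr)

def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion; rounding here and above can lose a step or two."""
    r = y + 1.402    * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772    * (cb - 128)
    return round(r), round(g), round(b)

# Round-trip one pixel: the result may differ slightly from the original
print(rgb_to_ycbcr(200, 120, 50))                 # → (136, 80, 174)
print(ycbcr_to_rgb(*rgb_to_ycbcr(200, 120, 50)))  # → (200, 120, 51)
```

Note that blue comes back as 51, not 50: the rounding error spoken of above.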

The pixel block

In JPEG encoding, the image is divided into square blocks of 8 × 8 pixels. Each block is processed independently, leading to a set of data which approximately describes the color of all the pixels in the block.

The Fourier series

If we have, for example, a recurrent electrical waveform (a variation of voltage with time), we can consider it to be composed of a collection of sine and cosine functions of various frequencies, all integer multiples of the frequency of recurrence of the waveform (said to have "harmonically-related" frequencies). This is called the Fourier series representation of a recurrent waveform.

By stating the amplitudes of all these sine and cosine components, we can completely describe the waveform. Thus, for example, instead of actually transmitting the waveform to a distant point, we could transmit (perhaps in digital form) the amplitudes of all the sine and cosine functions that can be considered to constitute it. At the distant end, from this set of "coefficients", we can reconstruct the waveform.

If the waveform has a certain time symmetry, the amplitudes of all the sine functions it might contain are in fact zero. Thus in such a case we can completely represent the waveform by stating the amplitudes of all the "harmonically-related" cosine functions it "contains".

In some cases we do not have the waveform as a "continuous" variable but only have its value at repetitive instants - usually the case today with digital audio or video. In that case, the process used to determine the coefficients is called the "discrete Fourier transform" (DFT).
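As a toy illustration of that last idea, here is a naive DFT and its inverse in Python. This is not the algorithm JPEG uses; it only demonstrates that the set of coefficients fully describes the samples:

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform of N samples."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(coeffs):
    """Inverse: rebuild the samples from the coefficients."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

wave = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
rebuilt = idft(dft(wave))
# Reconstructs the original samples (to within floating-point error)
print([round(x, 6) for x in rebuilt])
```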

Hold those thoughts.

The role of the discrete cosine transform (DCT)

Introduction

In the JPEG encoding system, the set of Y values for the pixels in an 8 × 8 block, the set of Cb values, and the set of Cr values are handled in separate operations, following essentially the identical procedure. For convenience here, I will speak of the Y values.

I will also for the next little while ask you to consider a "block" of 8 × 1 pixels, as this will make for a clearer presentation of the principles.

This block of 8 pixels is just like a small segment of an electrical waveform described as discrete samples. Accordingly, we can describe the variation in Y along this little block in terms of the amplitudes of several cosine functions, with "harmonically-related" spatial frequencies (that is, frequencies that are all integer multiples of some basic frequency), which together would make up the pattern of variation in Y.

Eight cosine functions

In the JPEG case, we actually use eight cosine functions, with frequencies from 0 to 7 times the base frequency, which together very nearly compose the actual variation in Y along the 8 pixels of the line. The amplitudes of those 8 functions are the 8 "coefficients" which now describe that variation in Y, taking the place of the 8 Y values themselves. This representation is called the "discrete cosine transform" (DCT) of the variation of the value of Y itself.

What is the significance of a cosine function with frequency 0 (the first one of the collection)? This represents the average of all the Y values.
If this were an electrical waveform, the electrical engineer would call this the "DC component", and in fact in the JPEG literature that same term is used.
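A minimal Python sketch of this 8-point DCT, using the normalization of the JPEG specification's FDCT reduced to one dimension. The sample values are made up; note how coefficient 0 is just a scaled average:

```python
import math

def dct_1d(samples):
    """8-point DCT-II: S(u) = (1/2) C(u) sum of s(x) cos((2x+1)u*pi/16),
    where C(0) = 1/sqrt(2) and C(u) = 1 otherwise."""
    n = len(samples)
    out = []
    for u in range(n):
        c = math.sqrt(0.5) if u == 0 else 1.0
        s = sum(samples[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                for x in range(n))
        out.append(0.5 * c * s)
    return out

row = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical Y values for 8 pixels
coeffs = dct_1d(row)
# Coefficient 0 equals the average of the samples times 2*sqrt(2):
print(coeffs[0], 2 * math.sqrt(2) * sum(row) / len(row))  # the two values agree
```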

The base spatial frequency

Now, what is the base spatial frequency here (the non-DC cosine functions have from 1 to 7 times its frequency)?

Recall that an array of pixel values can only represent spatial frequencies, along the line of pixels, that are less than the Nyquist frequency, which is 1/(2p), where p is the pixel spacing.

Now, in the JPEG DCT, the base frequency (that of cosine component "1") is 1/(16p). Thus the highest-frequency cosine component (number "7") has frequency 7/(16p).

This is 7/8 the Nyquist frequency. We know that for various reasons (including the matter of the Kell factor) the highest resolution we can typically get in a digital sensor system is about 3/4 times the Nyquist frequency. Thus the highest available cosine component is more than adequate in frequency to represent the highest frequency component that might usefully appear in the image track.

[Continued in part 2]
 

Doug Kerr

Well-known member
[Part 2]

The block is actually two-dimensional

Above for simplicity we considered a "block" of pixels 8 × 1 pixel. We saw that we could replace the set of 8 Y values with a set of 8 coefficients which described the set of 8 cosine functions which together would describe the same variation in Y.

But in reality we work with a block of 8 × 8 pixels. And in fact our discrete cosine transform comprises 64 coefficients (usually presented in an 8 × 8 array, although its cells do not correspond to pixels in the source block).

Each cell in this array holds a coefficient describing the "potency" of a two-dimensional pattern of the variation of Y with certain frequencies in the horizontal and vertical directions.

By the way, the upper left-hand cell of the array is the "overall DC component", now the average of all the Y values of the block. This function has no variation at all.

Thus, at position 3, 4 of this array is a coefficient that tells the "potency" of a two-dimensional pattern which:

• Varies in the horizontal direction according to a cosine function whose spatial frequency is 3 times the base frequency and

• Varies in the vertical direction according to a cosine function whose spatial frequency is 4 times the base frequency

It is the overlay of these 64 two-dimensional patterns (one of them trivial), each with its own "potency", which now describes the variation in Y over the 64 pixels of the block.
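A sketch of how one of those two-dimensional patterns can be generated; normalization is omitted here, leaving just the raw cosine-times-cosine shape:

```python
import math

def basis_pattern(u, v, n=8):
    """The n x n cosine pattern for horizontal index u, vertical index v:
    cell (x, y) holds cos((2x+1)u*pi/2n) * cos((2y+1)v*pi/2n)."""
    return [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) *
             math.cos((2 * y + 1) * v * math.pi / (2 * n))
             for x in range(n)] for y in range(n)]

# The (0, 0) pattern is flat: every entry is 1 (the "DC" pattern)
flat = basis_pattern(0, 0)
print(all(abs(cell - 1.0) < 1e-12 for row in flat for cell in row))  # → True

# The (3, 4) pattern varies at 3x the base frequency horizontally
# and 4x vertically, as described above
pattern = basis_pattern(3, 4)
```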

The transform process itself

The transformation of the set of 64 Y values into 64 cosine term coefficients is done by multiplying the set of 64 Y values (as a matrix) by a matrix that is itself based on certain cosine functions.

The Y values are held to a precision of 8 bits, but the transform matrix may have a substantially higher precision, and the result (the set of coefficients) is held (for the moment) to a fairly high precision.
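As a sketch, here is the matrix form in Python: an orthonormal DCT matrix T applied as T times X times T-transpose, which matches the JPEG FDCT normalization. For a uniform block, only the DC coefficient survives:

```python
import math

def dct_matrix(n=8):
    """Orthonormal n x n DCT-II matrix (rows are the cosine functions)."""
    t = []
    for u in range(n):
        c = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        t.append([c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                  for x in range(n)])
    return t

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

t = dct_matrix()
block = [[128] * 8 for _ in range(8)]             # a uniform grey block
coeffs = matmul(matmul(t, block), transpose(t))   # 2-D DCT: T . X . T'
# Only the (0, 0) "DC" coefficient is nonzero for a uniform block
print(round(coeffs[0][0], 3))                     # ≈ 1024 (8 times the mean of 128)
```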

What is the point?

What is the value in replacing 64 8-bit Y values with 64 coefficients that each are held in more than 8 bits? The total data load here is substantially more than for the actual source data.

We get a clue when we examine a typical set of 64 coefficients.

As we move to the right and down in the array, we encounter coefficients that describe the presence of higher and higher spatial frequencies in the image, and in almost all actual image cases, many (even most) of those coefficients are much smaller than the coefficients in the upper-left portion of the array (which describe lower-spatial-frequency components).

Quantization

For the moment we have all these coefficient values held in a large number of bits. But we will soon round them to a more modest number of bits, and not the same number of bits for all coefficients.

Because the higher-frequency coefficients are (mostly) smaller, we can represent them to the same absolute precision with a smaller number of bits.

And in fact when we do that, many will round to 0. And the clever multi-stage scheme of encoding that is subsequently used means that essentially these will require almost 0 bits each to represent.

And it is from this that most of the data "compression" of JPEG comes.

The quantization tables

The rounding of the calculated coefficients to differing levels of precision (different numbers of bits, we might say) is done under control of a quantizing table. (There is one for Y and a usually-different one for Cb and Cr.) The table is an 8 × 8 array, whose cells correspond to the 64 coefficients derived by the initial DCT calculation.

For each coefficient, the initially-calculated value is divided by the value in the corresponding cell of the quantization table, and the result rounded to an integer. The higher the value in the table cell, the greater the reduction in precision (the smaller the rounded result and thus the fewer significant bits in it).
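In code, this quantization step is just an element-wise divide-and-round. Below, a sketch using the first row of the specification's illustrative luminance table; the coefficient values are made up but typical in shape (large at low frequencies, small at high):

```python
# Hypothetical DCT coefficients for one block row: big low-frequency
# terms on the left, small high-frequency terms on the right
coeffs = [-415.4, -30.2, -61.2, 27.2, 56.1, -20.1, -2.4, 0.5]
# First row of the JPEG spec's illustrative luminance quantization table
qrow   = [16, 11, 10, 16, 24, 40, 51, 61]

# Divide each coefficient by its table entry and round to an integer
quantized = [round(c / q) for c, q in zip(coeffs, qrow)]
print(quantized)   # → [-26, -3, -6, 2, 2, -1, 0, 0]
```

The small high-frequency terms collapse to 0, which the later entropy coding represents almost for free.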

The JPEG specification gives an "illustrative" table, which calls for a fixed degree of reduction in precision at each of the coefficient positions. Its values were determined by examination of many images and a conclusion as to what pattern of quantization gives the best "overall result" over all of them.

Of course the encoder manufacturer is free to determine a "more nearly optimal table" for each image, if desired, based on a preliminary analysis of the image.

In any case, the table(s) used are carried in the file header, so the decoder can behave accordingly when the image is being reconstructed.

The succeeding multi-stage encoding

The succeeding multi-stage encoding is a fascinating story in its own right. But that is for another time. It's almost time for my nap now.

Best regards,

Doug
 

Doug Kerr

Well-known member
This is a nice graphic presentation of the 64 two-dimensional patterns used in the JPEG DCT system:

[Image: DCT.jpg, showing the 64 two-dimensional DCT basis patterns]

Thanks to David Marshall

The upper-left pattern (location 0,0) has no variation over the pixels of the block. The coefficient for that pattern is the "DC component" (overall mean value of Y, Cb, or Cr for the block). (There is actually a scaling factor involved.)

The pattern at location 1,0 has a variation at 1X the base frequency in the horizontal direction and no variation in the vertical direction.

The pattern at location 0,1 has no variation in the horizontal direction and a variation at 1X the base frequency in the vertical direction.

The pattern at location 1,1 has a variation at 1X the base frequency in the horizontal direction and a variation at 1X the base frequency in the vertical direction.

The pattern at 2,3 has a variation at 2X the base frequency in the horizontal direction and a variation at 3X the base frequency in the vertical direction.

The pattern at 6,7 has a variation at 6X the base frequency in the horizontal direction and a variation at 7X the base frequency in the vertical direction.

The set of 64 coefficients tells us with what "potency" these 64 patterns should be combined to match the actual variation of Y (or Cb, or Cr) over the 64 cells of the pixel block being encoded.

Note that the patterns themselves are discrete (64 values over the entire pattern). They can only have meaning at 64 locations, since those are the only places that there are pixels whose values we wish to describe.

Neat, wot?

By the way, Dave's paper indicates that the initially-computed values of the coefficients are held as integers to a precision of 11 bits (range -1024 through +1023).

http://www.cs.cf.ac.uk/Dave/Multimedia/node231.html

Best regards,

Doug
 

Doug Kerr

Well-known member
A further refinement

Jerome Marot recently called to my attention an aspect of the JPEG quantization scheme of which I had previously been unaware.

An extension to the original JPEG specification makes provision for what we can think of as a table with an entry for each 8 × 8 block. The value of that entry scales the entire quantization table as it will be applied to processing that block.

The purpose is to allow "tuning" the quantization to recognize that in different blocks, by virtue of their content, the effects of a certain "rounding" regimen may be differently perceived by the viewer.

Reading that in the opposite direction, for some blocks we may be able to apply a more "brutal" quantization (and thus further reduce the amount of data required to encode the block) while still achieving our target level of perceptual impact.

According to the author of one paper on this matter, properties of a block that influence "how brutal" a quantization is tolerable include the average luminance of the block and certain aspects of the contrast within the block.

Thus, the author points out, if, before actually processing the image per the JPEG algorithm, we first do an assessment of each block to ascertain how tolerant it is of "more brutal" quantization, we can, without damage to the overall perceived image quality, apply, on the average, more brutal quantization. The benefit is a smaller amount of data to represent the entire image.
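The idea can be sketched as follows. This is only an illustration of the concept: the function name and the per-block scale values are made up, and the actual extension encodes things differently, but the effect of scaling the shared table is the same:

```python
def quantize_block(coeffs, qtable, scale):
    """Quantize one block, scaling the shared table by a per-block factor.
    A larger scale means more brutal quantization: more coefficients hit zero."""
    return [[round(c / (q * scale)) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]

# Single-row example (made-up coefficients, the spec's first illustrative
# luminance-table row): the same data, quantized gently and then brutally
coeffs = [[-415.4, -30.2, -61.2, 27.2, 56.1, -20.1, -2.4, 0.5]]
qtable = [[16, 11, 10, 16, 24, 40, 51, 61]]
print(quantize_block(coeffs, qtable, 1.0))  # → [[-26, -3, -6, 2, 2, -1, 0, 0]]
print(quantize_block(coeffs, qtable, 2.0))  # → [[-13, -1, -3, 1, 1, 0, 0, 0]]
```

With the doubled scale, more coefficients collapse to zero, so the block costs fewer bits to encode.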

One price is the space for the "table" that "particularizes" the overall quantization regime block-by-block, of course along with the additional processing required.

The "table" is actually implemented by a system that records where, in a sequence of the blocks in "raster" order, the "quantization control coefficient" changes. Thus, typically, there is not an actual "entry" in this data structure for each block.

The further details of this scheme are not yet known to me.

Whether or not to exploit this capability at all, and how to determine a plan of quantization optimization for the image at hand, is not prescribed by the JPEG standard. It is up to the developer of the specific JPEG encoder to decide such things.

Any JPEG decoder that fully complies with the standard can "play along" and thus attain the benefits to be expected from the encoder developer's scheme.

I do not have any information about how, if at all, various JPEG encoders exploit this provision of the standard.

Best regards,

Doug
 