Floating point numbers

Doug Kerr

Various color spaces specially suited for representing HDR images use floating-point representation of their color coordinates.

Although I've been generally familiar with the floating-point concept, I had not until recently been aware of the details of the use of the principle in modern computer science. But I've just lately boned up, and I thought I would pass on my understanding to my colleagues.

**********

The object

The object of floating-point representation of numeric values is to allow a very large range of values to be represented, to a generally-constant percentage precision, with a modest number of digits.

Scientific notation

What is often called scientific notation is a familiar type of floating point representation with a decimal base.

Here is an example:

1.0632 x 10^-6

This means 0.0000010632.

7.4592 x 10^+5 means 745920.

The 1.0632 is called the significand (in modern terminology; the earlier term is mantissa), since it contains the significant digits, and the "-6" is (not surprisingly) called the exponent.

In the usual practice, for a 5 significant-digit scheme, the range of the significand is:

±1.0000 to ±9.9999

Suppose that in our scheme, we allow the range of the exponent to be:

-9 through +9

Then the largest number we can represent is:

+9.9999 x 10^+9 [9999000000]

and the smallest positive number we can represent is:

+1.0000 x 10^-9 [0.0000000010000]

This scheme requires 6 decimal digits (5 for the significand and 1 for the exponent), plus two signs (one for the significand and one for the exponent).

The relative precision (the spacing between two consecutive representable values, as a fraction of the value itself) varies with the significand. For a mid-range significand (about 5.0000), the precision is 0.002%.
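This decimal scheme can be sketched with Python's decimal module, which lets us set the precision and exponent range directly (the context settings below are my choice, matching the 5-digit, exponent ±9 scheme above):

```python
from decimal import Decimal, Context

# Model the 5-significant-digit decimal scheme: prec=5 digits,
# exponent range -9 through +9.
ctx = Context(prec=5, Emin=-9, Emax=9)

largest = ctx.create_decimal("9.9999E+9")   # largest representable value
smallest = ctx.create_decimal("1.0000E-9")  # smallest positive normal value

# Relative precision near a mid-range significand: consecutive values
# 5.0000 and 5.0001 differ by 0.0001, which is 0.002% of the value.
step = Decimal("5.0001") - Decimal("5.0000")
relative = step / Decimal("5.0000")         # 0.00002, i.e. 0.002%
```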

In computer programming, and often informally in other scientific work, a more typographically convenient form of the notation is used:

1.0632E-06 [for 1.0632 x 10^-6]
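Most programming languages accept this E-notation directly; a quick check in Python:

```python
# E-notation is understood directly by float() and by numeric literals.
x = float("1.0632E-06")
# Both spellings denote the same real value, so they parse to the
# same floating-point number.
assert x == 1.0632e-06
assert x == 0.0000010632
```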

In binary form

In computers and elsewhere, we often use a binary form of floating point representation. Here's an example.

We will use a significand in this range (in binary):

±1.0000000 through ±1.1111111 (yes, that is a binary point)

and an exponent in this range:

±1111 (±15 in decimal)

Thus, the largest number we can represent is:

1.1111111 x 2^15 (the significand in binary, the rest in decimal).

(That is 1.9921875 x 32768, or 65280, in decimal.)

The smallest positive number we can represent is:

1.0000000 x 2^-15

(That is 0.00003051757813 in decimal.)

This scheme requires a total of 14 bits: 8 for the significand, 4 for the exponent, and 1 sign bit for each.
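The extremes of this toy 14-bit scheme are easy to check with ordinary arithmetic (the names below are just for illustration):

```python
# 8-bit significand of the form 1.fffffff, 4-bit exponent magnitude,
# plus one sign bit for each.
max_significand = 1 + 127 / 128       # 1.1111111 in binary = 1.9921875
largest = max_significand * 2 ** 15   # 65280.0
smallest = 1.0 * 2 ** -15             # 0.000030517578125
```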

We can see that some special code will have to be used to represent zero (I won't say just now what that might be).

A wrinkle

An important wrinkle can perhaps best be illustrated in our decimal version.

The three smallest non-zero positive numbers that can be represented are:

+1.0002 x 10^-9 [0.0000000010002]
+1.0001 x 10^-9 [0.0000000010001]
+1.0000 x 10^-9 [0.0000000010000]

But the next lower representable number is:
<special code> [0.0000000000000]

So the values just above zero are separated by:

0.0000000000001

but the lowest of them is separated from the next lowest value (zero) by:

0.0000000010000

so it is as if there is a really big "dead zone" near zero.

This can cause a problem in calculations. If we subtract two distinct small non-zero values:

+0.0000000010706

minus

+0.0000000010501

we should get:

+0.0000000000205

But there is no representation for that. So it would have to be given the special code that means:

+0.0000000000000

Then, even though the subtraction produces a non-zero answer, that is lost. This can result in divide-by-zero faults when there should not be such, and so forth.

Thus, some special coding is needed to continue the list of representable values from the lowest normal non-zero value, at the same spacing used between the lowest normal non-zero values, all the way down to zero.

These specially-represented values are called denormal values. I won't show how we might encode them in a decimal floating-point system, since our concern with details will actually be in the binary realm.
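Python's decimal module happens to implement gradual underflow (denormal values), so it can be used to model the decimal scheme above and show the subtraction from the text surviving rather than collapsing to zero (the context settings are my choice, matching the 5-digit, exponent ±9 scheme):

```python
from decimal import Context

# The same 5-digit, exponent -9..+9 decimal scheme. Because the decimal
# module supports denormal (subnormal) results, the small difference
# below is preserved instead of being flushed to zero.
ctx = Context(prec=5, Emin=-9, Emax=9)
a = ctx.create_decimal("1.0706E-9")
b = ctx.create_decimal("1.0501E-9")
d = ctx.subtract(a, b)   # 2.05E-11, a denormal value, not zero
```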

The binary realm

In the binary world, there are a number of floating point schemes that have been standardized. An important family, normally used in computer science contexts, is specified by IEEE standard 754.

It defines several schemes, giving different precisions and ranges through different numbers of bits. I will describe here the 32-bit version ("single precision").

First, let's reexamine our binary example above. I said that in that version, the range of the significand was:

±1.0000000 through ±1.1111111

Note that the units digit is always 1. So what's the point of using a bit to carry it? We can plug it in when we decode the number. Indeed, in this scheme, we would need only 7 bits to carry the significand (not counting its sign bit).

In the IEEE-754 32-bit scheme, the significand is conceptually 24 bits in length, but because the first bit is always 1, we use only a 23-bit field to carry the "fractional part" of the significand.

A separate sign bit is used. There is no notion of "two's complement" or anything like that: 0 means positive, 1 means negative, and the magnitude is given by the significand and exponent.

In this scheme, the range of the exponent is:

-126 through +127

But rather than using a sign bit, we use an 8-bit field, which carries the actual exponent plus 127 (an "exponent offset" of -127: subtract 127 from the field value to get the exponent).

Thus a field value of 0 means an exponent of -127; a field value of 127 means an exponent of 0; and a field value of 254 means an exponent of +127. It looks like we have wasted some values. Well, we have a couple of "funny things" to do.
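We can pull the three fields out of an actual single-precision value with Python's struct module (the helper name below is my own):

```python
import struct

def float32_fields(x):
    """Split an IEEE-754 single-precision value into its three bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF  # 8 bits: actual exponent + 127
    fraction = bits & 0x7FFFFF      # 23 bits: significand minus the leading 1
    return sign, exponent, fraction

# 1.0 is +1.0 x 2^0: exponent field holds 0 + 127 = 127, fraction is 0.
# -2.0 is -1.0 x 2^1: sign is 1 and the exponent field holds 1 + 127 = 128.
```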

Funny things

We have to do something funny to represent the value 0, since the significand cannot become 0.

For the value 0, we code all 0s in the fraction field (which would ordinarily mean a significand proper of 1.0000) and 0 in the exponent field. This whole thing is recognized as meaning a value 0. (The significand sign bit still works, so we can have +0 and -0 if that does anything for us.)

What would that set of field encodings ordinarily mean?

1.0000000000... x 2^-127.

That is outside the legitimate exponent range of the scheme, so it would never be used with that meaning.
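A quick way to see the zero encoding is to pack zero with Python's struct module and look at the raw bytes:

```python
import struct

# Zero is "all fields zero"; only the sign bit distinguishes +0 from -0.
pos_zero = struct.pack(">f", 0.0)    # all 32 bits are 0
neg_zero = struct.pack(">f", -0.0)   # only the sign bit is set
```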

The next funny thing we have to do is to encode the denormal values - values smaller than the smallest (in absolute terms) value that can be represented in the normal way.

For those the exponent field also carries 0, and the fraction field carries a non-zero value.

This combination means:

• Do not add 1 to the value in the fraction field to get the significand; just use the fraction value as is.

• Consider the exponent to be -126.

The significand sign bit still works so we can fill both the positive and negative gaps.
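Decoding two hand-built bit patterns shows the denormal rule at work (the byte strings are big-endian 32-bit patterns I have constructed for illustration):

```python
import struct

# Smallest positive denormal single: exponent field 0, fraction 1.
# Per the rules above it means (1/2^23) x 2^-126 = 2^-149.
tiniest = struct.unpack(">f", b"\x00\x00\x00\x01")[0]

# Smallest *normal* single, for comparison: exponent field 1, fraction 0,
# meaning 1.0 x 2^-126.
smallest_normal = struct.unpack(">f", b"\x00\x80\x00\x00")[0]
```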

While we are doing funny things, let's make positive and negative infinity.

For either sign of infinity we put 255 in the exponent field (never used before) and all zeroes in the fraction field. The significand sign bit still works so we can have both positive and negative infinity.

Now, one last funny. We wish to represent the "value" "not a number" (NaN), used to record a meaningless result of a calculation or such, or to just indicate (in data transmission) "no data here yet".

For that we put 255 in the exponent field and anything except zero in the fraction field. The significand sign bit is allowed to have either value, but that rarely has any significance.
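Both of these encodings can be checked from Python; infinity and NaN pack to exactly the patterns described (the NaN byte string below is one common quiet-NaN pattern, chosen for illustration):

```python
import math
import struct

# Infinity: exponent field 255 (0xFF shifted into bits 30..23), fraction 0.
# The sign bit gives +inf and -inf.
assert struct.pack(">f", math.inf) == b"\x7f\x80\x00\x00"
assert struct.pack(">f", -math.inf) == b"\xff\x80\x00\x00"

# NaN: exponent field 255 with any non-zero fraction.
nan = struct.unpack(">f", b"\x7f\xc0\x00\x00")[0]
assert math.isnan(nan)
```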

The 16-bit form

In the 16-bit form ("half-precision"), the fraction field is 10 bits in size (so the significand itself has 11 bits), and there is a sign bit for the significand. The exponent field has five bits. The range of actual exponent values is -14 through +15 (the exponent offset is -15). The "funnies" work just as in the 32-bit form.

The largest (absolute) representable values are ±65504.
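Python's struct module also supports half precision (format code "e"), so both the largest value and the arithmetic behind it can be verified:

```python
import struct

# 65504 = (2 - 2**-10) x 2**15: the largest significand (1.1111111111
# in binary) times the largest exponent (+15).
assert (2 - 2 ** -10) * 2 ** 15 == 65504

# Round-tripping through the 16-bit "e" format preserves it exactly.
largest_half = struct.unpack("<e", struct.pack("<e", 65504.0))[0]
```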


Best regards,

Doug