Floating-point multiplication
28. Nov '13
Introduction
In computers, real numbers are represented in a floating-point format. Usually this means that the number is split into an exponent and a fraction, the latter also known as the significand or mantissa:

value = fraction × base^exponent
The mantissa is within the range 0 .. base. Usually 2 is used as the base, which means that the mantissa has to be within 0 .. 2. In the case of normalized numbers the mantissa is within the range 1 .. 2, to take full advantage of the precision this format offers.
For instance, pi can be rewritten as follows:

pi ≈ 1.5707963 × 2^1
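This decomposition can be checked with a few lines of Python; a quick sketch of mine using math.frexp, which normalizes the mantissa into 0.5 .. 1 rather than 1 .. 2:

import math

# math.frexp() splits a float into mantissa and exponent, but normalizes
# the mantissa into the range 0.5 .. 1 instead of 1 .. 2, so shift by one
# to match the convention used above.
m, e = math.frexp(math.pi)        # m = 0.7853981633974483, e = 2
mantissa, exponent = m * 2, e - 1

print(mantissa, exponent)         # 1.5707963267948966 1
print(mantissa * 2 ** exponent)   # 3.141592653589793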
Single-precision floating point numbers
Most modern computers use the IEEE 754 standard to represent floating-point numbers. One of the most commonly used formats is the binary32 format of IEEE 754:
sign                      fraction/significand/mantissa (23 bits)
 |   exponent (8 bits)  /                                         \
 |   /             \   /                                           \
 0   1 0 0 0 0 0 0 0   1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1
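The bit pattern shown above is pi. It can be reproduced with a small Python sketch that packs the value into binary32 with the struct module and splits out the three fields:

import math
import struct

# Pack pi into binary32 and split the raw bits into the three fields.
bits = int.from_bytes(struct.pack('>f', math.pi), 'big')

sign     = bits >> 31
exponent = (bits >> 23) & 0xff
fraction = bits & 0x7fffff

print(f'{sign:b} {exponent:08b} {fraction:023b}')
# 0 10000000 10010010000111111011011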
Note that the exponent is encoded using an offset-binary (biased) representation, which means it is always off by 127. So while 10000000 in binary would usually be 128 in decimal, in single precision the value of the exponent is:

128 - 127 = 1
The same goes for the fraction bits: while 10010010000111111011011 in binary would usually evaluate to 4788187 in decimal, in single-precision numbers the bit weights are shifted past the binary point, running from 2^-1 down to 2^-23:

4788187 / 2^23 ≈ 0.5707964

Together with the implicit leading 1 of a normalized number this gives a significand of about 1.5707964.
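Putting the bias and the shifted weights together, the value can be decoded by hand; roughly like this in Python (a sketch of mine, with the field values taken from the bit pattern above):

# Decode the binary32 fields of pi: bias of 127 on the exponent, fraction
# bits weighted 2^-1 .. 2^-23, plus the implicit leading 1.
sign     = 0
exponent = 0b10000000                   # 128, so the real exponent is 1
fraction = 0b10010010000111111011011    # 4788187

value = (-1) ** sign * (1 + fraction / 2 ** 23) * 2 ** (exponent - 127)
print(value)    # 3.1415927410125732, the binary32 value closest to pi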
Multiplication of single-precision numbers
Multiplication of such numbers can be tricky. For this example, let's use the following numbers:
Normalized values and biased exponents:
The exponents:
The numbers in IEEE 754 binary32:
The mantissas can be rewritten as follows, totaling 24 bits per operand (the implicit leading 1 plus the 23 fraction bits):
Their product totals 48 bits:
Which has to be truncated to 24 bits:
The exponents 2 and -2 can easily be summed, so the only thing left to do is to normalize the fraction, which means that the resulting number is:
Which could be written in IEEE 754 binary32 format as:
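The whole single-precision walkthrough can be condensed into a rough Python sketch: restore the implicit leading 1 to get 24-bit significands, multiply them into a 48-bit product, truncate back to 24 bits and sum the exponents. The operands below (5.5 and 0.3) are hypothetical stand-ins whose exponents happen to be 2 and -2 as above; zeros, subnormals, infinities and proper rounding are deliberately left out.

import struct

def f32(x):
    # Round a Python float (binary64) to the nearest binary32 value.
    return struct.unpack('>f', struct.pack('>f', x))[0]

def f32_bits(x):
    # Raw 32-bit pattern of x rounded to binary32.
    return int.from_bytes(struct.pack('>f', x), 'big')

def f32_mul(a, b):
    # Multiply two binary32 numbers as described above. A sketch only:
    # the result is truncated rather than rounded, and zeros, subnormals,
    # infinities, NaNs and exponent overflow are not handled.
    abits, bbits = f32_bits(a), f32_bits(b)

    sign = (abits >> 31) ^ (bbits >> 31)
    aexp = ((abits >> 23) & 0xff) - 127      # unbiased exponents
    bexp = ((bbits >> 23) & 0xff) - 127
    asig = (abits & 0x7fffff) | 0x800000     # 24-bit significands with
    bsig = (bbits & 0x7fffff) | 0x800000     # the implicit 1 restored

    prod = asig * bsig                       # up to 48 bits
    exp = aexp + bexp
    if prod & (1 << 47):                     # significand product is >= 2,
        prod >>= 1                           # normalize back into 1 .. 2
        exp += 1
    frac = (prod >> 23) & 0x7fffff           # truncate to 24 bits and drop
                                             # the implicit 1 again
    bits = (sign << 31) | ((exp + 127) << 23) | frac
    return struct.unpack('>f', bits.to_bytes(4, 'big'))[0]

print(f32_mul(5.5, 0.3))         # 1.6499999761581421 (truncated)
print(f32(f32(5.5) * f32(0.3)))  # 1.6500000953674316 (IEEE rounds to
                                 # nearest, so the last bit can differ)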
Multiplication of double-precision numbers
The IEEE 754 standard also specifies a 64-bit representation of floating-point numbers called binary64, also known as the double-precision floating-point format.
sign                 fraction aka significand aka mantissa (52 bits)
 |   exponent        /                                              \
 |   (11 bits)      /                                                \
 |   /         \   /                                                  \
 0   10000000000   1001001000011111101101010100010001000010110100011000
Compared to the binary32 representation, 3 bits are added to the exponent and 29 to the mantissa:
0 10000000000 1001001000011111101101010100010001000010110100011000
0 10000000 10010010000111111011011
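Both patterns can be generated with the struct module; a small sketch of mine:

import math
import struct

# Pack pi as binary64 and binary32 and print the fields of both,
# 1 + 11 + 52 bits versus 1 + 8 + 23 bits.
for fmt, ebits, fbits in (('>d', 11, 52), ('>f', 8, 23)):
    raw = int.from_bytes(struct.pack(fmt, math.pi), 'big')
    sign     = raw >> (ebits + fbits)
    exponent = (raw >> fbits) & ((1 << ebits) - 1)
    fraction = raw & ((1 << fbits) - 1)
    print(f'{sign:b} {exponent:0{ebits}b} {fraction:0{fbits}b}')
# 0 10000000000 1001001000011111101101010100010001000010110100011000
# 0 10000000 10010010000111111011011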
Thus pi can be rewritten with higher precision:

pi ≈ 1.5707963267948966 × 2^1
The multiplication is repeated with the numbers presented earlier:
These yield the following binary64 representations:
The significand operands are 53 bits each (the implicit leading 1 plus the 52 fraction bits):
And their product is 106 bits long:
Which of course means that it has to be truncated to 53 bits:
The exponent is handled as in single-precision arithmetic, thus the resulting number in binary64 format is:
Which converted to decimal is:
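The same scheme works with the binary64 parameters (bias 1023, 53-bit significands, a 106-bit product truncated back to 53 bits); a rough sketch, again with hypothetical operands:

import struct

def f64_fields(x):
    # Binary64 significand (53 bits, implicit leading 1 restored) and
    # unbiased exponent; the exponent bias is 1023 for binary64.
    raw = int.from_bytes(struct.pack('>d', x), 'big')
    sig = (raw & ((1 << 52) - 1)) | (1 << 52)
    exp = ((raw >> 52) & 0x7ff) - 1023
    return sig, exp

asig, aexp = f64_fields(5.5)     # hypothetical operands again
bsig, bexp = f64_fields(0.3)

prod, exp = asig * bsig, aexp + bexp   # the product is up to 106 bits long
if prod >> 105:                        # normalize back into the 1 .. 2 range
    prod, exp = prod >> 1, exp + 1
sig = prod >> 52                       # truncate to 53 bits

print(sig * 2.0 ** (exp - 52))         # 1.65, the double-precision product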
Conclusion
Expected result:
Single-precision result:
Double-precision result:
As can be seen, single-precision arithmetic distorts the result around the 6th fraction digit, whereas the double-precision result diverges around the 15th fraction digit.
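This kind of comparison is easy to reproduce in Python by forcing intermediate values through binary32 with the struct module; a sketch with pi and e as stand-in operands, since any pair of values will show the effect:

import struct

def f32(x):
    # Round a Python float (binary64) to the nearest binary32 value.
    return struct.unpack('>f', struct.pack('>f', x))[0]

a, b = 3.141592653589793, 2.718281828459045   # stand-in operands: pi and e

# The exact product of two binary32 significands is only 48 bits wide, so
# computing it in double and rounding once gives the true binary32 result.
single = f32(f32(a) * f32(b))
double = a * b

print(f'{double:.20f}')
print(f'{single:.20f}')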