Floating-Point Representation: A Deep Dive
IEEE 754 Floating-Point Standard
The IEEE 754 standard defines how floating-point numbers are represented in computers. A number $V$ is expressed as:

$$V = (-1)^s \times M \times 2^E$$
Where:
- $s$ (Sign): Determines the sign of the number: $s = 0$ for positive, $s = 1$ for negative.
- $M$ (Significand/Mantissa): A fractional binary number. It ranges from $1$ to $2 - \epsilon$ for normalized values, or from $0$ to $1 - \epsilon$ for denormalized values, where $\epsilon = 2^{-n}$ for an $n$-bit fraction field.
- $E$ (Exponent): Weights the value by a power of 2, which may be negative.
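For example, $-0.75$ fits this template with $s = 1$, $M = 1.1_2 = 1.5$, and $E = -1$:

$$-0.75 = (-1)^1 \times 1.5 \times 2^{-1}$$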
Common Floating-Point Formats
The following table summarizes the key characteristics of common floating-point formats:
Format | Total Bits | Exponent Bits ($k$) | Fraction Bits ($n$) |
---|---|---|---|
Double | 64 | 11 | 52 |
Float | 32 | 8 | 23 |
FP16 | 16 | 5 | 10 |
BF16 | 16 | 8 | 7 |
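As an illustration (a sketch derived from the formulas above, not part of the original text), the key constants of each format can be computed directly from its exponent width $k$ and fraction width $n$, using a bias of $2^{k-1} - 1$:

```python
# Sketch: derive machine epsilon, the largest normalized value, and the
# smallest positive denormalized value of each format from (k, n).
FORMATS = {"Double": (11, 52), "Float": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}

for name, (k, n) in FORMATS.items():
    bias = 2 ** (k - 1) - 1
    epsilon = 2.0 ** -n                            # gap between 1.0 and the next value
    max_normal = (2 - 2.0 ** -n) * 2.0 ** bias     # exponent field of all ones is reserved
    min_denormal = 2.0 ** (1 - bias) * 2.0 ** -n   # smallest positive (denormalized) value
    print(f"{name:6s} eps={epsilon:.2e}  max={max_normal:.2e}  min_denorm={min_denormal:.2e}")
```

This reproduces, for instance, the well-known FP16 maximum of 65504, and shows how BF16 trades fraction bits for the same exponent range as a 32-bit float.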
Special Value Categories
Floating-point numbers can represent various special values, defined by the exponent ($exp$) and fraction ($frac$) bit fields:
Category | Condition | Value |
---|---|---|
Normalized Values | $exp \neq 00\ldots0$ and $exp \neq 11\ldots1$ | $(-1)^s \times (1 + f) \times 2^{exp - \text{bias}}$ |
Denormalized Values | $exp = 00\ldots0$ | $(-1)^s \times f \times 2^{1 - \text{bias}}$ |
Infinity | $exp = 11\ldots1$, $frac = 00\ldots0$ | $\pm\infty$ |
NaN (Not a Number) | $exp = 11\ldots1$, $frac \neq 00\ldots0$ | NaN |
Where the bias is $2^{k-1} - 1$ (127 for single precision, 1023 for double precision), and $f$ is the fraction field read as a binary fraction, so $0 \le f < 1$.
Denormalized numbers serve two crucial purposes:
- Representation of Zero: They allow for distinct representations of positive ($+0.0$) and negative ($-0.0$) zero, differentiated by the sign bit.
- Representation of Values Close to Zero: They enable the representation of numbers very close to $0.0$, filling the gap between zero and the smallest normalized number.
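As a quick illustration (a sketch, not part of the original text), the `struct` module can build these special single-precision bit patterns and show the values they decode to:

```python
import struct

def bits_to_float(bits):
    # Reinterpret a 32-bit integer as an IEEE 754 single-precision value.
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(bits_to_float(0x00000000))  # +0.0: sign bit 0, all other bits 0
print(bits_to_float(0x80000000))  # -0.0: sign bit 1, all other bits 0
print(bits_to_float(0x00000001))  # smallest positive denormalized value, 2**-149
print(bits_to_float(0x00800000))  # smallest positive normalized value, 2**-126
```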
Example: 6-Bit Floating-Point Format
Let’s illustrate with a 6-bit floating-point format (1 sign bit, 3 exponent bits, 2 fraction bits), giving a bias of $2^{3-1} - 1 = 3$. Here $M$ is the significand, so the value equals $M \times 2^E$:

Description | Bit Representation | $E$ | $M$ | Value |
---|---|---|---|---|
Zero | 0 000 00 | -2 | 0/4 | 0 |
Smallest Positive | 0 000 01 | -2 | 1/4 | 1/16 |
Largest Denormalized | 0 000 11 | -2 | 3/4 | 3/16 |
Smallest Normalized | 0 001 00 | -2 | 4/4 | 4/16 |
One | 0 011 00 | 0 | 4/4 | 1 |
Largest Normalized | 0 110 11 | 3 | 7/4 | 14 |
Infinity | 0 111 00 | - | - | $+\infty$ |
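To double-check the table, here is a small decoder written for this toy format (an illustrative sketch, not code from the original text) that applies the normalized/denormalized rules above:

```python
def decode6(bits):
    """Decode a 6-bit pattern (1 sign, 3 exponent, 2 fraction bits) per the IEEE 754 rules."""
    s = (bits >> 5) & 1
    exp = (bits >> 2) & 0b111
    frac = bits & 0b11
    bias = 2 ** (3 - 1) - 1                # = 3
    if exp == 0b111:                       # all-ones exponent: infinity or NaN
        return float('nan') if frac else (-1) ** s * float('inf')
    if exp == 0:                           # denormalized: E = 1 - bias, M = f
        E, M = 1 - bias, frac / 4
    else:                                  # normalized: E = exp - bias, M = 1 + f
        E, M = exp - bias, 1 + frac / 4
    return (-1) ** s * M * 2.0 ** E

print(decode6(0b0_000_01))  # 0.0625 -> smallest positive (1/16)
print(decode6(0b0_001_00))  # 0.25   -> smallest normalized (4/16)
print(decode6(0b0_110_11))  # 14.0   -> largest normalized
```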
Rounding Modes
When a number cannot be represented exactly, rounding is necessary. The IEEE 754 default is round-to-nearest-even; the common rounding modes are compared below:
Mode | 1.4 | 1.6 | 1.5 | 2.5 | -1.5 |
---|---|---|---|---|---|
Round-to-Even | 1 | 2 | 2 | 2 | -2 |
Round-Toward-Zero | 1 | 1 | 1 | 2 | -1 |
Round-Down (Floor) | 1 | 1 | 1 | 2 | -2 |
Round-Up (Ceiling) | 2 | 2 | 2 | 3 | -1 |
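The standard library makes these modes easy to compare (a small sketch: Python’s built-in `round` uses round-to-even for floats, while `math.trunc`, `math.floor`, and `math.ceil` give the other three):

```python
import math

for x in [1.4, 1.6, 1.5, 2.5, -1.5]:
    print(f"{x:5}: even={round(x):3d}  toward-zero={math.trunc(x):3d}  "
          f"floor={math.floor(x):3d}  ceil={math.ceil(x):3d}")
```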
Floating-Point Operations: Precision and Pitfalls
A significant challenge with floating-point arithmetic is the “big eats small” phenomenon. Consider adding $1.0$ to $2^{32}$ in single precision: the result is exactly $2^{32}$, and the smaller operand is lost.
This occurs due to the following steps in floating-point addition:
- Alignment: Exponents are aligned by shifting the significand of the smaller number until both exponents match.
- Significand Addition: The significands are added.
- Normalization and Rounding: The result is normalized and rounded if necessary.
Precision loss happens during the alignment step when one number is significantly larger than the other.
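The effect is easy to reproduce by forcing each value through single precision (an illustrative sketch using only the standard `struct` module; Python itself computes in double precision, so the round-trip simulates the 32-bit result):

```python
import struct

def as_float32(x):
    # Round a Python float (double precision) to the nearest single-precision value.
    return struct.unpack('>f', struct.pack('>f', x))[0]

big, small = 2.0 ** 32, 1.0
total = as_float32(as_float32(big) + as_float32(small))
print(total == as_float32(big))  # True: rounded back to 32 bits, the 1.0 contributes nothing
```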
Python Representation of Floating-Point Numbers
The following Python code demonstrates how to decompose a float into its significand and exponent:
```python
import struct

def float_to_fe(f):
    # Reinterpret the 32-bit single-precision encoding of f as an unsigned integer.
    packed = struct.pack('>f', f)
    int_val = struct.unpack('>I', packed)[0]
    # Split the word into sign (1 bit), exponent (8 bits), and fraction (23 bits).
    sign = (int_val >> 31) & 1
    exponent = (int_val >> 23) & 0xFF
    mantissa = int_val & 0x7FFFFF
    if exponent == 0xFF:  # all-ones exponent: Infinity or NaN
        if mantissa == 0:
            return "Infinity" if sign == 0 else "-Infinity"
        else:
            return "NaN"
    bias = 127
    if exponent == 0:
        # Denormalized: E = 1 - bias, implied leading bit is 0.
        e = 1 - bias
        mantissa_binary = f"0.{mantissa:023b}"
    else:
        # Normalized: E = exponent - bias, implied leading bit is 1.
        e = exponent - bias
        mantissa_binary = f"1.{mantissa:023b}"
    if sign == 1:
        mantissa_binary = "-" + mantissa_binary
    return f"{mantissa_binary} * 2^{e}"
```
Example: adding $1.0$ to $2^{32}$ in single precision. To align $1.0$ with $2^{32}$, its significand must be right-shifted by 32 bits. Because single-precision floats carry only 23 fraction bits, $1.0$ effectively becomes $0$, and the sum rounds back to $2^{32}$.
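Running the decomposition function above on both operands makes the exponent gap explicit (expected output described in the comments):

```python
print(float_to_fe(2.0 ** 32))  # -> "1.<23 zero bits> * 2^32"
print(float_to_fe(1.0))        # -> "1.<23 zero bits> * 2^0"
# The exponents differ by 32, but the fraction field holds only 23 bits,
# so every significant bit of 1.0 is shifted out during alignment.
```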
Conversion Between Floating-Point Formats
Converting between floating-point formats can lead to:
- Overflow: If the source value exceeds the target format’s smaller exponent range.
- Loss of Precision/Underflow: If the target format has fewer fraction bits (rounding error), or if the value falls below the target’s smallest representable magnitude (flushing toward zero).
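Both effects can be observed by converting through IEEE 754 half precision, which the `struct` module supports via the `'e'` format code (a sketch; the exact exception raised and its message may vary by Python version):

```python
import struct

def to_fp16(x):
    # Round-trip a Python float (double precision) through IEEE 754 half precision.
    return struct.unpack('>e', struct.pack('>e', x))[0]

print(to_fp16(0.1))    # ~0.09998: only 10 fraction bits survive the conversion
print(to_fp16(1e-8))   # 0.0: underflows below the smallest FP16 denormalized value
try:
    to_fp16(70000.0)   # exceeds the FP16 maximum of 65504
except (OverflowError, struct.error) as err:
    print("overflow:", err)
```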