2025-03-14

Floating-Point Representation: A Deep Dive

IEEE 754 Floating-Point Standard

The IEEE 754 standard defines how floating-point numbers are represented in computers. A number, $V$, is expressed as:

$$V = (-1)^s \times M \times 2^E$$

Where:

  - $s$ is the sign bit (0 for positive, 1 for negative),
  - $M$ is the significand (also called the mantissa), and
  - $E$ is the exponent, which weights the value by a power of 2.

Common Floating-Point Formats

The following table summarizes the key characteristics of common floating-point formats:

| Format | Total Bits | Exponent Bits ($k$) | Fraction Bits ($n$) |
| --- | --- | --- | --- |
| Double | 64 | 11 | 52 |
| Float | 32 | 8 | 23 |
| FP16 | 16 | 5 | 10 |
| BF16 | 16 | 8 | 7 |
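
These two widths determine every other property of a format. As a quick sketch (the helper name `format_stats` is just for illustration), the bias, machine epsilon, and value range can all be derived from $k$ and $n$ alone, using the formulas given in the next section:

```python
def format_stats(k, n):
    # Derive a format's key constants from its exponent width k
    # and fraction width n.
    bias = 2 ** (k - 1) - 1
    epsilon = 2.0 ** -n                         # gap between 1.0 and the next value
    max_normal = (2 - 2.0 ** -n) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    min_denormal = 2.0 ** (1 - bias - n)
    return bias, epsilon, max_normal, min_normal, min_denormal

for name, k, n in [("Double", 11, 52), ("Float", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    print(name, format_stats(k, n))
```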

Special Value Categories

Floating-point numbers can represent various special values, defined by the exponent ($e$) and fraction ($f$) fields:

| Category | Condition | Value |
| --- | --- | --- |
| Normalized Values | $0 < e < 2^k - 1$ | $(-1)^s \times (1 + f) \times 2^{e - \text{bias}}$ |
| Denormalized Values | $e = 0$ | $(-1)^s \times f \times 2^{1 - \text{bias}}$ |
| Infinity | $e = 2^k - 1$, $f = 0$ | $(-1)^s \times \infty$ |
| NaN (Not a Number) | $e = 2^k - 1$, $f \neq 0$ | NaN |

Where the bias is $2^{k-1} - 1$.
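
For example, single precision has $k = 8$, so its bias is $2^7 - 1 = 127$; an exponent field of $e = 128$ therefore encodes $E = 128 - 127 = 1$.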

Denormalized numbers serve two crucial purposes:

  1. Representation of Zero: They allow distinct representations of positive ($+0.0$) and negative ($-0.0$) zero, differentiated by the sign bit.
  2. Representation of Values Close to Zero: They enable the representation of numbers very close to $0.0$, filling the gap between zero and the smallest normalized number (both behaviors are demonstrated in the sketch below).
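
Both properties are easy to observe from ordinary Python, whose `float` is an IEEE 754 double:

```python
import math

print(0.0 == -0.0)                # True: +0.0 and -0.0 compare equal...
print(math.copysign(1.0, -0.0))   # -1.0: ...but the sign bit is preserved
tiny = 5e-324                     # smallest positive denormalized double, 2^-1074
print(0.0 < tiny < 2.0 ** -1022)  # True: it sits below the smallest normalized double
```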

Example: 6-Bit Floating-Point Format

Let’s illustrate with a 6-bit floating-point format (1 sign bit, 3 exponent bits, 2 fraction bits, so the bias is $2^2 - 1 = 3$):

| Description | Bit Representation | $E$ | $f$ | Value |
| --- | --- | --- | --- | --- |
| Zero | 0 000 00 | $-2$ | $0/4$ | $0/16$ |
| Smallest Positive | 0 000 01 | $-2$ | $1/4$ | $1/16$ |
| Largest Denormalized | 0 000 11 | $-2$ | $3/4$ | $3/16$ |
| Smallest Normalized | 0 001 00 | $-2$ | $0/4$ | $4/16$ |
| One | 0 011 00 | $0$ | $0/4$ | $4/4$ |
| Largest Normalized | 0 110 11 | $3$ | $3/4$ | $56/4$ |
| Infinity | 0 111 00 | — | — | $\infty$ |
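
The table can be verified mechanically. The sketch below uses only the standard library; `decode6` is a hypothetical helper with this format's layout and bias hard-coded:

```python
from fractions import Fraction

def decode6(bits):
    # Decode a 6-bit pattern: 1 sign bit, k = 3 exponent bits,
    # n = 2 fraction bits, bias = 3.
    s = bits >> 5
    e = (bits >> 2) & 0b111
    f = bits & 0b11
    sign = -1 if s else 1
    if e == 0b111:                   # all-ones exponent: infinity or NaN
        return "NaN" if f else sign * float("inf")
    if e == 0:                       # denormalized: E = 1 - bias = -2
        value = Fraction(f, 4) * Fraction(1, 4)
    else:                            # normalized: E = e - bias
        value = (1 + Fraction(f, 4)) * Fraction(2) ** (e - 3)
    return sign * value

print(decode6(0b000001))  # 1/16 -- smallest positive
print(decode6(0b001100))  # 1    -- one
print(decode6(0b011011))  # 14   -- largest normalized
```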

Rounding Modes

When a number cannot be represented exactly, rounding is necessary. Common rounding modes include:

| Mode | 1.4 | 1.6 | 1.5 | 2.5 | -1.5 |
| --- | --- | --- | --- | --- | --- |
| Round-to-Even | 1 | 2 | 2 | 2 | -2 |
| Round-Toward-Zero | 1 | 1 | 1 | 2 | -1 |
| Round-Down (Floor) | 1 | 1 | 1 | 2 | -2 |
| Round-Up (Ceiling) | 2 | 2 | 2 | 3 | -1 |
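
The table can be reproduced with the standard-library `decimal` module (round-to-even, the IEEE 754 default, is also what Python's built-in `round` uses):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING

values = ["1.4", "1.6", "1.5", "2.5", "-1.5"]
modes = [
    ("Round-to-Even", ROUND_HALF_EVEN),
    ("Round-Toward-Zero", ROUND_DOWN),
    ("Round-Down (Floor)", ROUND_FLOOR),
    ("Round-Up (Ceiling)", ROUND_CEILING),
]
for name, mode in modes:
    # quantize to the nearest integer under the given rounding mode
    row = [int(Decimal(v).quantize(Decimal("1"), rounding=mode)) for v in values]
    print(name, row)
```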

Floating-Point Operations: Precision and Pitfalls

A significant challenge with floating-point arithmetic is the “big eats small” phenomenon. Consider the following expression, evaluated left to right in single precision:

$$3.14 + 1 \times 10^{10} - 1 \times 10^{10} = 0.0$$

This occurs due to the following steps in floating-point addition:

  1. Alignment: Exponents are aligned by shifting the significand of the smaller number until both exponents match.
  2. Significand Addition: The significands are added.
  3. Normalization and Rounding: The result is normalized and rounded if necessary.

Precision loss happens during the alignment step when one number is significantly larger than the other.
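
This is straightforward to reproduce with the standard library alone; the pack/unpack round-trip below is just one way to emulate single-precision rounding (numpy's `float32` would serve equally well):

```python
import struct

def to_f32(x):
    # Round a Python float (a double) to single precision and back.
    return struct.unpack('>f', struct.pack('>f', x))[0]

big = to_f32(1e10)
small = to_f32(3.14)
total = to_f32(small + big)  # alignment shifts all of 3.14's bits out of the 23-bit field
print(total - big)           # 0.0 -- "big eats small"
```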

Python Representation of Floating-Point Numbers

The following Python code demonstrates how to decompose a single-precision float into its sign, significand, and exponent:

```python
import struct

def float_to_fe(f):
    # Decompose a number into "significand * 2^exponent" form,
    # using its single-precision (32-bit) representation.
    packed = struct.pack('>f', f)             # encode as big-endian float32
    int_val = struct.unpack('>I', packed)[0]  # reinterpret the 4 bytes as an integer
    sign = (int_val >> 31) & 1                # 1 sign bit
    exponent = (int_val >> 23) & 0xFF         # 8 exponent bits
    mantissa = int_val & 0x7FFFFF             # 23 fraction bits

    if exponent == 0xFF:  # all-ones exponent: infinity or NaN
        if mantissa == 0:
            return "Infinity" if sign == 0 else "-Infinity"
        else:
            return "NaN"

    bias = 127
    if exponent == 0:
        e = 1 - bias                            # denormalized: E = 1 - bias
        mantissa_binary = f"0.{mantissa:023b}"  # implicit leading 0
    else:
        e = exponent - bias                     # normalized: E = e - bias
        mantissa_binary = f"1.{mantissa:023b}"  # implicit leading 1

    if sign == 1:
        mantissa_binary = "-" + mantissa_binary

    return f"{mantissa_binary} * 2^{e}"
```
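
A few calls exercise the normalized, denormalized, and special-value branches:

```python
print(float_to_fe(3.14))          # 1.10010001111010111000011 * 2^1
print(float_to_fe(-1e-45))        # -0.00000000000000000000001 * 2^-126 (denormalized)
print(float_to_fe(float("inf")))  # Infinity
```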

Example:

$$3.14 = 1.10010001111010111000011 \times 2^1$$

$$1 \times 10^{10} = 1.00101010000001011111001 \times 2^{33}$$

To align $3.14$ with $1 \times 10^{10}$, its significand must be right-shifted by 32 bits. Because the fraction field of a single-precision float holds only 23 bits, every significant bit of $3.14$ is shifted out, and it effectively becomes $0.0$.
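
The `float_to_fe` helper confirms this: in single precision, the sum rounds to exactly the same bit pattern as $1 \times 10^{10}$ alone.

```python
print(float_to_fe(3.14 + 1e10))                       # 1.00101010000001011111001 * 2^33
print(float_to_fe(3.14 + 1e10) == float_to_fe(1e10))  # True
```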

Conversion Between Floating-Point Formats

Converting between floating-point formats can lead to: