IEEE 754 Floating-Point Visualizer

The visualizer provides three views for any number you enter:

  • Binary Representation
  • How the Number is Encoded
  • Component Breakdown

How It Works

IEEE 754 Floating-Point Format

Floating-point numbers are stored using three components:

| Component | bfloat16 (16-bit) | float32 (single, 32-bit) | float64 (double, 64-bit) | Purpose |
|---|---|---|---|---|
| Sign | 1 bit | 1 bit | 1 bit | 0 = positive, 1 = negative |
| Exponent | 8 bits | 8 bits | 11 bits | Determines the scale/magnitude |
| Mantissa (Fraction) | 7 bits | 23 bits | 52 bits | Stores the precision/significant digits |
| Total Size | 16 bits | 32 bits | 64 bits | Memory footprint |

The Formula

The actual value is calculated as:

value = (-1)^sign x 1.mantissa x 2^(exponent - bias)

Where:

  • sign is the sign bit (0 or 1)
  • 1.mantissa is the significand: an implicit leading 1 followed by the stored mantissa bits
  • bias is 127 for bfloat16 and float32, and 1023 for float64
  • this form covers normal numbers; zero, subnormal numbers, infinity, and NaN use reserved encodings (see Special Values below)
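
To make the formula concrete, here is a minimal Python sketch (our own illustration, not code from the visualizer) that pulls the three fields out of a float32 bit pattern and re-applies the formula for a normal number:

```python
import struct

def decode_float32(x: float):
    """Split a float32 into sign, biased exponent, and mantissa, then rebuild the value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = (bits >> 31) & 0x1        # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits: the fraction after the implicit 1
    # value = (-1)^sign x 1.mantissa x 2^(exponent - bias), normal numbers only
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_float32(6.5))  # (0, 129, 5242880, 6.5): 6.5 = +1.625 x 2^(129 - 127)
```

The same decoding works for bfloat16 and float64 once the field widths and the bias are swapped in.
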
Special Values

A few bit patterns are reserved and do not follow the formula above:

  • Exponent all 0s, mantissa all 0s: zero (+0 or -0, depending on the sign bit)
  • Exponent all 0s, mantissa non-zero: subnormal numbers (no implicit leading 1; they fill the gap next to zero)
  • Exponent all 1s, mantissa all 0s: +infinity or -infinity
  • Exponent all 1s, mantissa non-zero: NaN (Not a Number)

Converting Fractions to Binary

Converting the fractional part of a decimal number to binary uses a different algorithm than converting integers:

Algorithm: Multiply by 2 repeatedly

1. Take the fractional part
2. Multiply by 2
3. If result >= 1: record "1" and subtract 1
4. If result < 1: record "0"
5. Repeat with the new fractional part

Example: Convert 0.625 to binary

| Fraction x 2 | Result | Bit | New Fraction |
|---|---|---|---|
| 0.625 x 2 | 1.25 | 1 | 0.25 (after subtracting 1) |
| 0.25 x 2 | 0.5 | 0 | 0.5 |
| 0.5 x 2 | 1.0 | 1 | 0.0 (done!) |

Reading top to bottom: 0.625 (base 10) = 0.101 (base 2)
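
Here is a runnable sketch of the multiply-by-2 loop (the function name and the 32-bit cut-off are our own choices, not part of the visualizer):

```python
def fraction_to_binary(fraction: float, max_bits: int = 32) -> str:
    """Convert the fractional part of a number into a string of binary digits."""
    bits = []
    while fraction != 0 and len(bits) < max_bits:  # stop when exact, or give up after max_bits
        fraction *= 2
        if fraction >= 1:
            bits.append("1")
            fraction -= 1
        else:
            bits.append("0")
    return "".join(bits)

print(fraction_to_binary(0.625))  # "101": 0.625 terminates after three bits
print(fraction_to_binary(0.1))    # "0001100110011001..." (32 bits): 0.1 never terminates
```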

Why some fractions can't be exactly represented:

Just like 1/3 = 0.333... in decimal (repeating forever), many decimal fractions become repeating patterns in binary. For example, 0.1 (base 10) = 0.0001100110011... (base 2) (repeating). This is why 0.1 + 0.2 != 0.3 in many programming languages!
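
You can see this directly in Python (any language that uses IEEE 754 doubles behaves the same way):

```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
print(f"{0.1:.20f}")     # 0.10000000000000000555 (the stored value is not exactly 0.1)
```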

Format Comparison: bfloat16 vs float32 vs float64

What is bfloat16?

Brain Floating Point 16 (bfloat16) is a 16-bit format developed by Google Brain for machine learning. It's essentially a truncated float32 that keeps the same 8-bit exponent but reduces the mantissa from 23 bits to just 7 bits.

Think of it as: bfloat16 = float32 with the last 16 bits chopped off
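
A minimal sketch of that truncation view (our own helper, not the visualizer's code): it simply zeroes the low 16 bits of the float32 pattern, i.e. rounds toward zero, whereas real conversions usually round to nearest even and may differ by one ulp.

```python
import struct

def float32_to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 bit pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # float32 bit pattern
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(float32_to_bfloat16(3.14159265))  # 3.140625 (only ~3 significant digits survive)
print(float32_to_bfloat16(0.1))         # 0.099609375 (truncation lands just below 0.1)
```

With round-to-nearest the second result would be 0.10009765625 instead; either way, only the first two or three decimal digits are reliable.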

| Property | bfloat16 | float32 | float64 |
|---|---|---|---|
| Total Bits | 16 | 32 | 64 |
| Sign + Exponent + Mantissa | 1 + 8 + 7 | 1 + 8 + 23 | 1 + 11 + 52 |
| Dynamic Range (Max) | +/-3.4 x 10^38 | +/-3.4 x 10^38 | +/-1.8 x 10^308 |
| Smallest Normal Positive | ~1.18 x 10^-38 | ~1.18 x 10^-38 | ~2.23 x 10^-308 |
| Decimal Precision | ~2-3 digits | ~7 digits | ~15-16 digits |
| Precision (bits) | 8 bits (~0.78% relative error) | 24 bits (~0.000012% relative error) | 53 bits (~0.00000000000002% relative error) |
| Memory per Value | 2 bytes | 4 bytes | 8 bytes |
| Memory vs float32 | 50% reduction | baseline | 2x increase |
| Primary Use Case | ML training | General purpose | Scientific computing |
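
Every numeric entry in this table follows from the three bit widths. The short sketch below (our own illustration; the format tuples are just the standard IEEE 754 and bfloat16 parameters) derives them:

```python
formats = {              # name: (exponent bits, stored mantissa bits)
    "bfloat16": (8, 7),
    "float32": (8, 23),
    "float64": (11, 52),
}

for name, (exp_bits, man_bits) in formats.items():
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2 ** -man_bits) * 2 ** bias  # largest finite value
    min_normal = 2.0 ** (1 - bias)                 # smallest normal positive value
    rel_step = 2 ** -man_bits                      # relative spacing just above 1.0
    digits = (man_bits + 1) * 0.30103              # significand bits (stored + implicit) x log10(2)
    print(f"{name:8s} max {max_finite:.3g}  min normal {min_normal:.3g}  "
          f"relative step {rel_step:.1g}  ~{digits:.1f} decimal digits")
```

Running it reproduces the ~3.4 x 10^38 and ~1.8 x 10^308 maxima, the smallest normal values, and the ~2, ~7, and ~16 decimal-digit figures above.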

Key Trade-offs

bfloat16 Advantages

  • Same range as float32 (8-bit exponent)
  • 50% memory savings
  • 2x bandwidth improvement
  • Faster on AI hardware (TPUs, newer GPUs)
  • Simple conversion to/from float32
  • Rounding noise can act as a mild regularizer in some training setups
  • Gradients keep float32's dynamic range, so they are unlikely to underflow to zero

bfloat16 Limitations

  • Low precision: Only ~3 decimal digits
  • Rounding errors accumulate faster
  • Poor for exact calculations (finance, science)
  • Can lose small gradients if not careful
  • Not suitable for long accumulations (running sums should be carried in float32)
  • Strictly less precise than float32; the savings are only in memory and bandwidth

Precision Loss Examples

Example 1: pi

Actual: 3.141592653589793...
bfloat16: 3.140625 (error: 0.000968, ~0.031%)
float32: 3.1415927 (error: 0.0000001, accurate to 7 digits)
float64: 3.141592653589793 (exact to 15 digits)

Example 2: Small number (a gradient in a neural net)

Actual: 0.000123456789
bfloat16: 0.000123024 (only ~3 significant digits survive)
float32: 0.00012345679 (accurate to ~7-8 significant digits)
float64: 0.000123456789 (accurate)

Example 3: Accumulation error

Add 0.1 to a running total 10,000 times (the exact answer is 1000):
bfloat16: the naive sum stalls at about 32; once the total reaches 32, the increment 0.1 is smaller than half an ulp (0.125), so every further addition rounds away entirely (roughly a 97% error)
float32: ~1000 with a small drift (relative error well under 0.1%)
float64: ~1000.0000000002 (error on the order of 10^-10, negligible)
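
The experiment is easy to reproduce. The sketch below is our own code (not the visualizer's): bfloat16 is emulated by rounding a float32 bit pattern to its top 16 bits, float32 by a pack/unpack round trip, and the last accumulator shows the mixed-precision trick of keeping the running sum in float32:

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float to the nearest float32."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bfloat16(x: float) -> float:
    """Round a Python float to bfloat16 (round-to-nearest-even on the low 16 bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

bf16_sum = f32_sum = f64_sum = mixed_sum = 0.0
for _ in range(10_000):
    bf16_sum = to_bfloat16(bf16_sum + to_bfloat16(0.1))   # everything kept in bfloat16
    f32_sum = to_float32(f32_sum + to_float32(0.1))       # everything kept in float32
    f64_sum += 0.1                                        # ordinary Python float (float64)
    mixed_sum = to_float32(mixed_sum + to_bfloat16(0.1))  # bfloat16 values, float32 accumulator

print(bf16_sum)   # 32.0: the sum stalls once 0.1 is below half an ulp of the total
print(f32_sum)    # ~1000 with a small drift (well under 0.1%)
print(f64_sum)    # ~1000.0000000002: negligible error
print(mixed_sum)  # ~1001: no stalling; the ~0.1% bias comes from bfloat16(0.1) = 0.10009765625
```

This is why mixed-precision training keeps weights and activations in bfloat16 but carries long reductions (sums, means, loss accumulation) in float32.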

When to Use Each Format

Use bfloat16 when:

  • Training large neural networks (transformers, CNNs)
  • Memory bandwidth is a bottleneck
  • You have TPUs or NVIDIA Ampere/Hopper GPUs
  • You need float32's range but can sacrifice precision
  • Mixed-precision training (bfloat16 forward, float32 accumulation)

Use float32 when:

  • General-purpose computing and graphics
  • You need ~7 decimal digits of precision
  • Standard ML inference (after training)
  • Game physics, audio processing
  • Default choice for most applications

Use float64 when:

  • Scientific simulations (weather, physics, astronomy)
  • Financial calculations (must avoid rounding errors)
  • Long numerical integrations
  • Computing mathematical constants
  • You need ~15 decimal digits of precision

Range vs Precision Summary

| Format | Range Coverage | Precision | Best For |
|---|---|---|---|
| bfloat16 | Excellent (same as float32) | Poor (~0.78% relative error) | ML training where speed matters more than precision |
| float32 | Excellent | Good (~0.000012% relative error) | General computing, balanced choice |
| float64 | Outstanding (huge range) | Excellent (~10^-14% relative error) | Scientific/financial work where precision is critical |

Key Insight: bfloat16 sacrifices precision for memory efficiency while keeping the same range as float32. This works for neural networks because they are error-tolerant, but it fails for applications that require exact calculations. The 8-bit exponent means you can represent numbers from tiny (10^-38) to huge (10^38), but with only 7 mantissa bits you cannot distinguish nearby values: the spacing just above 1.0 is 2^-7 = 0.0078125, so 1.0 and 1.003, for example, round to the same representation.
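
You can check that collapse with the same truncation-style emulation used earlier (helper repeated so the snippet stands alone):

```python
import struct

def float32_to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 bit pattern."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(float32_to_bfloat16(1.0) == float32_to_bfloat16(1.003))  # True: both collapse to 1.0
print(float32_to_bfloat16(1.01))                               # 1.0078125, the next representable step
```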