Floating-point numbers are stored using three components:
| Component | bfloat16 | Single (32-bit) | Double (64-bit) | Purpose |
|---|---|---|---|---|
| Sign | 1 bit | 1 bit | 1 bit | 0 = positive, 1 = negative |
| Exponent | 8 bits | 8 bits | 11 bits | Determines the scale/magnitude |
| Mantissa (Fraction) | 7 bits | 23 bits | 52 bits | Stores the precision/significant digits |
| Total Size | 16 bits | 32 bits | 64 bits | Memory footprint |
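The bit layout in the table can be inspected directly. The sketch below uses only the Python standard library; the helper name `float32_fields` is just for illustration:

```python
import struct

def float32_fields(x: float):
    """Return (sign, exponent, mantissa) of x as stored in a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1        # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF           # 23 fraction bits (the leading 1 is implicit, not stored)
    return sign, exponent, mantissa

print(float32_fields(0.625))   # (0, 126, 2097152) -- mantissa bits are 0100000...
```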
The actual value is calculated as:

value = (-1)^sign x 1.mantissa x 2^(exponent - bias)

Where:
- sign is the sign bit (0 or 1)
- 1.mantissa is the implicit leading 1 followed by the stored fraction bits (for normal numbers)
- exponent is the stored exponent field, and bias is 127 for bfloat16 and float32, or 1023 for float64
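For example, float32 stores 0.625 with sign = 0, a biased exponent of 126 (binary 01111110), and fraction bits 0100000... (value 0.25, so the significand is 1.25). Plugging into the formula: (-1)^0 x 1.25 x 2^(126 - 127) = 1.25 x 0.5 = 0.625.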
Converting the fractional part of a decimal number to binary uses a different algorithm than converting integers:
Example: Convert 0.625 to binary
| Fraction | x 2 Result | Bit (integer part) | New Fraction |
|---|---|---|---|
| 0.625 | 1.25 | 1 | 0.25 (subtract the 1) |
| 0.25 | 0.5 | 0 | 0.5 |
| 0.5 | 1.0 | 1 | 0.0 (done!) |
Reading top to bottom: 0.625 (base 10) = 0.101 (base 2)
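The multiply-by-2 procedure is easy to automate. Here is a minimal sketch in Python; the function name `frac_to_binary` and the 12-bit cut-off are illustrative choices, the cut-off being needed for fractions whose binary expansion never terminates:

```python
def frac_to_binary(frac: float, max_bits: int = 12) -> str:
    """Convert a fraction in [0, 1) to a binary string by repeated doubling."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1:
            bits.append("1")
            frac -= 1          # subtract the integer part, keep the remainder
        else:
            bits.append("0")
    return "0." + "".join(bits)

print(frac_to_binary(0.625))   # 0.101           (terminates)
print(frac_to_binary(0.1))     # 0.000110011001  (truncated; the pattern repeats forever)
```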
Why some fractions can't be exactly represented:
Just like 1/3 = 0.333... in decimal (repeating forever), many decimal fractions become repeating patterns in binary. For example, 0.1 (base 10) = 0.0001100110011... (base 2) (repeating). This is why 0.1 + 0.2 != 0.3 in many programming languages!
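A quick check in Python (any language that uses IEEE 754 doubles behaves the same way):

```python
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False
print(f"{0.1:.20f}")      # 0.10000000000000000555 -- the stored value is not exactly 0.1
```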
Brain Floating Point 16 (bfloat16) is a 16-bit format developed by Google Brain for machine learning. It's essentially a truncated float32 that keeps the same 8-bit exponent but reduces the mantissa from 23 bits to just 7 bits.
Think of it as: bfloat16 = float32 with the last 16 bits chopped off
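That "chop off the last 16 bits" view can be written directly with bit operations. Below is a minimal sketch using NumPy; the helper name `chop_to_bfloat16` is illustrative, and it truncates rather than rounding to the nearest bfloat16, which is what real conversions typically do:

```python
import numpy as np

def chop_to_bfloat16(x):
    """Zero out the low 16 bits of a float32, i.e. keep sign + 8 exponent + 7 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)   # reinterpret the bit pattern
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)   # drop the bottom 16 mantissa bits

print(chop_to_bfloat16(3.14159265))   # 3.140625 -- only ~3 significant digits survive
```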
| Property | bfloat16 | float32 | float64 |
|---|---|---|---|
| Total Bits | 16 | 32 | 64 |
| Sign + Exponent + Mantissa | 1 + 8 + 7 | 1 + 8 + 23 | 1 + 11 + 52 |
| Dynamic Range (Max) | +/-3.4 x 10^38 | +/-3.4 x 10^38 | +/-1.7 x 10^308 |
| Smallest Normal Positive | ~1.18 x 10^-38 | ~1.18 x 10^-38 | ~2.23 x 10^-308 |
| Decimal Precision | ~2-3 digits | ~7 digits | ~15-16 digits |
| Precision (bits) | ~7-8 bits (~0.78%) | ~24 bits (~0.000012%) | ~53 bits (~2 x 10^-14%) |
| Memory per Value | 2 bytes | 4 bytes | 8 bytes |
| Memory vs float32 | 50% less | baseline | 2x more |
| Primary Use Case | ML Training | General Purpose | Scientific Computing |
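The range and precision rows above can be verified programmatically. A minimal sketch assuming PyTorch is available (torch.finfo reports the limits of each floating-point dtype):

```python
import torch

for dtype in (torch.bfloat16, torch.float32, torch.float64):
    info = torch.finfo(dtype)
    # max: largest finite value; tiny: smallest normal positive value;
    # eps: gap between 1.0 and the next representable value (relative precision)
    print(f"{str(dtype):15s}  max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")
```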
Example 1: pi
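A sketch of what this example can look like, assuming PyTorch is available:

```python
import math
import torch

for dtype in (torch.bfloat16, torch.float32, torch.float64):
    stored = torch.tensor(math.pi, dtype=dtype).item()   # round pi to dtype, read it back
    print(f"{str(dtype):15s} {stored!r}")

# torch.bfloat16   3.140625           -- wrong from the 3rd decimal place onward
# torch.float32    3.1415927410125732 -- ~7 correct digits
# torch.float64    3.141592653589793  -- ~15-16 correct digits
```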
Example 2: Small number (gradient in neural net)
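A sketch of this example, again assuming PyTorch; the point is that a tiny gradient-sized value sits comfortably inside bfloat16's exponent range, but only a couple of its significant digits survive:

```python
import torch

grad = 1.2345e-20   # a tiny, gradient-like value (hypothetical)

bf16 = torch.tensor(grad, dtype=torch.bfloat16).item()
f32  = torch.tensor(grad, dtype=torch.float32).item()

print(bf16)   # roughly 1.23e-20: the magnitude survives thanks to the 8-bit exponent,
              # but only ~2-3 significant digits of the original value remain
print(f32)    # ~1.2345e-20: about 7 significant digits preserved
```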
Example 3: Accumulation error
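A sketch of accumulation error, assuming PyTorch: repeatedly adding 1.0 into a bfloat16 accumulator stalls at 256, because the gap between neighbouring bfloat16 values at that magnitude is 2, so each further +1 rounds back to the same number:

```python
import torch

acc = torch.tensor(0.0, dtype=torch.bfloat16)
for _ in range(1000):
    acc = acc + 1.0            # each result is rounded back to bfloat16

print(acc.item())                        # 256.0, not 1000.0
print(sum(1.0 for _ in range(1000)))     # 1000.0 -- Python floats are 64-bit, no stall here
```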
Use bfloat16 when:
- You are training neural networks and speed/memory matter more than precision
- You need float32's dynamic range but can tolerate only ~2-3 significant digits
- Halving memory and bandwidth per value is worth the precision loss

Use float32 when:
- You need a balanced, general-purpose format (~7 significant digits)
- Results should be reasonably accurate without the memory cost of float64

Use float64 when:
- Doing scientific or financial calculations where precision is critical (~15-16 digits)
- Accumulating many values, where rounding error would otherwise build up
| Format | Range Coverage | Precision | Best For |
|---|---|---|---|
| bfloat16 | Excellent (same as float32) | Poor (~0.78% relative error) | ML training where speed > precision |
| float32 | Excellent | Good (~0.000012% relative error) | General computing, balanced choice |
| float64 | Outstanding (huge range) | Excellent (~10^-14% relative error) | Scientific/financial work where precision is critical |
Key Insight: bfloat16 sacrifices precision for memory efficiency while keeping the same range as float32. This works for neural networks because they are error-tolerant, but it fails for applications that require exact calculations. The 8-bit exponent means you can represent numbers from tiny (~10^-38) to huge (~10^38), but with only 7 mantissa bits you can't distinguish nearby values: near 1.0 the gap between representable values is 2^-7 ≈ 0.008, so 1.0 and 1.003 round to the same bfloat16 representation.
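A quick check of that last point, assuming PyTorch (near 1.0 the bfloat16 spacing is 2^-7 = 0.0078125, so anything within about half of that collapses onto the same value):

```python
import torch

a = torch.tensor(1.000, dtype=torch.bfloat16)
b = torch.tensor(1.003, dtype=torch.bfloat16)   # rounds down to 1.0
c = torch.tensor(1.010, dtype=torch.bfloat16)   # rounds up to the next grid point

print(a.item(), b.item(), c.item())   # 1.0 1.0 1.0078125
print(torch.equal(a, b))              # True: 1.0 and 1.003 are indistinguishable in bfloat16
```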