Floating-point numbers are stored using three components:
| Component | bfloat16 | Single (32-bit) | Double (64-bit) | Purpose |
|---|---|---|---|---|
| Sign | 1 bit | 1 bit | 1 bit | 0 = positive, 1 = negative |
| Exponent | 8 bits | 8 bits | 11 bits | Determines the scale/magnitude |
| Mantissa (Fraction) | 7 bits | 23 bits | 52 bits | Stores the precision/significant digits |
| Total Size | 16 bits | 32 bits | 64 bits | Memory footprint |
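The bit layout in the table can be inspected directly. The sketch below uses only the Python standard library; the helper name `float32_fields` is just for illustration:

```python
import struct

def float32_fields(x: float):
    """Return (sign, exponent, mantissa) of x as stored in a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1        # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF           # 23 fraction bits (the leading 1 is implicit, not stored)
    return sign, exponent, mantissa

print(float32_fields(0.625))   # (0, 126, 2097152) -- mantissa bits are 0100000...
```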
The actual value is calculated as:

value = (-1)^sign x 1.mantissa x 2^(exponent - bias)

Where:
- sign is the sign bit (0 or 1)
- 1.mantissa is the implicit leading 1 followed by the stored fraction bits (for normal numbers)
- exponent is the stored exponent field, and bias is 127 for bfloat16 and float32, or 1023 for float64
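For example, float32 stores 0.625 with sign = 0, a biased exponent of 126 (binary 01111110), and fraction bits 0100000... (value 0.25, so the significand is 1.25). Plugging into the formula: (-1)^0 x 1.25 x 2^(126 - 127) = 1.25 x 0.5 = 0.625.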
Converting the fractional part of a decimal number to binary uses a different algorithm than converting integers:
Example: Convert 0.625 to binary
| Fraction | x 2 Result | Bit (integer part) | New Fraction |
|---|---|---|---|
| 0.625 | 1.25 | 1 | 0.25 (subtract the 1) |
| 0.25 | 0.5 | 0 | 0.5 |
| 0.5 | 1.0 | 1 | 0.0 (done!) |
Reading top to bottom: 0.625 (base 10) = 0.101 (base 2)
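The multiply-by-2 procedure is easy to automate. Here is a minimal sketch in Python; the function name `frac_to_binary` and the 12-bit cut-off are illustrative choices, the cut-off being needed for fractions whose binary expansion never terminates:

```python
def frac_to_binary(frac: float, max_bits: int = 12) -> str:
    """Convert a fraction in [0, 1) to a binary string by repeated doubling."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1:
            bits.append("1")
            frac -= 1          # subtract the integer part, keep the remainder
        else:
            bits.append("0")
    return "0." + "".join(bits)

print(frac_to_binary(0.625))   # 0.101           (terminates)
print(frac_to_binary(0.1))     # 0.000110011001  (truncated; the pattern repeats forever)
```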
Why some fractions can't be exactly represented:
Just like 1/3 = 0.333... in decimal (repeating forever), many decimal fractions become repeating patterns in binary. For example, 0.1 (base 10) = 0.0001100110011... (base 2) (repeating). This is why 0.1 + 0.2 != 0.3 in many programming languages!
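A quick check in Python (any language that uses IEEE 754 doubles behaves the same way):

```python
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False
print(f"{0.1:.20f}")      # 0.10000000000000000555 -- the stored value is not exactly 0.1
```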
Brain Floating Point 16 (bfloat16) is a 16-bit format developed by Google Brain for machine learning. It's essentially a truncated float32 that keeps the same 8-bit exponent but reduces the mantissa from 23 bits to just 7 bits.
Think of it as: bfloat16 = float32 with the last 16 bits chopped off
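That "chop off the last 16 bits" view can be written directly with bit operations. Below is a minimal sketch using NumPy; the helper name `chop_to_bfloat16` is illustrative, and it truncates rather than rounding to the nearest bfloat16, which is what real conversions typically do:

```python
import numpy as np

def chop_to_bfloat16(x):
    """Zero out the low 16 bits of a float32, i.e. keep sign + 8 exponent + 7 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)   # reinterpret the bit pattern
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)   # drop the bottom 16 mantissa bits

print(chop_to_bfloat16(3.14159265))   # 3.140625 -- only ~3 significant digits survive
```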
| Property | bfloat16 | float32 | float64 |
|---|---|---|---|
| Total Bits | 16 | 32 | 64 |
| Sign + Exponent + Mantissa | 1 + 8 + 7 | 1 + 8 + 23 | 1 + 11 + 52 |
| Dynamic Range (Max) | +/-3.4 x 10^38 | +/-3.4 x 10^38 | +/-1.7 x 10^308 |
| Smallest Normal Positive | ~1.18 x 10^-38 | ~1.18 x 10^-38 | ~2.23 x 10^-308 |
| Decimal Precision | ~2-3 digits | ~7 digits | ~15-16 digits |
| Precision (bits) | ~7-8 bits (~0.78%) | ~24 bits (~0.000012%) | ~53 bits (~2 x 10^-14%) |
| Memory per Value | 2 bytes | 4 bytes | 8 bytes |
| Memory vs float32 | 50% less | baseline | 2x more |
| Primary Use Case | ML Training | General Purpose | Scientific Computing |
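The range and precision rows above can be verified programmatically. A minimal sketch assuming PyTorch is available (torch.finfo reports the limits of each floating-point dtype):

```python
import torch

for dtype in (torch.bfloat16, torch.float32, torch.float64):
    info = torch.finfo(dtype)
    # max: largest finite value; tiny: smallest normal positive value;
    # eps: gap between 1.0 and the next representable value (relative precision)
    print(f"{str(dtype):15s}  max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")
```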
Example 1: pi
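A sketch of what this example can look like, assuming PyTorch is available:

```python
import math
import torch

for dtype in (torch.bfloat16, torch.float32, torch.float64):
    stored = torch.tensor(math.pi, dtype=dtype).item()   # round pi to dtype, read it back
    print(f"{str(dtype):15s} {stored!r}")

# torch.bfloat16   3.140625           -- wrong from the 3rd decimal place onward
# torch.float32    3.1415927410125732 -- ~7 correct digits
# torch.float64    3.141592653589793  -- ~15-16 correct digits
```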
Example 2: Small number (gradient in neural net)
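A sketch of this example, again assuming PyTorch; the point is that a tiny gradient-sized value sits comfortably inside bfloat16's exponent range, but only a couple of its significant digits survive:

```python
import torch

grad = 1.2345e-20   # a tiny, gradient-like value (hypothetical)

bf16 = torch.tensor(grad, dtype=torch.bfloat16).item()
f32  = torch.tensor(grad, dtype=torch.float32).item()

print(bf16)   # roughly 1.23e-20: the magnitude survives thanks to the 8-bit exponent,
              # but only ~2-3 significant digits of the original value remain
print(f32)    # ~1.2345e-20: about 7 significant digits preserved
```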
Example 3: Accumulation error
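A sketch of accumulation error, assuming PyTorch: repeatedly adding 1.0 into a bfloat16 accumulator stalls at 256, because the gap between neighbouring bfloat16 values at that magnitude is 2, so each further +1 rounds back to the same number:

```python
import torch

acc = torch.tensor(0.0, dtype=torch.bfloat16)
for _ in range(1000):
    acc = acc + 1.0            # each result is rounded back to bfloat16

print(acc.item())                        # 256.0, not 1000.0
print(sum(1.0 for _ in range(1000)))     # 1000.0 -- Python floats are 64-bit, no stall here
```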
Use bfloat16 when:
- You are training neural networks and speed/memory matter more than precision
- You need float32's dynamic range but can tolerate only ~2-3 significant digits
- Halving memory and bandwidth per value is worth the precision loss

Use float32 when:
- You need a balanced, general-purpose format (~7 significant digits)
- Results should be reasonably accurate without the memory cost of float64

Use float64 when:
- Doing scientific or financial calculations where precision is critical (~15-16 digits)
- Accumulating many values, where rounding error would otherwise build up
| Format | Range Coverage | Precision | Best For |
|---|---|---|---|
| bfloat16 | Excellent (same as float32) | Poor (~0.78% relative error) | ML training where speed > precision |
| float32 | Excellent | Good (~0.000012% relative error) | General computing, balanced choice |
| float64 | Outstanding (huge range) | Excellent (~10^-14% relative error) | Scientific/financial work where precision is critical |
Key Insight: bfloat16 sacrifices precision for memory efficiency while keeping the same range as float32. This works for neural networks because they are error-tolerant, but it fails for applications that require exact calculations. The 8-bit exponent means you can represent numbers from tiny (~10^-38) to huge (~10^38), but with only 7 mantissa bits you can't distinguish nearby values: near 1.0 the gap between representable values is 2^-7 ≈ 0.008, so 1.0 and 1.003 round to the same bfloat16 representation.
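A quick check of that last point, assuming PyTorch (near 1.0 the bfloat16 spacing is 2^-7 = 0.0078125, so anything within about half of that collapses onto the same value):

```python
import torch

a = torch.tensor(1.000, dtype=torch.bfloat16)
b = torch.tensor(1.003, dtype=torch.bfloat16)   # rounds down to 1.0
c = torch.tensor(1.010, dtype=torch.bfloat16)   # rounds up to the next grid point

print(a.item(), b.item(), c.item())   # 1.0 1.0 1.0078125
print(torch.equal(a, b))              # True: 1.0 and 1.003 are indistinguishable in bfloat16
```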