FP8

Also known as: FP8, 8-bit floating point, fp8 quantization

FP8
An 8-bit floating-point format used to store and compute AI model weights and activations at half the memory cost of FP16, with two encodings — E4M3 for inference and E5M2 for training — supported natively on modern GPU hardware.

FP8 is an 8-bit floating-point number format that cuts AI model memory in half compared to 16-bit formats while preserving nearly all accuracy during both inference and training.

What It Is

When AI models run on GPUs, they store billions of numerical values — weights, activations, gradients. Traditional 32-bit and 16-bit formats represent these numbers accurately, but they consume large amounts of memory and slow down computation. FP8 exists because the AI industry needed a number format that sharply reduces memory without sacrificing the accuracy those larger formats provide. In the context of quantization — where models get compressed to run faster and cheaper — FP8 sits at the practical boundary where accuracy holds steady before the degradation that hits at lower bit widths.

Think of FP8 as a shorthand notation for numbers. Instead of writing every decimal place, you round to fewer digits. You lose some precision, but for neural network calculations, that lost precision almost never changes the final answer.

FP8 comes in two flavors, each designed for a different job. E4M3 uses 4 bits for the exponent (the range of values it can represent) and 3 bits for the mantissa (the precision within that range). This balance makes E4M3 well-suited for inference, where you need accurate representations of model weights that tend to cluster in narrow value ranges. E5M2 flips the allocation — 5 exponent bits and only 2 mantissa bits — giving it a wider dynamic range at the cost of precision. That wider range handles the large value swings common in training gradients without overflow. According to NVIDIA Blog, these two encodings were specifically designed to match the computational demands of forward passes and backward passes in deep learning.
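
To make the trade-off concrete, here is a minimal sketch using PyTorch's float8 dtypes (available in recent PyTorch releases); the sample values are illustrative.

```python
# A minimal sketch of the E4M3 vs E5M2 trade-off, using PyTorch's
# float8 dtypes (torch.float8_e4m3fn, torch.float8_e5m2).
import torch

x = torch.tensor([0.1234, 1.5, 300.0], dtype=torch.float32)

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    roundtrip = x.to(dtype).to(torch.float32)  # encode to 8 bits, decode back
    print(dtype)
    print("  round-trip:", roundtrip.tolist())
    print("  max representable:", torch.finfo(dtype).max)

# E4M3's extra mantissa bit rounds each value more finely, but its range
# tops out near 448; E5M2 reaches roughly 57344 at the cost of coarser steps.
```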

The hardware side matters just as much as the format itself. According to vLLM Docs, FP8 delivers roughly 2x memory reduction and up to 1.6x throughput improvement over FP16 on supported hardware — NVIDIA Hopper, Ada Lovelace, and Blackwell GPUs, as well as Intel Gaudi 2 and AMD MI300X. “Native support” means the GPU has dedicated circuits for FP8 arithmetic — no software emulation, no overhead. Without that hardware backing, FP8 would be an interesting specification on paper with no practical speed benefit.
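
As a back-of-the-envelope illustration of the memory half of that claim, the sketch below counts weight bytes only (activations and KV cache add more); the 70B-parameter model and 80 GB GPU are assumptions for the example, not figures from the docs.

```python
# Weight-memory arithmetic behind the "2x memory reduction" figure.
# The model size and GPU capacity are illustrative assumptions.
import math

params = 70e9            # hypothetical 70B-parameter model
gpu_gb = 80              # hypothetical 80 GB accelerator

fp16_gb = params * 2 / 1e9   # 2 bytes per weight -> 140 GB
fp8_gb = params * 1 / 1e9    # 1 byte per weight  ->  70 GB

print(math.ceil(fp16_gb / gpu_gb), "GPU(s) for FP16 weights")  # 2
print(math.ceil(fp8_gb / gpu_gb), "GPU(s) for FP8 weights")    # 1
```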

How It’s Used in Practice

The most common place you’ll encounter FP8 is in serving large language models. When a team deploys a model with billions of parameters, the difference between FP16 and FP8 means fitting the model on fewer GPUs — or fitting a larger model on the same hardware. Serving frameworks like vLLM now support FP8 quantization (specifically the W8A8 scheme, where both weights and activations use 8 bits), making it a practical option for teams running inference at scale.
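
As a rough sketch of what enabling that looks like, assuming a recent vLLM release on FP8-capable hardware (the model name is only an example):

```python
# A minimal vLLM serving sketch with the FP8 (W8A8) path enabled at
# load time. Assumes a recent vLLM release and an FP8-capable GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

outputs = llm.generate(
    ["Explain FP8 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```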

In the broader quantization picture, FP8 represents the floor where accuracy stays intact. According to Liu et al. (COLM 2025), W8A8 FP8 quantization is lossless across all tasks and model sizes tested. Drop below 8 bits into 4-bit or 2-bit territory, and accuracy starts degrading in task-specific, often unpredictable ways. FP8 is the safe zone before that cliff.

Pro Tip: If you’re evaluating quantization options for a model, start with FP8 before experimenting with lower precisions like INT4 or GPTQ. FP8 on supported hardware gives you meaningful memory savings with essentially zero accuracy cost — it’s the low-risk baseline to benchmark everything else against.
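
As a first sanity check before a full benchmark, a naive round-trip of a weight-sized tensor through E4M3 gives a feel for the raw rounding error; note that production FP8 paths also calibrate per-tensor scales, which this sketch omits.

```python
# A rough sanity check of raw E4M3 rounding error on a weight-sized
# tensor. Real FP8 pipelines apply calibrated scaling factors, which
# this naive round-trip omits.
import torch

w = torch.randn(4096, 4096, dtype=torch.float32)
w8 = w.to(torch.float8_e4m3fn).to(torch.float32)

rel_err = (w - w8).abs().mean() / w.abs().mean()
print(f"mean relative rounding error: {rel_err.item():.4f}")  # typically a few percent
```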

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Deploying an LLM on Hopper or newer GPUs | ✅ | |
| Running inference where accuracy is critical (medical, legal) | | ❌ |
| Targeting older GPUs without native FP8 support | | ❌ |
| Training large models that need gradient stability | ✅ (E5M2) | |
| Maximizing compression on edge devices with severe memory limits | | ❌ |
| Benchmarking before trying sub-4-bit quantization formats | ✅ | |

Common Misconception

Myth: FP8 is just a less accurate version of FP16 — you always lose quality when you drop precision. Reality: For inference workloads, FP8 (specifically E4M3 with proper calibration) produces results statistically indistinguishable from FP16. The measurable accuracy loss is near zero in well-calibrated setups, which is why recent research calls it “lossless” quantization. The real accuracy problems begin below 8 bits, not at 8 bits.

One Sentence to Remember

FP8 is the sweet spot of model quantization — half the memory of FP16, native GPU acceleration, and no meaningful accuracy loss — which makes it the natural baseline before reaching for more aggressive compression that risks the accuracy collapse seen at sub-4-bit widths.

FAQ

Q: What is the difference between FP8 and INT8 quantization? A: FP8 uses floating-point representation with exponent and mantissa bits, giving it better dynamic range than INT8’s fixed integer format. FP8 handles outlier values in model weights more gracefully without clipping.
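
One way to see that difference is to quantize a small vector containing an outlier with a naive symmetric INT8 scheme and compare it to an FP8 round-trip (the values are illustrative):

```python
# Contrasting naive symmetric INT8 with an FP8 round-trip on a vector
# that contains one outlier. Values are illustrative.
import torch

w = torch.tensor([0.01, 0.1, 0.5, 12.0])     # 12.0 is the outlier

scale = w.abs().max() / 127                   # per-tensor symmetric INT8 scale
w_int8 = torch.clamp((w / scale).round(), -127, 127) * scale

w_fp8 = w.to(torch.float8_e4m3fn).to(torch.float32)

print("INT8:", w_int8.tolist())  # the outlier stretches the grid; 0.01 collapses to 0.0
print("FP8: ", w_fp8.tolist())   # small values keep their relative precision
```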

Q: Do I need special hardware to use FP8? A: Yes. FP8 requires GPUs with native support — NVIDIA Hopper, Ada Lovelace, or Blackwell series, Intel Gaudi 2, or AMD MI300X. Without native hardware, there is no speed benefit.

Q: Can FP8 be used for model training, not just inference? A: Yes. The E5M2 encoding is designed specifically for training gradients, providing the wider dynamic range needed for backward passes. E4M3 handles the forward pass during training.
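
A small illustration of the range argument, using PyTorch's float8 dtypes with made-up gradient magnitudes:

```python
# Why gradients favor E5M2's wider exponent range. The sample
# magnitudes are illustrative, not from a real training run.
import torch

g = torch.tensor([1e-4, 0.5, 2000.0])  # gradient-like spread of magnitudes

print(torch.finfo(torch.float8_e4m3fn).max)       # 448.0: 2000.0 exceeds E4M3's range
print(g.to(torch.float8_e5m2).to(torch.float32))  # all three fit in E5M2
```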

Expert Takes

FP8 is a case study in precision engineering at the format level. The split between E4M3 and E5M2 reflects the statistical distribution of values in neural networks. Weights cluster in narrow ranges, favoring mantissa precision. Gradients span orders of magnitude, favoring exponent range. The format design follows the math, not the other way around. That fit is why accuracy holds at eight bits and collapses below it.

When you’re setting up a serving stack, FP8 is the first quantization lever to pull. The workflow is clean: calibrate your model against a representative dataset, convert to FP8 using your serving framework’s built-in tools, and verify outputs against your FP16 baseline. If the results match — and they almost always do at this precision — you just halved your memory footprint with a single configuration change.
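
Here is a hedged sketch of that verify step, assuming vLLM with greedy decoding so outputs are directly comparable; the model name and prompts are placeholders, and in practice you would run the two configurations in separate processes rather than loading both models at once.

```python
# A sketch of the "verify against FP16" step with greedy decoding.
# Model name and prompts are placeholders; in practice, run the two
# configurations in separate processes to avoid holding both in memory.
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
prompts = ["What is FP8?", "Name one benefit of quantization."]
greedy = SamplingParams(temperature=0.0, max_tokens=32)

def run(**kwargs):
    llm = LLM(model=MODEL, **kwargs)
    return [o.outputs[0].text for o in llm.generate(prompts, greedy)]

baseline = run()                       # FP16 baseline
quantized = run(quantization="fp8")    # FP8 W8A8

matches = sum(a == b for a, b in zip(baseline, quantized))
print(f"{matches}/{len(prompts)} greedy outputs match the FP16 baseline")
```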

FP8 redrew the economics of model deployment. Half the memory per parameter means either half the GPUs or double the model size on the same hardware. Every team stuck deciding between renting more compute and optimizing what they have should be running FP8 on supported hardware first. The compression is free. The cloud bill is not.

The “lossless” label deserves scrutiny. Lossless across all tasks tested is not the same as lossless across all tasks that exist. What about workloads not yet benchmarked? What about models trained on underrepresented languages or domains where activation distributions look different from the training sets used to validate FP8? Before assuming universal safety, ask what wasn’t measured — and who that exclusion affects.