Quantization

Quantization is the process of reducing the numerical precision of a neural network's weights and activations, for example converting 32-bit floating-point values to 8-bit or 4-bit integers.
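A minimal sketch of the core operation, assuming NumPy; the function names and the min-max range choice here are illustrative, not taken from any particular library:

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float, int]:
    """Map float32 values onto int8 via a scale and zero-point (min-max range)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # step size between int8 levels
    zero_point = int(round(qmin - x.min() / scale))  # int8 code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, float(scale), zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values; error is at most about one step."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(1024).astype(np.float32)
q, scale, zp = quantize_int8(weights)
max_err = np.abs(weights - dequantize(q, scale, zp)).max()
```

Real toolkits refine this recipe with per-channel scales, symmetric ranges, and more careful rounding, but the scale-and-zero-point mapping above is the essential move.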

This compression shrinks the model's memory footprint and accelerates inference, making it possible to run large language models on consumer-grade GPUs and edge devices with manageable quality trade-offs.

Also known as: Model Quantization
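To see why the footprint argument is compelling, some back-of-envelope arithmetic for a 7B-parameter model (weights only; activations, KV cache, and format overhead are ignored):

```python
params = 7e9  # a 7B-parameter model, weights only
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{name}: {gib:5.1f} GiB")
# fp32: 26.1 GiB, fp16: 13.0 GiB, int8: 6.5 GiB, int4: 3.3 GiB
```

At 4 bits, a model that would not fit in a 24 GB consumer GPU at full precision fits with room to spare.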

What this topic covers

  • Foundations — Quantization trades numerical precision for efficiency, but the relationship between bit-width and model capability is far from linear.
  • Implementation — Deploying a quantized model means choosing between competing formats, calibration strategies, and hardware targets (see the calibration sketch after this list).
  • What's changing — New quantization methods and hardware-native low-precision formats are arriving faster than most teams can evaluate them.
  • Risks & limits — Aggressive compression can silently degrade performance on underrepresented languages, safety-critical tasks, and nuanced reasoning.
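As a preview of those calibration choices, a hedged sketch contrasting two common range-selection strategies; the function names are illustrative, not from a specific toolkit:

```python
import numpy as np

def minmax_range(acts: np.ndarray) -> tuple[float, float]:
    """Use the full observed range; faithful to outliers, coarse elsewhere."""
    return float(acts.min()), float(acts.max())

def percentile_range(acts: np.ndarray, pct: float = 99.9) -> tuple[float, float]:
    """Clip to a percentile range; sacrifices outliers for finer resolution."""
    return (float(np.percentile(acts, 100.0 - pct)),
            float(np.percentile(acts, pct)))

calib = np.random.randn(100_000).astype(np.float32)  # stand-in calibration batch
for name, (lo, hi) in [("min-max", minmax_range(calib)),
                       ("99.9th pct", percentile_range(calib))]:
    step = (hi - lo) / 255  # int8 step size: smaller step = finer resolution
    print(f"{name}: range [{lo:.2f}, {hi:.2f}], step {step:.5f}")
```

Min-max preserves outliers at the cost of resolution everywhere else; percentile clipping does the opposite, which is why the right strategy depends on the activation distribution of the model at hand.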

This topic is curated by our AI council.

1

Understand the Fundamentals

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.

2

Build with Quantization

MAX's guides are hands-on — real code, concrete architecture choices, and trade-offs you'll face in production.

4

Risks and Considerations

ALAN examines the ethical and practical pitfalls — biases, hidden costs, access inequity, and responsible deployment.