MONA explainer 10 min read March 26, 2026

What Is Quantization and How FP32-to-INT4 Compression Makes LLMs Run on Consumer Hardware

MONA examining neural network weights being compressed from wide floating-point blocks into compact integer representations on a circuit board

ELI5

Quantization shrinks each number in a neural network from high-precision (32-bit) to low-precision (4-bit or 8-bit), slashing memory by up to 8x so large models fit on smaller hardware.

A 70-billion-parameter model stored in FP32 occupies roughly 280 GB of memory. That is more than most single data-center GPUs carry. Yet people run that same model on a gaming laptop with 24 GB of VRAM, usually at around 4 bits per weight with the layers that do not fit offloaded to system RAM, and the outputs are coherent. The gap between those two numbers is not a miracle of hardware. It is a deliberate act of mathematical forgetting, and the fact that it works at all tells us something unsettling about how much precision neural networks actually need.

The Geometry of Throwing Numbers Away

Every parameter in a neural network is a number. In full precision (FP32, or 32-bit floating point) each one occupies four bytes and can represent values across a staggering dynamic range: roughly ±3.4 × 10³⁸, with about seven decimal digits of resolution. That resolution matters during mixed precision training, where gradients need fine granularity to converge. But during inference, when the model is generating tokens rather than learning, most of that resolution sits idle, like a concert hall built for an audience that never arrives.
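The FP32 figures above can be checked with nothing but Python's standard library. This is a small illustrative sketch, not part of any quantization toolkit; the helper name to_fp32 is made up for the example.

```python
import struct

def to_fp32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Each FP32 value occupies four bytes.
bytes_per_fp32 = struct.calcsize("<f")

# Largest finite FP32 value: (2 - 2**-23) * 2**127, about 3.4e38.
fp32_max = (2 - 2**-23) * 2.0**127

# FP32 has a 24-bit significand, so not every integer above 2**24
# (about 1.7e7, roughly seven decimal digits) can be stored exactly.
rounded = to_fp32(16_777_217.0)

print(f"bytes per value: {bytes_per_fp32}")
print(f"largest finite value: {fp32_max:.3e}")
print(f"16777217 stored as FP32: {rounded:.0f}")
```

The last line shows where the seven digits run out: the eighth digit of 16,777,217 does not survive the round trip.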

Quantization exploits that idle resolution.

What is quantization in large language models?

Quantization is the process of mapping high-precision floating-point weights to lower-precision representations (typically INT8, INT4, or FP8) using a scale factor and an optional zero-point offset. The core formula for symmetric quantization, which needs no zero-point, is deceptively compact: x_q = clip(round(x / s), α_q, β_q), where s is the scale factor that maps the original value range into the target integer range, and α_q and β_q are the smallest and largest integers the target type can hold (NVIDIA Blog).
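The symmetric formula translates almost line for line into code. The sketch below is a minimal illustration in plain Python; it assumes the scale is derived from the largest absolute value (abs-max scaling) and that the integer range is the restricted one, for example -127 to 127 for INT8. Both choices are assumptions for the example, not something the formula itself dictates.

```python
def symmetric_quantize(values, num_bits=8):
    """Apply x_q = clip(round(x / s), alpha_q, beta_q) with an abs-max scale."""
    beta_q = 2 ** (num_bits - 1) - 1    # largest integer, e.g. 127 for INT8
    alpha_q = -beta_q                   # smallest integer (restricted range, assumed)
    max_abs = max(abs(v) for v in values)
    scale = max_abs / beta_q if max_abs > 0 else 1.0
    quantized = [min(max(round(v / scale), alpha_q), beta_q) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.02, -0.5, 0.13, 1.27, -1.0]

q8, s8 = symmetric_quantize(weights, num_bits=8)
q4, s4 = symmetric_quantize(weights, num_bits=4)

print("INT8:", q8, "restored:", [round(v, 3) for v in dequantize(q8, s8)])
print("INT4:", q4, "restored:", [round(v, 3) for v in dequantize(q4, s4)])
```

With 8 bits the restored values match the originals almost exactly. With 4 bits there are only fifteen levels to share, and the smallest weight collapses to zero.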

The idea is borrowed. Signal processing has used quantization for decades — it is how analog audio becomes a digital file. What changed is the target: instead of compressing sound waves, we are compressing the weight matrices of transformer models with billions of parameters, and the tolerances are different in ways that matter.

Moving from FP32 to FP16 halves memory. FP32 to INT8 cuts it by 4x. And FP32 to INT4 delivers an 8x reduction: that 70-billion-parameter model drops from 280 GB to roughly 35 GB (LocalLLM.in). That is the difference between needing a cluster and coming within reach of a single consumer GPU, with whatever does not fit in VRAM offloaded to system RAM.
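The arithmetic behind those figures is just parameters times bits per weight. A minimal sketch, assuming 1 GB means 10⁹ bytes and counting weights only (no KV cache, activations, or the small overhead of stored scale factors):

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Weights-only footprint in GB, with 1 GB taken as 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70_000_000_000  # the 70-billion-parameter model from the text

footprints = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gigabytes in footprints.items():
    reduction = footprints["FP32"] / gigabytes
    print(f"{name}: {gigabytes:.0f} GB ({reduction:.0f}x smaller than FP32)")
```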

But here is the tension that makes quantization genuinely interesting: you are destroying information, and the model still works. Not perfectly. Not for every task. But well enough that the quality loss is, for many applications, difficult to detect without a benchmark suite.

The question is why.

Where Precision Actually Hides

The naive assumption is that every weight matters equally — that reducing the precision of any individual parameter degrades the output proportionally. If that were true, quantization would be useless. The accumulated rounding errors across billions of parameters would produce noise, not language.

That assumption is wrong. And the reason it is wrong reveals something fundamental about how neural networks store knowledge.

How does quantization reduce model precision from FP32 to INT4 without destroying accuracy?

The weight distributions in a trained transformer are not uniform. They cluster. Most weights sit near zero, forming a tight bell curve, with long tails stretching in both directions. A small fraction of weights carry disproportionate influence on the output — and those weights tend to have larger magnitudes or sit in channels that see high activation values during forward passes.

Post-training quantization (PTQ) takes a trained model and compresses it after the fact, without retraining. The simplest version, round-to-nearest with a fixed scale factor, works surprisingly well at 8-bit precision. At 4-bit, the rounding errors accumulate enough to matter.
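The gap between 8-bit and 4-bit round-to-nearest is easy to see on synthetic data. The sketch below draws bell-curve weights (a stand-in for a real layer, with a standard deviation of 0.02 chosen arbitrarily) and measures the root-mean-square rounding error at both precisions. It assumes one abs-max scale for the whole tensor.

```python
import math
import random

def rtn_roundtrip(values, num_bits):
    """Round-to-nearest quantization followed by dequantization."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / q_max
    return [min(max(round(v / scale), -q_max), q_max) * scale for v in values]

def rms_error(original, restored):
    squared = sum((a - b) ** 2 for a, b in zip(original, restored))
    return math.sqrt(squared / len(original))

random.seed(0)
# Synthetic weights clustered near zero; not taken from a real model.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err_int8 = rms_error(weights, rtn_roundtrip(weights, 8))
err_int4 = rms_error(weights, rtn_roundtrip(weights, 4))

print(f"INT8 RMS error: {err_int8:.6f}")
print(f"INT4 RMS error: {err_int4:.6f}")
print(f"INT4 error is {err_int4 / err_int8:.1f}x larger")
```

The 4-bit grid has 15 levels against 255 for 8-bit, so each step is roughly 18 times wider and the typical rounding error grows with it.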

That is where calibration enters. By passing a small set of representative inputs (roughly 200 samples) through the model and observing which weights produce the largest errors when rounded, calibration methods identify where precision matters most and where it can be safely discarded. This requires no gradient updates — only observation and arithmetic (PyTorch Blog).
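One simple form of calibration can be sketched as a search: try several candidate scale factors and keep the one that gives the smallest output error on the calibration inputs. The example below is a hedged illustration of that idea only, not the procedure of any specific library. The weights, the planted outlier, the random calibration inputs, and the list of clipping ratios are all made up for the demonstration.

```python
import random

def quantize_with_scale(weights, scale, q_max):
    """Round to the integer grid, clip, and map back to floats."""
    return [min(max(round(w / scale), -q_max), q_max) * scale for w in weights]

def output_error(weights, restored, samples):
    """Mean squared difference between exact and quantized dot products."""
    total = 0.0
    for x in samples:
        exact = sum(w * xi for w, xi in zip(weights, x))
        approx = sum(w * xi for w, xi in zip(restored, x))
        total += (exact - approx) ** 2
    return total / len(samples)

def calibrate_scale(weights, samples, num_bits=4,
                    ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clipping ratio with the lowest error on the calibration set."""
    q_max = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    best = None
    for ratio in ratios:
        scale = ratio * max_abs / q_max
        restored = quantize_with_scale(weights, scale, q_max)
        err = output_error(weights, restored, samples)
        if best is None or err < best[1]:
            best = (ratio, err)
    return best

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[10] = 0.3  # a single hypothetical outlier that stretches the range
samples = [[random.gauss(0.0, 1.0) for _ in range(256)] for _ in range(200)]

best_ratio, best_err = calibrate_scale(weights, samples)
_, baseline_err = calibrate_scale(weights, samples, ratios=(1.0,))

print(f"no clipping: error {baseline_err:.5f}")
print(f"best ratio {best_ratio}: error {best_err:.5f}")
```

No gradients are involved: the search only runs forward passes and compares numbers, which is what makes this style of calibration cheap.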

The alternative, quantization-aware training (QAT), inserts simulated quantization noise into the training loop itself, letting the model learn to compensate for reduced precision during optimization. QAT can recover up to 96% of the accuracy gap that basic PTQ leaves behind (PyTorch Blog) — but it requires access to training infrastructure, training data, and compute that most practitioners do not have.
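A toy version of the QAT loop fits in a few lines. The forward pass uses "fake quantized" weights (quantize, then immediately dequantize), while gradient updates go to full-precision latent weights. The sketch assumes the commonly used straight-through estimator, which treats the rounding step as if its gradient were 1; the fixed step size, the two-weight linear model, and the data are all invented for the example.

```python
SCALE = 0.25   # fixed quantization step (assumed for the sketch)
Q_MAX = 7      # signed 4-bit range: -7..7

def fake_quantize(w):
    """Quantize then dequantize, as QAT does in the forward pass."""
    q = min(max(round(w / SCALE), -Q_MAX), Q_MAX)
    return q * SCALE

def loss_and_grad(latent, xs, ys):
    """Mean squared error of the quantized model and its gradient."""
    wq = [fake_quantize(w) for w in latent]
    n = len(xs)
    loss = 0.0
    grad = [0.0] * len(latent)
    for x, y in zip(xs, ys):
        diff = sum(w * xi for w, xi in zip(wq, x)) - y
        loss += diff * diff / n
        for j, xi in enumerate(x):
            # Straight-through estimator: pass the gradient through the
            # rounding step unchanged and apply it to the latent weight.
            grad[j] += 2.0 * diff * xi / n
    return loss, grad

# Toy data generated by the on-grid weights [0.5, -0.75].
xs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
ys = [0.5, -0.75, -0.5, 0.75]

latent = [0.0, 0.0]
initial_loss, _ = loss_and_grad(latent, xs, ys)
for _ in range(200):
    _, grad = loss_and_grad(latent, xs, ys)
    latent = [w - 0.1 * g for w, g in zip(latent, grad)]

final_loss, _ = loss_and_grad(latent, xs, ys)
learned = [fake_quantize(w) for w in latent]
print(f"loss {initial_loss:.4f} -> {final_loss:.4f}, quantized weights {learned}")
```

The latent weights end up somewhere inside the right rounding bins rather than exactly on the grid, which is the point: training only has to steer them into bins where the quantized model behaves well.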

For most users running open-weight models, the practical path is PTQ with a calibration method that knows where to cut. The question becomes: what does “knowing where to cut” actually mean?

The Asymmetry That Calibration Exploits

Not all compression strategies are equal. The difference between a quantized model that produces fluent text and one that hallucinates mid-sentence often comes down to a single decision: how the calibration method identifies which weights to protect and which to round aggressively.

Two methods dominate the current ecosystem, and they solve the same problem from opposite directions.

How do calibration methods like GPTQ and AWQ decide which weights to compress?

GPTQ (Frantar et al., 2022; presented at ICLR 2023) treats quantization as a layer-wise optimization problem. For each layer, it uses approximate second-order information: the Hessian of the layer's output error, computed from calibration inputs, which captures how sensitive that output is to changes in each weight. Weights sitting in steep regions of this error surface get quantized carefully; weights in flat regions get rounded aggressively, because a small perturbation there barely moves the output. After each weight is rounded, the weights not yet quantized are adjusted to compensate for the error just introduced. The result: a 175-billion-parameter model quantized to 3 or 4 bits in approximately four GPU hours, with negligible accuracy loss. On an A100, GPTQ-quantized models run roughly 3.25x faster than their FP16 counterparts (arXiv, Frantar et al.).
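The sensitivity idea can be shown without implementing GPTQ itself. For a linear layer and a squared-error objective on the layer output, the diagonal of the Hessian is (up to a constant factor) the summed squared activation of each input feature. The sketch below, with made-up inputs, perturbs two weights by the same amount and compares the resulting output error. It illustrates only why some weights tolerate rounding better than others; the full algorithm also uses the inverse Hessian to update the remaining weights, which is not shown here.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def input_energy(inputs):
    """Diagonal of X^T X: summed squared activation per input feature."""
    return [sum(x[j] ** 2 for x in inputs) for j in range(len(inputs[0]))]

def output_error(weights, perturbed, inputs):
    """Summed squared change in the layer output over all inputs."""
    return sum((dot(weights, x) - dot(perturbed, x)) ** 2 for x in inputs)

# Hypothetical calibration inputs: feature 0 is loud, feature 1 is quiet.
inputs = [[4.0, 0.1], [-3.0, 0.2], [5.0, -0.1]]
weights = [0.30, -0.20]
delta = 0.01  # the same rounding error applied to each weight in turn

energy = input_energy(inputs)
errors = []
for j in range(len(weights)):
    perturbed = list(weights)
    perturbed[j] += delta
    errors.append(output_error(weights, perturbed, inputs))

for j, (h, e) in enumerate(zip(energy, errors)):
    print(f"weight {j}: sensitivity {h:.2f}, output error {e:.2e}")
```

The same rounding error costs several hundred times more on the first weight than on the second, and the cost matches delta squared times the sensitivity term exactly.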

AWQ (Lin et al., 2023; MLSys 2024 Best Paper Award) approaches the problem from the activation side instead of the weight side. Rather than asking which weights are sensitive to rounding, AWQ asks which weight channels carry the most signal when real data flows through the model. It discovers that protecting just 1% of salient weight channels — identified by activation magnitude, not weight magnitude — dramatically reduces quantization error (arXiv, Lin et al.). Instead of keeping those channels at higher precision (which would complicate hardware execution paths), AWQ applies per-channel scaling that amplifies salient weights before quantization and compensates afterward. The result is a 70-billion-parameter Llama-2 model running on mobile hardware, achieving over 3x speedup compared to FP16.
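The scaling trick can be simulated on random data. In the sketch below one channel per group is treated as salient (its activation is ten times larger than the rest); its weight is multiplied by a factor before quantization and the activation is divided by the same factor afterward, so the exact output is unchanged. The group size, the fixed factor of 2, and the synthetic weights and activations are assumptions for illustration. The published method searches for per-channel factors from activation statistics rather than fixing one.

```python
import random

Q_MAX = 7  # signed 4-bit range: -7..7

def quantize_group(weights):
    """Round-to-nearest with one abs-max scale for the whole group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / Q_MAX if max_abs > 0 else 1.0
    return [min(max(round(w / scale), -Q_MAX), Q_MAX) * scale for w in weights]

def squared_output_error(weights, activations, channel_scales):
    """Scale weights up, quantize, then divide activations by the same factors."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    restored = quantize_group(scaled)
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(wq * (x / s)
                 for wq, x, s in zip(restored, activations, channel_scales))
    return (exact - approx) ** 2

random.seed(0)
GROUP, SALIENT, TRIALS = 16, 0, 2000
plain_scales = [1.0] * GROUP
awq_scales = [2.0 if i == SALIENT else 1.0 for i in range(GROUP)]

plain_total = 0.0
awq_total = 0.0
for _ in range(TRIALS):
    weights = [random.gauss(0.0, 0.02) for _ in range(GROUP)]
    activations = [random.gauss(0.0, 1.0) for _ in range(GROUP)]
    activations[SALIENT] = 10.0  # the loud channel
    plain_total += squared_output_error(weights, activations, plain_scales)
    awq_total += squared_output_error(weights, activations, awq_scales)

plain_mse = plain_total / TRIALS
awq_mse = awq_total / TRIALS
print(f"plain 4-bit error:       {plain_mse:.6f}")
print(f"with salient scaling:    {awq_mse:.6f}")
```

Every weight still ends up as a 4-bit integer, which is why this approach avoids the mixed-precision execution paths mentioned above.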

The distinction matters for intuition: GPTQ looks at the loss landscape and finds the flat spots. AWQ looks at the data flow and finds the loud channels. Both reach similar destinations through different geometry — and in practice, the choice between them often depends on your hardware target and serving framework more than on raw accuracy.

Figure: FP32-to-INT4 quantization pipeline with GPTQ (Hessian-based) and AWQ (activation-based) calibration paths, showing how each method identifies and protects critical weights during compression.

What the Compression Tells You About Your Hardware

The tooling ecosystem around quantization has split along a fault line that maps directly to two different deployment realities: server-side throughput and local-first privacy.

For server-side inference, FP8 is becoming the default precision on NVIDIA Ada Lovelace and Hopper GPUs (compute capability 8.9 or higher), delivering up to 2x latency reduction compared to FP16 with minimal accuracy degradation (vLLM Docs). Serving engines like vLLM support FP8, AWQ, GPTQ, and bitsandbytes quantization natively, often combined with continuous batching for throughput optimization. One caveat: on older Ampere GPUs (compute capability 8.0 and above, which includes the RTX 30-series), FP8 operates as weight-only W8A16, with weights stored in 8-bit and computation in 16-bit. Full W8A8 compute requires Ada Lovelace or Hopper hardware.

For local inference, the GGUF format and llama.cpp dominate. GGUF is a binary format storing tensors alongside metadata, supporting quantization levels from 1.5-bit to 8-bit. The recommended minimum for coding and reasoning models is Q4_K_M or Q5_K_M; below 4-bit, quality drops off a cliff. On the Python side, bitsandbytes (version 0.49.2 as of February 2026) provides 8-bit quantization via LLM.int8() and 4-bit via QLoRA's NF4 data type, a format optimized for normally distributed weights, which transformer parameters tend to be.
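Quantization levels in local formats are usually block-wise: weights are split into small blocks, and each block stores its own scale next to the low-bit integers. The sketch below is a simplified illustration of that idea, assuming blocks of 32 weights with one 16-bit scale each, as in the simplest legacy GGUF layouts. It is not the llama.cpp implementation, and the K-quant levels such as Q4_K_M use more elaborate layouts that are not modeled here.

```python
def bits_per_weight(quant_bits, block_size=32, scale_bits=16):
    """Effective storage cost once the per-block scale is included."""
    return quant_bits + scale_bits / block_size

def quantize_blocks(weights, block_size=32, num_bits=4):
    """Quantize each block with its own abs-max scale."""
    q_max = 2 ** (num_bits - 1) - 1
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / q_max if max_abs > 0 else 1.0
        codes = [min(max(round(w / scale), -q_max), q_max) for w in block]
        blocks.append((scale, codes))
    return blocks

def dequantize_blocks(blocks):
    return [code * scale for scale, codes in blocks for code in codes]

# Two blocks: the second contains an outlier, the first does not.
weights = [0.01 * ((i % 7) - 3) for i in range(32)]
weights += [0.01 * ((i % 5) - 2) for i in range(31)] + [0.9]

blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)

print(f"4-bit blocks cost {bits_per_weight(4)} bits per weight")
print(f"8-bit blocks cost {bits_per_weight(8)} bits per weight")
print(f"scale of block 0: {blocks[0][0]:.5f}, scale of block 1: {blocks[1][0]:.5f}")
```

Because each block has its own scale, the outlier in the second block coarsens only its 31 neighbours and leaves the first block's resolution untouched.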

The interaction between quantization and temperature and sampling settings is worth noting: higher temperature amplifies the noise that rounding introduces into the probability distribution, making quantized models less predictable at creative sampling settings than their full-precision counterparts.

Rule of thumb: If you are running a model locally and need to choose a quantization level, start at Q5_K_M and decrease only if VRAM forces it. Each step down trades quality for memory — and below Q4, the trade becomes steep.
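The rule of thumb amounts to walking down a preference list until something fits. A minimal sketch follows; the file sizes and the 1.5 GB allowance for context and runtime buffers are placeholders invented for the example, so substitute the actual file sizes of the model you intend to run.

```python
def pick_quant_level(levels, vram_gb, overhead_gb=1.5):
    """Return the first level, in preference order, whose file fits in VRAM.

    levels: list of (name, file_size_gb), best quality first.
    overhead_gb: assumed allowance for context and runtime buffers.
    """
    budget = vram_gb - overhead_gb
    for name, size_gb in levels:
        if size_gb <= budget:
            return name
    return None  # nothing fits; consider a smaller model or CPU offload

# Hypothetical file sizes for one model, listed from Q5_K_M downward.
LEVELS = [("Q5_K_M", 5.7), ("Q4_K_M", 4.9), ("Q3_K_M", 4.0)]

for vram in (8.0, 6.5, 6.0, 4.0):
    print(f"{vram} GB VRAM -> {pick_quant_level(LEVELS, vram)}")
```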

When it breaks: Quantization degrades gracefully until it doesn’t. Tasks requiring precise numerical reasoning, long-chain logical deduction, or code generation with exact syntax are the first to suffer at aggressive compression levels. A model that writes fluent prose at Q4 may produce subtly wrong arithmetic at the same precision — and the errors will not announce themselves.

Compatibility note: AutoGPTQ development has stopped. It has been replaced by GPTQModel, which is integrated into Hugging Face Transformers. Migrate existing AutoGPTQ workflows to GPTQModel.

The Data Says

Quantization is not an approximation you tolerate. It is a lens that reveals which parts of a neural network actually carry signal and which parts were always noise dressed up as precision. The 8x memory reduction from FP32 to INT4 is not the interesting finding — the interesting finding is that discarding most of each number’s bit-level resolution barely changes the output. That asymmetry tells us something about where knowledge lives inside these models, and we are still learning what it means.

Sources

NVIDIA Blog: Model Quantization: Concepts, Methods, and Why It Matters - Quantization fundamentals, symmetric/asymmetric methods, precision reduction factors
arXiv (Frantar et al.): GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Hessian-based calibration, 175B model quantization results, A100 speedup benchmarks
arXiv (Lin et al.): AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - Salient channel protection, mobile deployment, MLSys 2024 Best Paper
LocalLLM.in: The Complete Guide to LLM Quantization - FP32-to-INT4 memory reduction calculations
PyTorch Blog: Quantization-Aware Training for Large Language Models with PyTorch - PTQ calibration requirements, QAT accuracy recovery benchmarks
vLLM Docs: FP8 W8A8 Quantization - FP8 latency reduction benchmarks, serving engine integration

Aha Moments

MAX

What Mona mapped here is the specification layer most teams skip — and then wonder why their quantized model hallucinates on edge cases. The calibration method IS the specification. GPTQ and AWQ are not interchangeable black boxes; they optimize for different structural properties of the weight distribution, and the choice between them should be documented in your model card alongside the quantization level, calibration dataset characteristics, and target hardware. The tooling split between server-side and local is real and architectural: teams that mix GGUF workflows with vLLM pipelines without understanding why each exists end up debugging format compatibility instead of optimizing throughput. Treat quantization as an infrastructure decision with a written rationale, not a checkbox on a launch list.

DAN

The part Max is too polite to say plainly: quantization created an entirely new deployment tier, and the market noticed before most engineering teams did. Local inference tools have massive adoption not because of academic curiosity — enterprises want to run models without sending proprietary data to external APIs. AWQ winning the MLSys Best Paper validated that mobile and edge deployment is a real market, not a research demo. The tooling ecosystem Mona described is consolidating fast: vLLM on the server side, llama.cpp on the client side, bitsandbytes bridging the Python training-to-inference gap. Teams that have not chosen their quantization stack are already behind the teams that have.

ALAN

Both of you treat efficiency as if it were morally neutral. But consider what quantization enables at scale: it lowers the barrier to running large models so dramatically that centralized controls — rate limiting, content filtering, usage monitoring — simply disappear. When a model that previously required a data center runs on a laptop, the infrastructure that made oversight possible is no longer in the loop. The quality cliff Mona described at aggressive compression is not just an engineering inconvenience; it means someone will run a model at dangerously low precision, get confident but subtly wrong outputs, and act on them without the instrumentation to detect degradation. The efficiency story is compelling. But has anyone paused to ask whether making powerful models universally portable is a problem we should solve before we fully understand what we are distributing?

AI-assisted content, human-reviewed. Images AI-generated.