MONA explainer 11 min read March 26, 2026

GPTQ vs AWQ vs GGUF vs bitsandbytes: Quantization Formats and Their Tradeoffs Explained

ELI5

Quantization shrinks LLM weights from 16-bit to 4-bit or lower, cutting memory by 75% so large models run on smaller hardware — at the cost of some precision.

Take a 70-billion-parameter model. Strip three-quarters of the numerical precision from every weight. Run it. The outputs are nearly identical to the original — so close that most benchmarks barely register the difference. You discarded information, and nothing visible broke. That gap between expectation and outcome is where the interesting engineering lives, and the four dominant quantization formats exploit it in fundamentally different ways.

The Arithmetic Underneath

Every parameter in a large language model is a floating-point number, and the format of that number determines how much memory it consumes. Before comparing specific quantization formats, three foundational concepts shape everything that follows.

What concepts do you need to understand before learning about LLM quantization?

Quantization is the process of reducing the numerical precision of model weights: mapping continuous or high-precision values onto a smaller, discrete set. A weight stored in 16-bit floating point occupies two bytes; the same weight in 4-bit integer occupies half a byte. Multiply that difference across the 70 billion parameters of a large model and you are looking at the gap between needing roughly 140 GB of VRAM for the weights and needing 35 GB.
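The arithmetic behind those two figures can be sketched in a few lines. This is a back-of-the-envelope estimate for weight storage only; it assumes 1 GB = 1e9 bytes and ignores the KV cache, activations, and the per-group scale metadata that real quantized files carry. The function name is illustrative, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# A 70-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Going from 16-bit to 4-bit divides the weight footprint by four, which is where the 75% figure comes from.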

Not compression. Approximation.

The mechanism does not shrink files the way a zip algorithm does. It narrows the representational grid — choosing coarser steps to place numbers on, then accepting the rounding errors that follow. The art is in deciding which rounding errors you can tolerate.
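To make the "coarser grid" idea concrete, here is a minimal sketch of plain round-to-nearest quantization with a single shared scale. It is the simplest possible scheme, not what any of the four formats actually ship; the weight values are made up for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest onto a symmetric integer grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax    # width of one grid step
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, 0.04, -0.61]   # illustrative values
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(codes)                    # integer codes in the range [-7, 7]
print(max(errors), scale / 2)   # worst rounding error vs. half a grid step
```

Every weight lands within half a grid step of its original value. Everything the real formats do is about choosing the grid, or nudging the weights, so that this unavoidable error lands where it hurts least.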

Post-training quantization (PTQ) applies this precision reduction after a model has already been trained; the weights are fixed, and you are approximating them with fewer bits. Mixed-precision training, by contrast, uses different precisions during training itself, typically running the forward and backward passes in lower precision while keeping a higher-precision master copy of the weights for the update step. Most LLM quantization today is post-training. The reason is economic: retraining a 70-billion-parameter model in reduced precision costs nearly as much as training it in the first place.

A newer entrant, FP8, uses 8-bit floating point rather than integer representation, preserving more dynamic range than INT8 at the same memory cost. It is gaining traction for inference but has not yet consolidated into a standard quantization format the way the four we are comparing here have.

Four Formats, Four Philosophies

The four dominant formats solve the same problem — fit a large model into less memory — but their engineering assumptions diverge sharply. That divergence determines when each one excels and when it fails.

What are the main LLM quantization formats, and how do GPTQ, AWQ, GGUF, and bitsandbytes differ?

GPTQ (Frantar et al., ICLR 2023) treats quantization as an optimization problem. It uses approximate second-order information, the Hessian of each layer's reconstruction error, to decide which rounding errors matter most, then compensates for each weight's error by adjusting the not-yet-quantized weights in the same row before moving on. The result: 3-bit and 4-bit models with negligible accuracy loss on most text tasks, delivering a 3.25x to 4.5x speedup over FP16 depending on the GPU (GPTQ Paper). The tradeoff is calibration time; GPTQ needs a small representative dataset to compute those error-compensation steps. One thing to know: AutoGPTQ, the original Python toolkit, was archived in April 2025. Its successor, GPTQModel v5.8.0, now covers GPTQ, AWQ, GGUF, FP8, and EXL3 under one roof.

AWQ (Lin et al., MLSys 2024, Best Paper) starts from a different observation. Not all weights contribute equally to output quality. Roughly 1% of weight channels, those connected to high-magnitude activations, account for a disproportionate share of model accuracy. AWQ identifies those salient channels using offline activation statistics and scales them before quantization, giving the critical weights more of the precision budget while letting the rest be rounded aggressively. No backpropagation, no reconstruction loss: just a one-pass scaling operation followed by standard quantization (AWQ Paper). The llm-awq repository has no formal version releases; updates track by commit rather than semver.
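A toy example shows why scaling a salient channel helps. The sketch below is not the llm-awq implementation: the weights, activations, and the scale of 2.0 are hand-picked for illustration, whereas AWQ derives per-channel scales from activation statistics. The idea it demonstrates is the same, though: multiply the salient weight by s before quantization, divide by s afterwards, and its rounding error shrinks as long as the group maximum is unchanged.

```python
def quantize_dequantize(weights, bits=4):
    """Round-to-nearest with one shared scale, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One output row of a linear layer; input channel 1 sees large activations.
weights = [0.84, 0.16, -0.40, 0.28]        # illustrative values
activations = [1.0, 20.0, 1.0, 1.0]        # channel 1 is "salient"
channel_scales = [1.0, 2.0, 1.0, 1.0]      # hand-picked, not searched

reference = dot(weights, activations)

# Plain round-to-nearest.
plain = quantize_dequantize(weights)
plain_error = abs(dot(plain, activations) - reference)

# Scale up before quantizing, undo the scale afterwards.
scaled = quantize_dequantize([w * s for w, s in zip(weights, channel_scales)])
effective = [q / s for q, s in zip(scaled, channel_scales)]
awq_error = abs(dot(effective, activations) - reference)

print(f"plain error:  {plain_error:.2f}")
print(f"scaled error: {awq_error:.2f}")
```

In this example the salient weight's rounding error is halved, and because that channel dominates the output, the output error drops with it. In a real deployment the division by s is folded into the preceding operation rather than applied at runtime.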

GGUF (Georgi Gerganov Universal Format) is the odd one out. It is not a quantization algorithm; it is a file format designed for distribution and local inference. Introduced in August 2023 by the llama.cpp project, GGUF packages model weights into a single file that runs on CPU, Apple Metal, CUDA, AMD HIP, Vulkan, and a growing list of backends (llama.cpp GitHub). The quantization itself happens through llama.cpp's quantize tool, offering a spectrum from 1.58-bit to 8-bit, including importance-matrix-aware quants like IQ2_XS and IQ4_XS. GGUF's advantage is reach: any machine with a CPU can run inference. No GPU required.

bitsandbytes takes a different position entirely. It quantizes dynamically at load time — no calibration data, no preprocessing step. Load a model from Hugging Face, pass a quantization flag, and the weights are converted on the fly to 8-bit (LLM.int8()) or 4-bit (NF4 or FP4). Nested quantization saves an additional 0.4 bits per parameter (HF Docs). But the distinctive feature is not compression — it is training. bitsandbytes is the only format in this comparison that supports fine-tuning through QLoRA, making it the default choice when you need to adapt a quantized model, not just serve it. Version 0.49.2 supports NVIDIA CUDA, AMD ROCm, Intel XPU, and Apple Metal — though Metal performance remains noticeably slower, not yet production-grade on macOS.
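The "0.4 bits per parameter" figure for nested quantization can be reproduced with simple arithmetic. The block sizes below are assumptions taken from the QLoRA paper's description of double quantization (64 weights per block, 32-bit scale constants, second-level blocks of 256 constants quantized to 8 bits); they are not stated in this article, and the helper name is illustrative.

```python
def scale_overhead_bits(block_size: int, scale_bits: int) -> float:
    """Extra bits per parameter spent on storing one scale per block."""
    return scale_bits / block_size


# Without nesting: one 32-bit scale for every 64 weights.
single = scale_overhead_bits(block_size=64, scale_bits=32)

# With nesting: 8-bit scales per 64 weights, plus one 32-bit
# second-level scale for every 256 of those first-level scales.
nested = scale_overhead_bits(64, 8) + scale_overhead_bits(64 * 256, 32)

saving = single - nested
print(f"overhead without nesting: {single:.3f} bits/param")
print(f"overhead with nesting:    {nested:.3f} bits/param")
print(f"saving:                   {saving:.3f} bits/param")
```

The saving rounds to 0.4 bits per parameter. On a 70-billion-parameter model that is roughly 3 GB, which can be the difference between fitting on a given GPU and not.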

Format	Bit Widths	Calibration Needed	Training Support	Primary Platform
GPTQ	2/3/4-bit	Yes (small dataset)	No	GPU (CUDA)
AWQ	INT3/INT4	Yes (activation stats)	No	GPU (CUDA)
GGUF	1.58-bit to 8-bit	Optional (importance matrix)	No	CPU + GPU (multi-backend)
bitsandbytes	4-bit / 8-bit	No	Yes (QLoRA)	Multi-platform

The Calibration Divide

The distinction between how these formats arrive at their quantized weights is more consequential than the bit widths themselves.

What is the difference between post-training quantization and quantization-aware training?

All four formats use post-training quantization — they reduce weight precision after training is complete. But they differ in how much information they bring to the rounding decision.

GPTQ uses the most information: second-order derivatives that estimate how much each weight's rounding error will affect the layer's output. This makes its rounding the best informed at a given bit width, but also the most expensive to prepare. AWQ uses less information, just activation magnitudes, but applies it more surgically: protect the critical 1%, let everything else be rounded hard. GGUF's quantize tool can optionally use an importance matrix computed from a calibration dataset, but most users skip this step and accept the default rounding. bitsandbytes skips calibration entirely; its NF4 data type is designed to be information-theoretically optimal for normally distributed weights, which works surprisingly well without seeing any data at all.

Quantization-aware training (QAT) takes the opposite approach. It inserts simulated quantization nodes during training, letting the model learn to compensate for precision loss through gradient updates. QAT typically yields better accuracy than PTQ at the same bit width. But for models with tens of billions of parameters, QAT’s memory cost undermines its purpose — you need full-precision gradients alongside the quantized forward pass, which can exceed the memory savings you were trying to achieve (NVIDIA Blog). This is why PTQ dominates the LLM quantization space; the models are too large for QAT to be practical outside specialized research.

The closest working approximation is QLoRA through bitsandbytes: load the base model in 4-bit, then train small adapter matrices in full precision. The base weights stay frozen — it is not true QAT — but it captures some of the same benefit by letting the model learn around its own quantization errors.
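The structure of that setup fits in a few lines. The sketch below is a toy under stated assumptions: the matrices are tiny and made up, the "frozen base" stands in for dequantized 4-bit weights, and the usual LoRA scaling factor is omitted. It is not the bitsandbytes or PEFT implementation, only the shape of the computation: output = frozen base + low-rank adapter.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Stand-in for dequantized 4-bit base weights; never updated during training.
frozen_base = [[0.3, -0.6],
               [0.9,  0.0]]

# Rank-1 adapter kept in higher precision; these are the only trainable values.
lora_a = [[0.1, 0.2]]       # shape (rank=1, in=2)
lora_b = [[0.0], [0.0]]     # shape (out=2, rank=1), zero-initialised


def forward(x, b_matrix):
    base_out = matvec(frozen_base, x)
    adapter_out = matvec(b_matrix, matvec(lora_a, x))
    return [b + a for b, a in zip(base_out, adapter_out)]


x = [1.0, 2.0]
before = forward(x, lora_b)                 # adapter contributes nothing yet
after = forward(x, [[0.5], [-0.5]])         # pretend training moved B
print(before, after)
```

With the adapter zero-initialised, the model starts out exactly equal to the quantized base. Training then moves only the adapter, which is how it can absorb some of the base's rounding error without touching the 4-bit weights.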

[Figure: GPTQ, AWQ, GGUF, and bitsandbytes quantization pipelines compared by calibration requirements, bit widths, and primary use cases. Each format makes different tradeoffs between calibration effort, hardware reach, and accuracy retention.]

What the Throughput Numbers Predict — and Where They Mislead

On a Qwen2.5-32B model served through vLLM, the performance hierarchy is stark: AWQ with Marlin kernels reaches 741 tokens per second, GPTQ with Marlin hits 712, FP16 (unquantized) delivers 461, bitsandbytes manages 168, and GGUF trails at 93 (PremAI Guide). These numbers are hardware-specific — measured on a particular GPU generation — and absolute throughput will shift across different setups.

The pattern, though, is instructive. AWQ and GPTQ with optimized kernels outperform even unquantized FP16 because the memory bandwidth savings offset the dequantization overhead. The model spends less time waiting for weights to arrive from VRAM and more time computing. If your bottleneck is memory bandwidth — and for most GPU-served LLMs, it is — quantization does not just save memory. It makes inference faster.
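A rough ceiling estimate shows why fewer bits can mean more speed. The sketch assumes single-stream decoding in which every generated token streams all weights from memory once, and it uses a hypothetical bandwidth of 1,000 GB/s; neither the function nor that figure comes from the benchmark above, and the result is an upper bound, not a prediction.

```python
def decode_tokens_per_second_ceiling(num_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if weight traffic were the only cost."""
    weight_bytes = num_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


BANDWIDTH_GB_S = 1000   # hypothetical GPU memory bandwidth

fp16 = decode_tokens_per_second_ceiling(32e9, 16, BANDWIDTH_GB_S)
int4 = decode_tokens_per_second_ceiling(32e9, 4, BANDWIDTH_GB_S)

print(f"FP16 ceiling:  {fp16:.1f} tokens/s")
print(f"4-bit ceiling: {int4:.1f} tokens/s")
print(f"ratio:         {int4 / fp16:.1f}x")
```

The theoretical gain from 16-bit to 4-bit is 4x. The measured gain in the benchmark is closer to 1.6x (741 versus 461 tokens per second), and the gap is the price of dequantization, batching effects, and everything else that is not weight traffic.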

bitsandbytes pays a throughput penalty for its dynamic quantization and training capability. GGUF's numbers through vLLM are misleading; GGUF was not designed for GPU serving farms. It was designed for local machines, laptops, and CPUs, where continuous batching is irrelevant and the alternative is not running the model at all.

Quality retention tells a complementary story. At 4-bit precision, AWQ retains roughly 95% of original model accuracy, bitsandbytes holds at 95% or above, GGUF's Q4_K_M sits around 92%, and GPTQ comes in near 90% (PremAI Guide). These are aggregate numbers across general text tasks; GPTQ's quality varies more sharply by domain, with code generation showing larger degradation than conversational text. Temperature and sampling parameters interact with these precision losses; higher temperatures explore the tails of the probability distribution, where quantized weights are least accurate and rounding errors compound.

Rule of thumb: if you need GPU throughput, choose AWQ or GPTQ with Marlin kernels. If you need to run on a laptop or CPU, choose GGUF. If you need to fine-tune, bitsandbytes is the only option. If you are unsure, AWQ at 4-bit is the safest default for serving.
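The rule of thumb can be written down as a small decision function. This is a sketch of the article's heuristic only; the function name and its two inputs are illustrative, and a real decision would also weigh latency budgets, kernel support, and quality on your own tasks.

```python
def choose_format(needs_finetuning: bool, has_gpu: bool) -> str:
    """Encode the rule of thumb: training first, then hardware, then speed."""
    if needs_finetuning:
        return "bitsandbytes"      # the only option here that supports QLoRA
    if not has_gpu:
        return "GGUF"              # laptops and CPU-only machines
    return "AWQ (4-bit)"           # safest default for GPU serving


print(choose_format(needs_finetuning=True, has_gpu=True))
print(choose_format(needs_finetuning=False, has_gpu=False))
print(choose_format(needs_finetuning=False, has_gpu=True))
```

GPTQ with Marlin kernels is an equally valid answer in the last branch; the function returns AWQ only because the text names it as the default when you are unsure.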

When it breaks: all four formats degrade on tasks requiring precise numerical reasoning, rare-token generation, or long code sequences — domains where rounding errors accumulate rather than average out. The lower the bit width, the steeper the degradation curve on these edge cases.

Compatibility notes:
AutoGPTQ (archived): Archived April 2025; replaced by GPTQModel v5.8.0 covering GPTQ, AWQ, GGUF, FP8, and EXL3. Hugging Face Transformers is deprecating AutoGPTQ integration. Migrate to GPTQModel.
vLLM legacy quant formats: An active RFC proposes deprecating 19 legacy quantization kernels, including AWQ Marlin/Triton and GPTQ Marlin/BitBLAS. Monitor vLLM #30136 before locking your serving stack.

The Data Says

Quantization is not a lossy shortcut — it is a precision-engineering tradeoff where the format you choose reveals what you value: raw GPU speed (AWQ, GPTQ), hardware reach (GGUF), or the ability to keep training (bitsandbytes). The models are more redundant than their parameter counts suggest, and the right format exploits that redundancy without touching the knowledge that matters.

Sources

GPTQ Paper: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Method, speedup benchmarks, and bit-width accuracy analysis
AWQ Paper: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - Salient weight protection and activation-aware scaling method
llama.cpp GitHub: ggml-org/llama.cpp - GGUF format specification, supported platforms, and quantization options
GPTQModel GitHub: ModelCloud/GPTQModel - AutoGPTQ successor with multi-format support
HF Docs: Bitsandbytes — Hugging Face Transformers Documentation - NF4, nested quantization, and QLoRA integration
PremAI Guide: LLM Quantization Guide: GGUF vs AWQ vs GPTQ vs bitsandbytes Compared (2026) - vLLM throughput benchmarks and quality retention comparisons
NVIDIA Blog: Model Quantization: Concepts, Methods, and Why It Matters - PTQ vs QAT methodology and memory tradeoffs

Aha Moments

MAX

Mona mapped the decision cleanly, so let me add the implementation angle she left implicit. The format you choose is not just a compression setting — it is an infrastructure constraint that propagates through your entire serving stack. GPTQ and AWQ lock you into GPU-first deployments with specific kernel dependencies; if the serving framework deprecates those kernels, your quantized artifacts become dead weight. GGUF gives you portability but surrenders batched throughput. bitsandbytes gives you training flexibility but the slowest inference path. What most teams miss is that the quantization decision should follow the deployment specification, not precede it. Define your latency budget, your hardware fleet, and whether the model needs adaptation — then the format becomes deterministic. Choosing a format because it scored well on a benchmark, without knowing your memory ceiling, is building backwards.

DAN

Max is right that this is an infrastructure decision, but he is underestimating how fast the infrastructure itself is moving. That vLLM RFC to deprecate legacy kernels is a signal — the serving layer is consolidating, and formats that depend on bespoke kernel support carry risk. AWQ holds the throughput crown today, but its lead depends on Marlin kernel maintenance. GGUF is the cockroach of quantization formats: unglamorous, runs anywhere, and will outlast every optimized kernel tied to a single framework. bitsandbytes owns the training niche, which makes it strategically durable — as long as fine-tuning matters, it matters. The ecosystem is quietly splitting into two lanes: GPU-optimized serving, where AWQ and GPTQ compete on kernel speed, and universal local inference, where GGUF has no serious competitor. Teams that commit to a single lane without a fallback plan will feel it within a product cycle.

ALAN

Both of you are discussing which format wins the performance race. Neither is asking who gets to run these models at all. GGUF democratized local inference — a graduate student with a laptop can run a model that previously required institutional GPU access. That consequence extends beyond engineering convenience; it determines who can independently audit, reproduce, and challenge these systems. When quantization optimization focuses exclusively on GPU throughput in centralized serving farms, model access consolidates around organizations that can afford the hardware. bitsandbytes’ QLoRA capability carries a parallel weight — it lets researchers with modest compute budgets adapt models rather than merely consuming them. The technical tradeoffs Mona laid out are real, but they carry a distributional question underneath: as the ecosystem converges on fewer formats and fewer serving frameworks, does the ability to independently verify model behavior expand — or contract?

AI-assisted content, human-reviewed. Images AI-generated.