Post Training Quantization
Also known as: PTQ, post-training weight quantization, weight-only quantization
Post training quantization (PTQ) compresses a pre-trained model’s weights to lower-precision formats using a small calibration dataset, reducing memory use and speeding up inference without retraining the model.
What It Is
Every large language model starts life as billions of weight values stored in 16-bit or 32-bit floating-point precision. Running these models demands expensive GPU memory that most teams cannot justify. Post training quantization addresses this by converting those weights to lower-precision formats — 8-bit, 4-bit, or even 3-bit integers — after the model has already been trained, using only a small calibration dataset of a few hundred samples.
Think of it like compressing a RAW photograph into a JPEG. The original captures every tonal detail but takes enormous storage. The compressed version looks nearly identical at a fraction of the file size. PTQ does the same for neural network weights: it reduces numerical precision enough to shrink the model while preserving most of its output quality.
During calibration, the algorithm analyzes how each weight contributes to the model’s output, then rounds values to fit smaller data types. According to PTQ Benchmark (2025), three main strategy families dominate the current field: compensation-based methods like GPTQ that adjust remaining weights to offset rounding errors, salience-based methods like AWQ that protect the most important weights at higher precision, and rotation-based methods like QuIP and QuIP# that transform weight matrices so quantization error distributes more evenly.
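The rounding step at the heart of all three families can be sketched as plain round-to-nearest with one scale per weight group. This is a minimal illustration, not GPTQ, AWQ, or QuIP themselves; those methods add error compensation, salience weighting, or rotations on top of this baseline:

```python
import numpy as np

def quantize_groups(w, bits=4, group_size=128):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)     # stand-in for one weight matrix
q, scale = quantize_groups(w, bits=4)
w_hat = dequantize(q, scale).reshape(-1)
print("mean abs rounding error:", np.abs(w - w_hat).mean())
```

The per-group scale is what calibration tunes: smaller groups track the local weight range more tightly, at the cost of storing more scales.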
PTQ differs fundamentally from quantization-aware training (QAT), which simulates low-precision arithmetic during the training process itself. QAT typically produces better accuracy at extreme compression, but it requires the full training setup and dataset — a cost measured in thousands of GPU-hours. PTQ achieves comparable results in minutes to hours on a single machine. This speed difference explains why PTQ dominates real-world LLM deployment.
The tradeoff becomes critical below 4-bit precision. According to PTQ Benchmark (2025), 3-bit represents the practical floor for pure PTQ methods — pushing below that threshold causes accuracy to degrade sharply, especially on reasoning and math tasks. This is the accuracy collapse that makes sub-4-bit quantization so difficult to get right.
How It’s Used in Practice
The most common scenario is a team that wants to run an open-weight model locally or on a smaller GPU. They download a pre-quantized version — typically in GGUF format (a single-file packaging standard for quantized models) for llama.cpp or as a GPTQ/AWQ model for Python inference frameworks — and load it immediately. No training infrastructure needed, no dataset preparation, just a converted model that fits into available memory.
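GGUF itself is a binary container with its own specification; purely to illustrate the pack-once, load-and-run idea, here is a toy single-file roundtrip (a hypothetical layout, not the real GGUF format, and in-memory here where a real deployment would read from disk):

```python
import io
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)

# "Quantize" offline: int8 weights plus one fp32 scale per output row,
# packed into a single file-like artifact.
scale = np.abs(w).max(axis=1, keepdims=True) / 127
buf = io.BytesIO()
np.savez(buf, q=np.round(w / scale).astype(np.int8), scale=scale)

# Deployment side: load the single artifact and run immediately,
# no training infrastructure, just dequantize-on-load.
buf.seek(0)
packed = np.load(buf)
w_loaded = packed["q"].astype(np.float32) * packed["scale"]
x = rng.normal(size=(128,)).astype(np.float32)
y = w_loaded @ x
print("max reconstruction error:", np.abs(w - w_loaded).max())
```

The deployment side never sees the original weights or any training code, which is exactly what makes pre-quantized checkpoints so convenient to ship.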
Cloud providers follow the same pattern at scale. According to AWS ML Blog, services like Amazon SageMaker offer built-in PTQ workflows using GPTQ and AWQ, letting teams deploy quantized models as API endpoints without managing the compression process themselves.
A newer development is hybrid PTQ-QAT, where teams apply PTQ first and then fine-tune a subset of layers through a short QAT loop. According to Emergent Mind, this approach captures most of PTQ’s speed advantage while recovering some accuracy at aggressive bit widths.
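A toy numerical sketch of that hybrid idea, using a straight-through estimator (STE) to push gradients through the rounding step. This is illustrative only: real hybrid pipelines fine-tune selected transformer layers against task data, not a single matrix against its own full-precision outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)).astype(np.float32)
x = rng.normal(size=(16,)).astype(np.float32)
target = W @ x                       # full-precision outputs we want to preserve

def fake_quant(w, bits=4):
    """Round-to-nearest through a symmetric 4-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

# Step 1: plain PTQ, quantize once with no training.
err_ptq = np.abs(fake_quant(W) @ x - target).mean()

# Step 2: short QAT-style loop. The STE treats round() as identity on
# the backward pass, so the residual's gradient updates W directly.
W_ft, lr = W.copy(), 0.01
for _ in range(100):
    residual = fake_quant(W_ft) @ x - target
    W_ft -= lr * np.outer(residual, x)        # dL/dW under the STE
err_hybrid = np.abs(fake_quant(W_ft) @ x - target).mean()
print(f"PTQ error {err_ptq:.4f} -> hybrid error {err_hybrid:.4f}")
```

The refinement nudges the stored full-precision weights so that their *rounded* versions reproduce the original outputs, recovering accuracy the one-shot rounding gave up.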
Pro Tip: Start with 4-bit AWQ or GPTQ quantization for your first deployment. Test on your actual use cases before dropping to 3-bit — accuracy degradation is task-specific, and reasoning-heavy workloads hit the wall first.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Deploying a model to consumer GPUs with limited video memory (VRAM) | ✅ | |
| Maximum accuracy on math or multi-step reasoning tasks | | ❌ |
| Quick deployment without retraining infrastructure | ✅ | |
| Sub-3-bit compression for extreme memory savings | | ❌ |
| Serving open-weight models at scale with tight cost budgets | ✅ | |
| Fine-tuning a model for a specialized domain at the same time | | ❌ |
Common Misconception
Myth: Quantizing a model always degrades quality in a uniform, predictable way — you lose a fixed percentage of accuracy per bit removed.
Reality: Accuracy loss from PTQ is uneven and task-specific. A model might handle simple text generation well at 4-bit precision but fail on multi-step reasoning or code generation at the same bit width. The degradation pattern depends on which weights carry the most information for a given task, making blanket “quality loss” percentages misleading.
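One way to see why the loss is uneven: at the same bit width, relative error depends heavily on the weight distribution. A quick synthetic demonstration with and without outlier weights (real models show this effect per layer and per task, which is why degradation is hard to predict):

```python
import numpy as np

def relative_quant_error(w, bits=4):
    """Relative error after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_hat = np.round(w / scale) * scale
    return np.abs(w - w_hat).mean() / np.abs(w).mean()

rng = np.random.default_rng(0)
smooth = rng.normal(size=10_000)
spiky = smooth.copy()
spiky[:10] *= 50                     # a handful of outlier weights

# Same 4 bits, very different damage: outliers inflate the scale,
# so the bulk of the weights land on a much coarser grid.
print(relative_quant_error(smooth), relative_quant_error(spiky))
```

This outlier sensitivity is precisely what salience-based methods like AWQ target: keep the few weights that set the dynamic range in higher precision, and the rest quantize cleanly.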
One Sentence to Remember
Post training quantization shrinks a finished model’s weights to lower precision in hours rather than days, but pushing below four bits risks task-specific accuracy collapse that you cannot predict without testing on your actual workload.
FAQ
Q: What is the difference between PTQ and quantization-aware training? A: PTQ compresses weights after training using a small calibration set. QAT simulates low precision during training itself, producing better accuracy at extreme compression but requiring the full training pipeline and dataset.
Q: How low can you quantize with PTQ before quality breaks down? A: According to PTQ Benchmark (2025), three-bit is the practical floor for pure PTQ methods. Below that, accuracy collapses sharply, particularly on reasoning and structured output tasks.
Q: Which PTQ method should I try first? A: AWQ or GPTQ at 4-bit precision are the safest starting points. Both have mature tooling and broad framework support, offering the best tradeoff between memory savings and output quality.
Sources
- PTQ Benchmark (2025): Benchmarking Post-Training Quantization in LLMs - Systematic comparison of PTQ methods across bit widths and model families
- AWS ML Blog: Accelerating LLM Inference with PTQ using AWQ and GPTQ on SageMaker - Practical deployment guide for PTQ on cloud infrastructure
Expert Takes
Post training quantization is a lossy compression problem. The algorithm maps continuous weight distributions to a discrete set of values, and the information lost in that mapping is not recoverable. What makes PTQ research interesting is the search for transformation spaces — rotations, grouping, mixed precision — where rounding errors cancel rather than compound. The practical floor exists because below a certain precision, the error surface becomes too chaotic for compensation strategies to manage.
If you are evaluating PTQ for deployment, run your actual task suite at your target bit width before committing. The tooling has matured — GPTQ and AWQ both have stable libraries with clear documentation. Start at four-bit precision, measure latency and accuracy on your workload, and only drop lower if the memory budget demands it. Treat the quantized model as a new artifact that needs its own validation pass, not a smaller copy of the original.
PTQ is the reason open-weight models compete with API providers on cost. A team with a single consumer GPU can now serve a model that required a data center two years ago. The strategic question is not whether to quantize — everyone does — but where the quality floor sits for your product. The companies shipping quantized models fastest are capturing the deployment layer while others debate precision thresholds.
Every bit you remove is an editorial decision disguised as engineering. When a quantized model drops accuracy on reasoning tasks but maintains fluency on casual chat, who decides which capability matters more? The users running compressed models rarely see benchmark comparisons for their specific task. They trust the provider’s choice of bit width and method without knowing what was traded away. Informed consent about model compression barely exists outside research circles.