Batch API
Also known as: Message Batches API, async batch inference, batch inference endpoint
- Batch API
- A Batch API is an asynchronous endpoint for submitting large volumes of LLM requests as a single job, processed offline with results retrieved by polling. Providers including Anthropic and OpenAI offer batch endpoints at a significant cost reduction versus real-time calls, trading delivery latency for lower token costs.
A Batch API is an asynchronous LLM endpoint that queues large volumes of requests for offline processing, returning results within hours at a significant cost discount versus real-time calls.
What It Is
Model tiering and prompt caching are the two most-discussed cost levers in LLM infrastructure. Batch API is the third — and for the right workloads, often the most straightforward one. Instead of calling an LLM and waiting for an immediate response, you submit a file of requests as a single job, let the provider process them in the background, and retrieve the results when they are ready.
Think of it like cargo shipping versus a taxi. A real-time API call holds compute capacity idle until your request arrives — the equivalent of a driver sitting at the curb waiting. A batch job packages all your requests together and hands them off; the provider loads them across available capacity whenever conditions are right. You give up immediate delivery. The provider gives you a lower rate.
According to Anthropic Docs, the Anthropic Batch API accepts up to 100,000 requests per batch submitted as a JSONL file — a structured text format where each line is one complete request — processes them asynchronously, and returns results typically within an hour, with a maximum processing window of 24 hours. The technical constraints are fixed: no streaming, no multi-turn tool-use loops, no synchronous responses. Requests go in, the queue runs, results come back via polling.
The savings come from how providers allocate compute. Real-time APIs must reserve capacity and hold it available per request. Batch submissions fill GPU time opportunistically across a processing window — providers schedule them during lower-demand periods and pass the utilization savings back as a discount. According to Anthropic Docs, batch pricing delivers a 50% reduction on both input and output token rates compared to the standard synchronous API. OpenAI Docs confirms an equivalent 50% batch discount across their models.
Batch API sits alongside model tiering and prompt caching as a cost-reduction strategy, not a replacement for either. Caching reduces cost on repeated system prompts. Tiering routes simpler tasks to cheaper models. Batching cuts cost by shifting when the workload runs — no routing logic, no quality tradeoff, just a different submission pattern on work that tolerates a delay.
How It’s Used in Practice
The most common scenario is any scheduled data pipeline that does not need results immediately: re-scoring a customer database after updating a classification prompt, generating product descriptions overnight for a new catalog, or running sentiment analysis across months of support tickets before a quarterly review.
A content team building an AI-assisted publishing workflow might queue two hundred article summaries at the end of the day, retrieve them the next morning, and feed them into an editorial queue — same model, same quality, half the API cost. No change to prompts, output format, or downstream logic.
Pro Tip: Batch API and prompt caching discounts stack. According to Anthropic Docs, combining batch pricing with cache reads on certain models can bring effective input token costs well below either discount alone — a particularly significant saving for workloads where a large system prompt repeats across thousands of requests in the same batch job.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Nightly data enrichment: tagging, scoring, classification | ✅ | |
| Customer-facing chatbot with live response requirement | ❌ | |
| Regenerating a product catalog or knowledge base overnight | ✅ | |
| Interactive workflows requiring multi-turn tool-use loops | ❌ | |
| Sentiment analysis across months of historical records | ✅ | |
| Real-time content moderation with sub-second SLA | ❌ |
Common Misconception
Myth: Batch API is an enterprise feature for teams running massive data pipelines with dedicated engineering support.
Reality: Any developer making the same LLM call hundreds of times per day can switch non-urgent requests to batch mode and cut that workload’s token cost in half. The change is a different API endpoint and a slightly different request format — not a new architecture.
One Sentence to Remember
If the job does not need a response in under a minute, batch mode costs half as much — so any scheduled workflow, pre-computation step, or overnight enrichment run is a candidate for switching.
FAQ
Q: How long does batch processing actually take? A: According to Anthropic Docs, most batches complete in under an hour. The guaranteed maximum processing window is 24 hours. Actual time depends on batch size and current queue load.
Q: Can I use Batch API for multi-turn conversations or agentic tool calls? A: No. Batch API supports single-turn, standalone requests only. Multi-turn conversations, tool-use loops, and anything requiring intermediate responses must use the synchronous real-time API.
Q: Do Batch API discounts stack with prompt caching discounts? A: Yes. According to Anthropic Docs, the pricing reductions compound — combining batch pricing with cache reads reduces effective token costs substantially below either discount applied alone.
Sources
- Anthropic Docs: Batch Processing — Claude Platform Docs - Official documentation covering request format, limits, latency guarantees, and result retrieval
- Anthropic Docs: Pricing — Claude Platform Docs - Batch pricing rates and cache stacking details per model
Expert Takes
The Batch API decouples submission from execution, which is where the savings come from. Synchronous APIs must reserve compute capacity and hold it idle until your request arrives. Batch submission drops requests into a shared queue; the provider fills that queue across available compute windows, eliminating idle reservation time. The discount reflects actual utilization efficiency — providers batch similar-format requests, maximizing GPU throughput. The tradeoff is entirely about when you need your answer, not about request quality or model capability.
Before you reach for model tiering, check whether the workload is latency-tolerant. If it is, Batch API is often the right first move — a single API change that halves costs without touching your model choice, system prompts, or routing logic. The limitations matter: no streaming, no tool loops, no multi-turn chains. For pipelines that run batch inference followed by conditional tool calls, split the pipeline at the sync/async boundary and handle each side independently.
Most teams treating model tiering as their primary cost lever are leaving money on the table. Batch processing is the simpler win — no routing logic, no quality tradeoff, just a different API call pattern on any workflow that tolerates a modest delay. The teams that get this right pair it with prompt caching and compress their LLM spend substantially without degrading output quality or user experience.
The batch discount masks a supply-side calculation that rarely gets examined. Providers are smoothing demand peaks by paying you to be flexible about timing. The downstream question is whether that optimization pressure stays invisible — or eventually nudges teams toward architectures that delay human feedback loops, aggregate requests in ways that reduce individual context, or make audit trails harder to maintain. The efficiency benefit is real; the architectural implications deserve the same scrutiny.