Langfuse

Also known as: LLM observability platform, prompt management tool, LLM tracing tool

Langfuse is an open-source LLM engineering platform that combines execution tracing, central prompt versioning, and automated evaluation in a single tool. For teams shipping AI features to production, it provides an audit trail from prompt change to production outcome.

What It Is

When you build a product on top of a language model, the model is only half the challenge. The other half is the prompt — and prompts change constantly. A small wording change can shift response quality, cost, or latency in ways that are hard to measure without the right tooling. Most teams start by tracking prompt changes in shared docs or code comments, with no record of which version was live when a quality problem appeared.

Langfuse is an open-source platform built for this operational layer. It brings three capabilities together in one place: logging every LLM call, managing and versioning prompts centrally, and running quality evaluations against those traces.

Think of it as a merge of two tools developers already know, a version control system (like Git for code) and an application monitoring tool (like Datadog for services), applied specifically to prompts and model interactions. The key difference from using those tools separately is the link: in Langfuse, a trace records which prompt version generated it.

The core components are:

  • Tracing — every LLM call is recorded as a structured entry: the prompt sent, the response returned, the model used, latency, token count, and cost. Nested pipelines (an orchestrator calling sub-agents) appear as a full execution tree, showing where time and tokens went at each step.
  • Prompt management — prompts live in a central registry with a complete version history. Teams can label versions (draft, production) and deploy a specific version without touching application code.
  • Evaluations — automated scoring attached to traces, from rule-based checks, LLM-as-judge scoring, or human annotation. Each score links back to the prompt version that produced the output being scored.
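To make the execution-tree idea concrete, here is a minimal sketch of a trace as nested records, each optionally tagged with the prompt version that produced it. The `Observation` class, its field names, and the sample values are all hypothetical; this is an illustration of the data shape, not the Langfuse SDK or its schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """One node in a trace tree: an LLM call, or a step that wraps child steps.

    Hypothetical structure for illustration only, not the Langfuse data model.
    """
    name: str
    latency_ms: float = 0.0
    tokens: int = 0
    cost_usd: float = 0.0
    prompt_name: Optional[str] = None      # registry prompt used, if any
    prompt_version: Optional[int] = None   # version that produced this output
    children: List["Observation"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens used by this node plus everything beneath it."""
        return self.tokens + sum(child.total_tokens() for child in self.children)

    def total_cost(self) -> float:
        """Cost of this node plus everything beneath it."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)

    def render(self, depth: int = 0) -> List[str]:
        """Return the tree as indented lines, one per node."""
        version = (
            f" prompt={self.prompt_name}@v{self.prompt_version}"
            if self.prompt_name else ""
        )
        lines = [f"{'  ' * depth}{self.name} tokens={self.total_tokens()}{version}"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# An orchestrator that calls two sub-steps, each using its own prompt version.
trace = Observation(
    name="orchestrator",
    children=[
        Observation("retrieve-context", latency_ms=120.0, tokens=300,
                    cost_usd=0.0003, prompt_name="retriever", prompt_version=2),
        Observation("draft-answer", latency_ms=900.0, tokens=1200,
                    cost_usd=0.0024, prompt_name="answer", prompt_version=7),
    ],
)

print("\n".join(trace.render()))
```

Rolling totals up from the leaves is what lets a tree view show where tokens and cost went at each step, while the per-node prompt fields carry the link back to the registry.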

This direct connection between prompt version and measured outcome is what makes Langfuse useful for a prompt versioning workflow. You can compare how one prompt version performed against another not only in a local test harness, but against real production traffic, with the full execution context preserved alongside each result.

How It’s Used in Practice

The most common scenario is a product team running a chatbot, document summarizer, or coding assistant. They iterate on prompts regularly. At some point a new version ships, and response quality drops — but nobody knows when the change happened, which version caused it, or for which categories of input the regression is worst.

With Langfuse, production traces link to the prompt version that generated them. If quality drops, the team filters traces by version, compares evaluation scores across versions, and identifies which change caused the regression. They point the production label back at the previous version from the dashboard, and the application picks it up the next time it fetches the prompt, with no code deploy required.
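The comparison step in that workflow can be sketched in a few lines, assuming traces have already been exported as plain records. The record shape, the sample scores, and the `max_drop` threshold are assumptions made for illustration; they are not a Langfuse export format.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical exported traces: one evaluation score per trace,
# tagged with the prompt version that produced the response.
traces = [
    {"prompt_version": 3, "score": 1.0},
    {"prompt_version": 3, "score": 0.75},
    {"prompt_version": 4, "score": 0.5},
    {"prompt_version": 4, "score": 0.25},
]


def mean_score_by_version(rows: List[dict]) -> Dict[int, float]:
    """Group scores by prompt version and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prompt_version"]].append(row["score"])
    return {version: mean(scores) for version, scores in sorted(grouped.items())}


def first_regression(scores: Dict[int, float], max_drop: float = 0.1) -> Optional[int]:
    """Return the first version whose mean score fell more than `max_drop`
    below the version before it, or None if no such drop exists."""
    versions = sorted(scores)
    for previous, current in zip(versions, versions[1:]):
        if scores[previous] - scores[current] > max_drop:
            return current
    return None


scores = mean_score_by_version(traces)
regressed = first_regression(scores)
print(scores, regressed)
```

A real investigation would also slice by input category, since a regression is often concentrated in one kind of request rather than spread evenly.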

A second pattern is structured evaluation before shipping: after editing a prompt, the team runs a test set through it and collects automated scores alongside any human annotations. This builds a quantitative record of how prompt quality changes over time, version by version, before the change ever touches production users.
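A pre-release evaluation run can be sketched as a loop over a test set with rule-based checks. Everything here is a stand-in: `fake_model` replaces a real LLM call so the example runs offline, and the template, test cases, and check functions are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical prompt template under evaluation.
PROMPT_V2 = "Summarize the message in one sentence:\n{text}"


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


def keyword_check(output: str, case: dict) -> float:
    """1.0 if every required term appears in the output, else 0.0."""
    found = all(term in output.lower() for term in case["must_include"])
    return 1.0 if found else 0.0


def length_check(output: str, case: dict) -> float:
    """1.0 if the output is at most 10 words long, else 0.0."""
    return 1.0 if len(output.split()) <= 10 else 0.0


def run_eval(
    template: str,
    generate: Callable[[str], str],
    test_set: List[dict],
    checks: Dict[str, Callable[[str, dict], float]],
) -> Dict[str, float]:
    """Run every test case through the prompt and average each check's score."""
    per_check: Dict[str, List[float]] = {name: [] for name in checks}
    for case in test_set:
        output = generate(template.format(text=case["text"]))
        for name, check in checks.items():
            per_check[name].append(check(output, case))
    return {name: mean(values) for name, values in per_check.items()}


test_set = [
    {"text": "Refund issued within 5 days", "must_include": ["refund"]},
    {"text": "Order shipped", "must_include": ["delivery"]},
]

report = run_eval(
    PROMPT_V2, fake_model, test_set,
    {"keywords": keyword_check, "length": length_check},
)
print(report)
```

Storing one such report per prompt version is what turns ad hoc spot checks into the version-by-version record described above. An LLM-as-judge or human annotation score would slot in as another entry in `checks`.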

Pro Tip: Connect Langfuse before you need it, not after a quality incident. Teams that add observability retroactively can spend hours reconstructing which prompt was live at which time. Added from the start, that history is automatic and searchable.

When to Use / When Not

Scenario                                                       | Use              | Avoid
Team iterates on prompts in production weekly                  | ✅               |
Solo developer running one-off local scripts                   |                  | ❌
Multi-step LLM pipeline needing a full execution trace         | ✅               |
Sensitive data that cannot leave your infrastructure           | ✅ (self-hosted) |
Single static prompt with no versioning needed                 |                  | ❌
Running structured evals before deploying a new prompt version | ✅               |

Common Misconception

Myth: Langfuse is a logging tool. You add it when you want to see what the model returns, and that’s the main value.

Reality: Logging is one layer. The prompt management system is separate and central: prompts live in Langfuse as versioned, deployable assets that applications fetch at runtime. The traces then link back to the exact prompt version that generated each response, making observability and version control inseparable rather than two separate concerns bolted together.

One Sentence to Remember

Langfuse ties together the three pieces that usually live in separate tools — prompt storage, execution tracing, and quality evaluation — so that changing a prompt and measuring its effect happens in the same place, with a full version history connecting every change to its outcome.

FAQ

Q: Does Langfuse work only in production, or can you use it during prompt development too? A: Both. You can run Langfuse locally during development to trace calls and test prompt versions, then carry the same instrumentation into production, typically changing only connection settings such as the host and API keys.

Q: How does Langfuse let you switch prompt versions without a code deploy? A: Prompts are stored in Langfuse rather than your codebase. The application fetches the active version at runtime via the SDK, so switching versions is a dashboard action, not a deployment.

Q: Does Langfuse work with different LLM providers, or only one? A: Langfuse uses provider-agnostic tracing through SDKs and integrations, so it works with models from multiple providers and frameworks without requiring provider-specific configuration per model.
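The runtime-fetch answer above can be illustrated with a small in-memory model of a labeled prompt registry. `PromptRegistry`, its method names, and the `summarizer` prompt are hypothetical stand-ins written for this sketch; they are not the Langfuse SDK, whose actual calls should be taken from its documentation.

```python
from typing import Dict, List, Tuple


class PromptRegistry:
    """In-memory stand-in for a central prompt store (hypothetical, not the Langfuse SDK)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}      # name -> templates; version = index + 1
        self._labels: Dict[Tuple[str, str], int] = {}  # (name, label) -> version number

    def create_version(self, name: str, template: str) -> int:
        """Append a new immutable version and return its number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def set_label(self, name: str, label: str, version: int) -> None:
        """Point a label such as 'production' at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> Tuple[int, str]:
        """Return (version, template) for whatever the label currently points at."""
        version = self._labels[(name, label)]
        return version, self._versions[name][version - 1]


def build_prompt(registry: PromptRegistry, topic: str) -> Tuple[int, str]:
    """Application code: fetches the live prompt at call time, never hard-codes it."""
    version, template = registry.get("summarizer")
    return version, template.format(topic=topic)


registry = PromptRegistry()
v1 = registry.create_version("summarizer", "Summarize {topic} briefly.")
v2 = registry.create_version("summarizer", "Summarize {topic} in three bullet points.")

registry.set_label("summarizer", "production", v2)   # release version 2
after_release = build_prompt(registry, "the incident report")

registry.set_label("summarizer", "production", v1)   # roll back: a label move only
after_rollback = build_prompt(registry, "the incident report")

print(after_release)
print(after_rollback)
```

`build_prompt` is identical before and after the rollback; only the label moved. Returning the version number alongside the template is what allows the caller to attach it to the trace for that request.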

Expert Takes

Langfuse addresses a structural problem in LLM development: the prompt is both configuration and code, yet traditional software engineering separates these cleanly. By treating prompts as versioned, deployable assets linked to execution traces, Langfuse applies observability discipline to a layer that typically has none. The trace-to-prompt-version link is what closes the feedback loop — without it, quality measurements float detached from their cause.

What makes Langfuse fit neatly into a production architecture is the runtime fetch model. Prompts are pulled from the registry at call time, not baked into the deployment artifact. That single decision separates prompt iteration speed from deployment cadence — teams can run multiple prompt experiments in a day without touching CI. The evaluation hooks integrate cleanly with existing test infrastructure if you build them correctly from the start.

The AI tooling space has a graveyard of logging SDKs nobody used because they only told you what happened, not why it was different from yesterday. Langfuse wins because version linkage turns logs into a before/after comparison. That’s the minimum viable feature for anyone serious about production AI. Teams that treat prompt changes like configuration changes — not code changes — will outship everyone treating them like mysteries.

Any tool that stores prompts centrally and links them to model outputs, by design, creates a record of what instructions the model received. That record has value for auditing and accountability — but also creates a surveillance surface. Who can see which prompts ran against which user inputs? In regulated contexts, that question is as important as the engineering one. Langfuse’s self-hosted path is the answer for teams who can’t export that data.