LoCoMo Benchmark
Also known as: LoCoMo, Long Conversational Memory benchmark, LoCoMo dataset
LoCoMo (Long Conversational Memory) is a 2024 benchmark from Snap Research and UNC that tests whether AI agents can recall and reason over very long, multi-session conversations spanning hundreds of turns and thousands of tokens. It evaluates agents through question answering, event summarization, and multimodal dialogue tasks.
What It Is
Large language models look smart inside a single chat, but most forget everything the moment a session ends. That gap is what LoCoMo — short for Long Conversational Memory — was built to expose. If you are evaluating an agent that needs to remember a user’s preferences, project history, or last week’s decisions, you need a way to test memory across days and sessions, not just inside a single prompt. LoCoMo is the standardized way the industry now answers that question.
According to the LoCoMo paper on arXiv, the benchmark was introduced by Maharana, Lee, Tulyakov, Bansal, Barbieri, and Fang (Snap Research and UNC) at ACL 2024. The dataset contains a small number of carefully constructed multi-session conversations between two personas, each session taking place at a different point in time. According to the LoCoMo project page, conversations average roughly three hundred turns and around nine thousand tokens, with extended variants reaching about six hundred turns and twenty-six thousand tokens.
The interesting part is what the agent is asked to do with those conversations. LoCoMo defines three task families — question answering, event summarization, and multi-modal dialogue generation — and four question categories: single-hop (“what did Alice say about her job?”), multi-hop (“did Alice change jobs after she moved cities?”), temporal reasoning (“when did Bob first mention his sister?”), and open-domain commonsense inference. The dataset and scoring code are public on the snap-research/locomo repository, which is why nearly every agent memory vendor uses LoCoMo as a comparison point.
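To make the task structure concrete, here is a minimal sketch of what one LoCoMo-style QA record might look like. The field names and category labels are illustrative assumptions for this article, not the exact schema shipped in the snap-research/locomo repository; check the repo's data files for the real format.

```python
# Illustrative sketch of a LoCoMo-style QA record. Field names and
# category labels are assumptions, not the repository's actual schema.
from dataclasses import dataclass

@dataclass
class LocomoQA:
    question: str        # posed after the full multi-session conversation
    answer: str          # gold answer grounded in earlier sessions
    category: str        # e.g. "single-hop", "multi-hop", "temporal", "open-domain"
    evidence: list[str]  # pointers to the dialogue turns that support the answer

sample = LocomoQA(
    question="Did Alice change jobs after she moved cities?",
    answer="Yes",
    category="multi-hop",
    evidence=["session2:turn14", "session5:turn3"],
)
```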
How It’s Used in Practice
If you are choosing or building an agent memory system, LoCoMo is the score you will see first. Vendors like Mem0, Zep, Letta, ByteRover, and Supermemory all publish LoCoMo numbers on their landing pages and benchmark blogs because procurement teams and engineers Google the term before they Google the product. A higher LoCoMo score signals that the system can recall facts across sessions, follow temporal references, and chain multiple memories into a single answer.
In practice, the workflow looks like this: a memory provider runs the LoCoMo conversations through their stack, lets the agent answer the benchmark’s question set, and reports an overall accuracy plus per-category breakdowns. The score gets cited in vendor decks, comparison posts, and procurement evaluations.
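Reduced to its skeleton, that harness might look like the sketch below, reusing the `LocomoQA` shape from the earlier example. Both `run_agent` and `judge_answer` are hypothetical stand-ins: the first represents the vendor's memory stack answering a question, and the second represents the grading step, which in real harnesses is usually an LLM judge rather than string matching.

```python
from collections import defaultdict

def evaluate(qa_items, run_agent, judge_answer):
    """Score an agent on LoCoMo-style QA with per-category breakdowns.

    run_agent(question) -> str        # hypothetical: queries the memory stack
    judge_answer(pred, gold) -> bool  # hypothetical: often an LLM judge in practice
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in qa_items:
        prediction = run_agent(item.question)
        total[item.category] += 1
        if judge_answer(prediction, item.answer):
            correct[item.category] += 1
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category
```

The per-category breakdown is the part worth reading closely: two systems with the same overall accuracy can differ sharply on temporal and multi-hop questions, which is where long-term memory actually gets exercised.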
Pro Tip: Treat LoCoMo numbers as a starting filter, not a buying decision. The scores you see in vendor blog posts come from each vendor’s own infrastructure, often with different judge models grading the answers and different assumptions about whether the agent gets conversation hints. Before you compare two systems, check the methodology section — same model evaluating both, same category mix, same hint conditions. If the methodology is not published, the number is marketing.
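One way to operationalize that check is to record the evaluation conditions alongside every score and refuse to compare runs whose conditions differ. A minimal sketch, with field names invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConditions:
    judge_model: str      # which model graded the answers
    hints_provided: bool  # did the agent receive conversation hints?
    categories: frozenset # which question categories were scored

def comparable(a: EvalConditions, b: EvalConditions) -> bool:
    # Two LoCoMo scores are only comparable if every condition matches.
    return a == b

run_a = EvalConditions(judge_model="gpt-4o", hints_provided=False,
                       categories=frozenset({"single-hop", "multi-hop", "temporal"}))
run_b = EvalConditions(judge_model="gpt-4o", hints_provided=True,
                       categories=frozenset({"single-hop", "multi-hop", "temporal"}))
assert not comparable(run_a, run_b)  # different hint conditions: not comparable
```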
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Comparing memory systems on multi-session recall | ✅ | |
| Testing a chatbot that only handles one conversation at a time | | ❌ |
| Evaluating temporal reasoning across past sessions | ✅ | |
| Measuring single-shot QA accuracy on factual questions | | ❌ |
| Stress-testing event summarization across long histories | ✅ | |
| Picking a memory vendor based on a single headline number | | ❌ |
Common Misconception
Myth: A high LoCoMo score means the system has solved long-term agent memory. Reality: As of 2026, the high LoCoMo scores you see in vendor blogs are self-reports run on the vendor’s own infrastructure with the vendor’s own judge model. According to the Mem0 Blog, headline numbers on the leaderboard now span a wide range, and several vendors — Supermemory among them — argue the benchmark is saturating because conversations are short relative to today’s models. Treat LoCoMo as one signal among several, not a verdict.
One Sentence to Remember
LoCoMo is the benchmark that gave the agent memory market a common scoreboard — useful as a first filter, dangerous as a final answer, and increasingly sharing the stage with newer evaluations like LongMemEval.
FAQ
Q: What does LoCoMo stand for? A: Long Conversational Memory. It is a benchmark dataset for evaluating whether AI agents can recall and reason over conversations that span many sessions and hundreds of turns, not just a single chat.
Q: Who created the LoCoMo benchmark? A: According to the LoCoMo paper on arXiv, it was introduced at ACL 2024 by Maharana, Lee, Tulyakov, Bansal, Barbieri, and Fang from Snap Research and UNC, with the dataset and code on the snap-research/locomo GitHub repository.
Q: Is LoCoMo still the gold standard for agent memory in 2026? A: It remains the most-cited benchmark, but vendors increasingly call it saturating. Successor evaluations like LongMemEval and Locomo-Plus now appear alongside LoCoMo in serious agent memory comparisons.
Sources
- LoCoMo paper on arXiv: Evaluating Very Long-Term Conversational Memory of LLM Agents - the original ACL 2024 paper introducing the benchmark, dataset, and scoring methodology.
- LoCoMo project page: snap-research.github.io/locomo - dataset description, conversation statistics, and links to the public repository.
Expert Takes
The benchmark probes capabilities older datasets couldn’t see — multi-hop reasoning across days, temporal grounding, and event summarization across sessions. Long-term memory isn’t a longer context window. It’s a different problem. The risk is that judge-model evaluation introduces variance the score doesn’t show. A higher number on the leaderboard isn’t always a better memory system underneath, and any responsible reading of LoCoMo treats the methodology section as part of the result.
Read LoCoMo numbers as a starting filter, not a buying signal. Vendors report scores from their own runs, with their own grading rubrics, on their own infrastructure. Before trusting any number, ask three questions: which model evaluated the answers, were conversation hints provided to the agent, and which question categories were included. A clean specification for “what counts as a correct memory recall” matters more than the headline score, and most vendor blogs skip that part.
The leaderboard race is real, loud, and strategically important. Every agent memory vendor wants to be on top of LoCoMo because procurement teams check it before they check architecture. The actual signal isn’t the rank — it’s which vendors publish methodology and which only publish numbers. The benchmark has become a credentialing layer for the agent memory market, even as researchers warn it is saturating. That tension shapes the next twelve months of vendor positioning.
A benchmark becomes a power structure once an industry agrees it matters. LoCoMo now decides which memory systems get funded, covered, and bought — based on test conditions designed before today’s agents existed. Who audits the judge models? Who reproduces the headline scores? When the public learns about “the best memory” through vendor self-reports rather than independent replications, the leaderboard stops measuring science and starts measuring marketing budgets.