LongMemEval
Also known as: Long Memory Evaluation, long-mem-eval, LongMemEval benchmark
LongMemEval is an open-source benchmark that evaluates long-term interactive memory in chat assistants and AI agents. It scores systems across five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention) using 500 manually written questions over multi-session conversations.
What It Is
When you ship a chatbot that only answers single questions, you can test it with single-question benchmarks. Once you ship an agent that’s supposed to remember what you told it last week — your name, your project, the file you mentioned in passing — that test stops working. LongMemEval was built to fill that gap. It is the benchmark that vendors of memory-augmented agents quote when claiming their system actually remembers.
Instead of checking memory with one combined score, LongMemEval breaks the problem into five abilities. Information extraction asks whether the agent can pull a specific fact from an earlier session. Multi-session reasoning asks whether it can stitch facts together across separate conversations. Temporal reasoning asks whether it understands when things happened relative to each other. Knowledge updates asks whether it correctly overwrites old information when the user changes their mind. Abstention asks whether it refuses to answer when the relevant information was never given in the first place.
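The sketch below shows what one question instance might look like once loaded: a question, a gold answer, an ability label, and the multi-session history holding the evidence. The field names are illustrative, not the official dataset schema.

```python
from dataclasses import dataclass

# Illustrative only (not the official LongMemEval schema): the five ability labels
# and one benchmark-style question instance.
ABILITIES = [
    "information_extraction",
    "multi_session_reasoning",
    "temporal_reasoning",
    "knowledge_updates",
    "abstention",
]

@dataclass
class MemoryQuestion:
    question: str                # asked after the full conversation history
    answer: str                  # gold answer (abstention items expect a refusal)
    ability: str                 # one of ABILITIES
    sessions: list[list[dict]]   # multi-session history; each session is a list of turns

example = MemoryQuestion(
    question="Which city did I say my sister moved to?",
    answer="Lisbon",
    ability="information_extraction",
    sessions=[[
        {"role": "user", "content": "My sister just moved to Lisbon."},
        {"role": "assistant", "content": "That sounds exciting!"},
    ]],
)
```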
The test runs across 500 manually written questions on multi-session conversation logs. According to the LongMemEval paper, commercial chat assistants and long-context language models showed roughly a 30% accuracy drop when forced to retrieve answers from interactive history compared to reading the answer-bearing chunk directly; GPT-4o, for example, reached around 92% in the offline answer-only setting and dropped to about 58% in the full interactive setting. That gap is the main reason memory-augmented architectures exist as a product category: loading the entire history into a long context window is not enough.
How It’s Used in Practice
You will most often encounter LongMemEval as a number on a vendor pricing page or comparison chart. Companies offering agent-memory layers — Mem0, Zep, Supermemory, and others — publish LongMemEval scores to position their product against the rest of the category. According to the LongMemEval GitHub, the benchmark and its data are open source, so anyone can run it on their own stack. In practice, buyers read the vendor’s published number and assume the methodology was honest.
That assumption is the weak link. Vendors choose which split to run, whether to use the official judge model, and whether to break out the per-ability scores or just publish a single headline figure. Two systems with the same overall LongMemEval score can have very different shapes underneath — one strong at extraction and weak at abstention, the other balanced. For someone evaluating agent-memory vendors, the per-ability table tells you more than the top-line score.
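If you have raw per-question results, from your own run or from a vendor willing to share them, producing that per-ability table is a small grouping step. A minimal sketch in Python, assuming each graded result carries hypothetical "ability" and "correct" fields:

```python
from collections import defaultdict

# Minimal sketch: group graded results by ability and report accuracy per ability.
# Assumes each result dict has hypothetical "ability" and "correct" fields.
def per_ability_breakdown(results: list[dict]) -> dict[str, float]:
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["ability"]] += 1
        correct[r["ability"]] += int(r["correct"])
    return {ability: correct[ability] / totals[ability] for ability in totals}

# A headline number hides the shape; the breakdown exposes it.
results = [
    {"ability": "information_extraction", "correct": True},
    {"ability": "abstention", "correct": False},
    {"ability": "abstention", "correct": True},
]
print(per_ability_breakdown(results))
# {'information_extraction': 1.0, 'abstention': 0.5}
```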
Pro Tip: If a memory vendor cites a LongMemEval score, ask for the per-ability breakdown. A system that scores well on extraction but poorly on abstention will hallucinate confidently when it lacks the answer — exactly the behavior you don’t want in a memory layer your users depend on.
When to Use / When Not
| Scenario | Use it? |
|---|---|
| Comparing agent-memory vendors on a shared task | ✅ |
| Stress-testing your own retrieval pipeline before launch | ✅ |
| Predicting end-user satisfaction with a memory feature | ❌ |
| Tracking memory-quality regressions across releases | ✅ |
| Validating that an LLM is “aware” of past conversations | ❌ |
| Estimating privacy or surveillance risk of stored memory | ❌ |
Common Misconception
Myth: A high LongMemEval score means an agent will reliably remember anything a user tells it in production. Reality: LongMemEval tests five specific abilities on a fixed set of curated multi-session conversations. Real users send weirder messages, change topics in messier ways, and contradict themselves more often. A high score is necessary, not sufficient — and a single headline number can hide a system that fails badly at one of the five abilities.
One Sentence to Remember
LongMemEval is the closest thing the agent-memory category has to a shared scorecard — read the per-ability breakdown, not the headline number, before you trust any vendor’s claim that their system remembers.
FAQ
Q: Who created LongMemEval? A: LongMemEval was introduced in October 2024 by Wu et al. in arXiv 2410.10813, with code and data released on GitHub. It was published at ICLR 2025 and updated in March 2025.
Q: Is LongMemEval the same as LoCoMo? A: No. Both benchmark long-term memory in agents, but they use different conversation sets, different question styles, and different ability decompositions. Many vendors now publish scores on both for credibility.
Q: Can I run LongMemEval on my own agent? A: Yes. The dataset and evaluation code are open source on GitHub. You need an LLM judge for grading, and you should report the per-ability scores so your results are comparable to published numbers.
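As a rough illustration of that grading step, here is a minimal LLM-as-judge loop in Python. The prompt wording and the `call_llm` placeholder are assumptions for the sketch, not the repository's evaluation script; use the official scripts when you need numbers comparable to published results.

```python
# Rough sketch of LLM-as-judge grading (not the official LongMemEval script).
# `call_llm` is a placeholder for whatever chat-completion client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge(question: str, gold: str, prediction: str) -> bool:
    prompt = (
        "You are grading a memory benchmark answer.\n"
        f"Question: {question}\nGold answer: {gold}\nModel answer: {prediction}\n"
        "Reply with exactly 'yes' if the model answer is correct, otherwise 'no'."
    )
    return call_llm(prompt).strip().lower().startswith("yes")

def grade_all(items: list[dict]) -> list[dict]:
    # Each item is assumed to carry question/gold/prediction/ability fields,
    # so the output can feed a per-ability breakdown.
    return [
        {**item, "correct": judge(item["question"], item["gold"], item["prediction"])}
        for item in items
    ]
```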
Sources
- LongMemEval paper: LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory - Original benchmark paper introducing the five-ability framework and 500-question test set, published at ICLR 2025.
- LongMemEval GitHub: xiaowu0162/LongMemEval - Official open-source repository with the dataset, evaluation scripts, and pointers to current results.
Expert Takes
Not one number. Five separate abilities. LongMemEval frames memory as extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. That decomposition matters scientifically — a model can ace recall and still fail at telling you when it doesn’t know. The benchmark forces those failure modes into the open instead of letting a single score hide them. Treat the per-ability breakdown, not the headline figure, as the real signal of what an agent actually remembers.
When you wire an agent to a memory layer, you need a contract that defines what “remembered” means. LongMemEval gives you that contract by specification: five abilities, published splits, public questions, public judge logic. If a memory vendor publishes a score, you can ask which abilities they passed and which they skipped. That turns vague memory promises into something an integration spec can actually pin down. Without a shared scorecard, you’re buying mood.
Every serious agent-memory startup now quotes a LongMemEval number on its homepage. That alone tells you the benchmark already became the buying scorecard for the category. The vendors who refuse to publish are signaling something. The ones who publish a headline score and quietly skip the per-ability table are signaling something else. Read the table. Read the silence. Then decide who you trust with your users’ history.
A benchmark measures what it measures. LongMemEval measures whether an assistant can answer questions about your past sessions. It does not measure whether the assistant should be holding those past sessions at all, what happens when that memory is subpoenaed, sold, or leaked, or who reviewed the user’s consent before logging began. The score tells you the system can remember. It does not tell you the system was ever asked to forget.