Benchmarking
memd ships four benchmark families. They exercise different parts of the
system, so their numbers should not be compared without the workload context.
The cross-system LoCoMo retrieval result is the public headline; the others
are internal-corpus or scoped checks.
Quick runs
# Task-memory benchmark (internal corpus)
./evals/bench/scripts/run_task_memory_benchmark.sh
# Offline retrieval benchmark (BEIR fiqa + scidocs)
./evals/bench/scripts/run_offline_retrieval_benchmark.sh
Task memory (internal corpus)
Recommended local execution modes:
| Lane | Retrieval setup | Hit@3 | MRR | Avg search latency |
|---|---|---|---|---|
| cli_warm | private warm worker | 1.00 | 0.87 | 9.7 ms |
| cli_batch | streaming JSONL in one loaded process | 1.00 | 0.87 | 0.6 ms |
The same report includes a flattened-chunk baseline with Hit@3 = 1.00 and
MRR = 0.98. The structured mode writes more retrieval projections, but
warm and batch execution keep interactive retrieval latency low. The raw
benchmark artifact also retains a startup-overhead diagnostic lane for
reproducibility; the public summary focuses on the two modes agents
should normally use.
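The batch lane's low latency comes from paying the load cost once and then answering many queries in the same process. The sketch below illustrates that pattern only; it is not memd's CLI. The record shape (`id`, `query`), `run_batch`, and `make_toy_search` are all hypothetical names introduced for this example.

```python
import json
from typing import Callable, Iterable, Iterator


def run_batch(
    lines: Iterable[str],
    search: Callable[[str, int], list[str]],
    k: int = 3,
) -> Iterator[str]:
    """Answer one JSON query per input line, reusing an already-loaded search function."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        request = json.loads(line)
        hits = search(request["query"], k)
        yield json.dumps({"id": request.get("id"), "hits": hits})


def make_toy_search(docs: dict[str, str]) -> Callable[[str, int], list[str]]:
    """Hypothetical stand-in for a loaded index: rank docs by shared lowercase tokens."""
    # The "expensive" setup happens once here, not once per query.
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    def search(query: str, k: int) -> list[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & toks), doc_id) for doc_id, toks in tokenized.items()]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [doc_id for _, doc_id in ranked[:k]]

    return search


if __name__ == "__main__":
    search = make_toy_search({"a": "retry flaky network test", "b": "pin dependency versions"})
    for out in run_batch(['{"id": 1, "query": "flaky test"}'], search):
        print(out)
```

In a real run the `lines` argument would typically be `sys.stdin`, so a caller can stream queries without restarting the process.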
Full report: Task-memory benchmark report.
Bright-Pro scoped adapter (biology q5/d141)
| Method | alpha-nDCG@25 | Recall@25 | Search time |
|---|---|---|---|
| BM25 subset | 0.77393 | 0.81111 | not separately timed |
| SuperLocalMemory Mode A | 0.78406 | 0.85333 | 31.713 s total, 6.343 s/query |
| memd first search | 0.87035 | 0.98000 | 42.521 s total, 8.504 s/query |
| memd repeat search | 0.87035 | 0.98000 | 33.260 s total, 6.652 s/query |
| memd + MemReranker-4B | 0.90409 | 1.00000 | +92.987 s rerank |
The Bright-Pro result is a scoped gold-plus-decoy adapter check, not a full-corpus benchmark. It uses 5 biology queries, 41 gold documents, and 100 decoys. Repeat search is the fairer retrieval-speed number because it excludes fresh indexing and reuses the already-built store.
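For readers unfamiliar with the two quality columns, the following is a minimal sketch of Recall@k and alpha-nDCG@k as they are commonly defined. It assumes each gold document is labeled with a set of aspect IDs and that decoys carry no aspects; the exact judgment format and alpha value used by the Bright-Pro adapter are not stated here, so `aspects`, `alpha=0.5`, and all function names are illustrative assumptions. The ideal ranking is built greedily, which is the usual approximation.

```python
import math


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def marginal_gain(doc_aspects: set[str], seen: dict[str, int], alpha: float) -> float:
    """Novelty-discounted gain: each repeat of an aspect is worth (1 - alpha) times less."""
    return sum((1.0 - alpha) ** seen.get(a, 0) for a in doc_aspects)


def alpha_dcg(ranked: list[str], aspects: dict[str, set[str]], alpha: float, k: int) -> float:
    seen: dict[str, int] = {}
    total = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        doc_aspects = aspects.get(doc, set())  # assumption: decoys have no aspects
        total += marginal_gain(doc_aspects, seen, alpha) / math.log2(rank + 1)
        for a in doc_aspects:
            seen[a] = seen.get(a, 0) + 1
    return total


def greedy_ideal(aspects: dict[str, set[str]], alpha: float, k: int) -> list[str]:
    """Greedy approximation of the ideal ranking (the exact optimum is intractable)."""
    remaining = sorted(aspects)  # sorted for deterministic tie-breaking
    seen: dict[str, int] = {}
    order: list[str] = []
    while remaining and len(order) < k:
        best = max(remaining, key=lambda d: marginal_gain(aspects[d], seen, alpha))
        order.append(best)
        remaining.remove(best)
        for a in aspects[best]:
            seen[a] = seen.get(a, 0) + 1
    return order


def alpha_ndcg(ranked: list[str], aspects: dict[str, set[str]], alpha: float = 0.5, k: int = 25) -> float:
    ideal = alpha_dcg(greedy_ideal(aspects, alpha, k), aspects, alpha, k)
    return alpha_dcg(ranked, aspects, alpha, k) / ideal if ideal > 0 else 0.0
```

Unlike plain nDCG, this metric penalizes a ranking that returns several documents covering the same aspect before covering a new one, which is why it can move independently of Recall@25.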
Multi-turn agent benchmark
| Interface | Main purpose | Result summary |
|---|---|---|
| agent-context prefetch | bounded context before the agent starts | retrieved 10/10 expected priors in the full suite5 CLI-prefetch run |
| CLI search | retrieval by shell command during the solve | strongest token condition in the interface comparison, but slower for agents |
| Warm and batch execution | reduce local retrieval overhead | preserve retrieval quality while avoiding repeated startup costs |
Raw artifacts
Cross-system retrieval (LoCoMo)
This is a direct retrieval benchmark on the upstream locomo10.json: each
system is seeded with the same conversation turns and scored against LoCoMo
evidence IDs (MRR@10 over categories 1–4: 10 conversations, 5,882 turns,
1,536 queries).
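As a reference for how such scores are typically computed, here is a minimal sketch of MRR@10 and Hit@k over ranked turn IDs. It assumes a query counts as answered at the rank of the first retrieved ID that appears in its evidence set; how the actual harness treats queries with several evidence IDs is not stated here, and the function names are hypothetical.

```python
def reciprocal_rank(ranked: list[str], evidence: set[str], k: int = 10) -> float:
    """1/rank of the first evidence ID in the top k, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in evidence:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: list[str], evidence: set[str], k: int) -> float:
    """1.0 if any evidence ID appears in the top k, else 0.0."""
    return 1.0 if any(item in evidence for item in ranked[:k]) else 0.0


def summarize(runs: list[tuple[list[str], set[str]]], ks: tuple[int, ...] = (1, 3, 10)) -> dict[str, float]:
    """Average the per-query scores; each run is (ranked IDs, evidence IDs)."""
    if not runs:
        raise ValueError("need at least one query")
    n = len(runs)
    report = {"MRR@10": sum(reciprocal_rank(r, e, 10) for r, e in runs) / n}
    for k in ks:
        report[f"Hit@{k}"] = sum(hit_at_k(r, e, k) for r, e in runs) / n
    return report


if __name__ == "__main__":
    # Two toy queries: evidence found at rank 2, and evidence missed entirely.
    print(summarize([(["t3", "t1"], {"t1"}), (["t9"], {"t2"})]))
```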
| System | MRR@10 | Hit@1 | Hit@3 | Hit@10 | Avg search | Seed |
|---|---|---|---|---|---|---|
| memd v0.50.0 | 0.420 | 0.322 | 0.490 | 0.621 | 26.7 ms | 108 s |
| superlocalmemory v3.4.46 (lexical) | 0.369 | 0.245 | 0.469 | 0.599 | 804.5 ms | 1.8 s |
| mem0 v2.0.2 (LLM-extracted) | 0.354 | 0.255 | 0.412 | 0.591 | 40.9 ms | 13,424 s |
memd wins on every quality metric (+14% MRR@10 vs SuperLocalMemory, +19%
vs Mem0) and is the fastest at search. The trade-off is seeding cost: memd
takes 108 s to seed, against 1.8 s for SuperLocalMemory (no embeddings in
this configuration) and 13,424 s for Mem0 (LLM extraction).
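The relative figures quoted above follow directly from the MRR@10 column. A small sketch of the arithmetic, using only values from the table (the helper name `relative_gain` is an illustrative choice):

```python
def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement of value over baseline, e.g. 0.14 for +14%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# MRR@10 values copied from the table above.
MRR_AT_10 = {"memd": 0.420, "superlocalmemory": 0.369, "mem0": 0.354}

if __name__ == "__main__":
    for name in ("superlocalmemory", "mem0"):
        gain = relative_gain(MRR_AT_10["memd"], MRR_AT_10[name])
        print(f"memd vs {name}: {gain:+.1%}")
```

The unrounded gains are +13.8% and +18.6%, which round to the +14% and +19% stated in the text.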
Per-category
memd wins all four LoCoMo categories.
| Category | Description | memd | mem0 | slm |
|---|---|---|---|---|
| 1 | multi-hop | 0.359 | 0.292 | 0.259 |
| 2 | specific facts | 0.513 | 0.390 | 0.433 |
| 3 | open-domain | 0.279 | 0.255 | 0.227 |
| 4 | long-form | 0.421 | 0.372 | 0.397 |
Three design philosophies
- memd: chunk-native dense + sparse hybrid retrieval. No LLM extraction during seed, no LLM rerank during search.
- mem0: LLM-extracts memory units from raw turns (here using a local vLLM gemma4-31b endpoint), then vector-searches over the extracted memories.
- superlocalmemory: atomic-fact graph with Fisher-Rao retrieval. Reported here in the lexical-only fallback because the published Mode A 74.8% MRR@10 number was not reproducible in our workspace; SLM's subprocess embedding-worker singleton deadlocked under the LoCoMo workload. The lexical result (0.369) does match prior independent fallback runs in this workspace (0.368), so the configuration itself is reproducible.
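To make "dense + sparse hybrid" concrete, the sketch below merges two ranked lists with reciprocal rank fusion (RRF). This is a generic illustration of one common fusion technique, not a description of memd's internals: how memd actually combines its dense and sparse signals is not specified here, and the function name, the `k=60` constant, and the chunk IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 10,
) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


if __name__ == "__main__":
    dense = ["c1", "c2", "c3"]   # hypothetical embedding-similarity ranking
    sparse = ["c3", "c1", "c4"]  # hypothetical lexical (BM25-style) ranking
    print(reciprocal_rank_fusion([dense, sparse]))
```

Rank-based fusion avoids having to calibrate dense similarity scores against lexical scores, at the cost of ignoring score magnitudes.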
Reproducibility
The full benchmark harness lives in the sibling memory-benchmark
workspace (separate repo). A self-contained in-repo harness under
evals/benchmarks/locomo/ is in progress and will land here once the
SLM embedded mode is debugged upstream and the LLM choice for mem0
reproducers (OpenAI, Anthropic, or self-hosted vLLM) is documented.
Same-LLM caveat: mem0 numbers above use a self-hosted vLLM
gemma4-31b endpoint, not the GPT-4-class model the upstream Mem0
README benchmarks against. Numbers are directly comparable across
the three systems in this table but not directly comparable to the
upstream Mem0 leaderboard.