Scientific Task Memory Benchmark Results
This directory contains the tracked corpus and summary outputs for Phase 5 benchmarking of memd's task-oriented knowledge artifact layer.
Files
- `task_memory_benchmark_corpus.json` - structured benchmark cases and labeled queries
- `task_memory_benchmark_results.json` - machine-readable Phase 5 benchmark report
- `task_memory_benchmark_results.md` - human-readable summary report

Intentionally not tracked:
- `task_memory_benchmark_data/` - large generated benchmark databases, sparse indexes, and warm-index artifacts
- `task_memory_benchmark.log` - local benchmark log output
The results files are generated by:
To regenerate the omitted large artifacts locally:
What The Benchmark Measures
The benchmark compares two memd-native schema modes on the same knowledge:
- `chunk_baseline`
  - each case is flattened into generic memory chunks
  - retrieval uses `memory.search`
- `task_memory`
  - each case is seeded through the real `task.*` lifecycle
  - retrieval uses `task.search`
It now runs those modes across three local CLI execution lanes:
- `cli_cold` - one executable process per operation
- `cli_warm` - `memd warm start` plus `--warm required`, so calls reuse the private CLI worker
- `cli_batch` - `memd batch --jsonl - --stream`, so scripted calls run inside one loaded process
The warm worker is a private Unix-socket CLI acceleration layer. It is not HTTP and is not an agent-visible integration surface.
It can also run live external comparisons when compatible checkouts are available:
- `gpt54_live_cli`
- `gpt54_live_daemon`
- `gpt54_live_tantivy`
- `claude_live`
- `geminipro_live`
- `geminiultra_live`
External live systems are discovered relative to the repo, or can be supplied explicitly with:
- `--genesism-root`
- `--gpt54-root`
- `--claude-root`
- `--geminipro-root`
- `--geminiultra-root`
Optional reference numbers from GenesisM's earlier unified benchmark are imported when:
- `--genesism-reference-json` is provided, or
- `unified_benchmark_results.json` is found under the discovered GenesisM root
Those reference numbers are included for continuity, but they are not directly apples-to-apples because the earlier GenesisM memd benchmark predated memd's real task lifecycle.
Corpus Design
The current 2026-03-21.v2 corpus is intentionally harder than the initial Phase 5 draft.
It contains:
- 8 task cases
- 23 labeled queries
- 4 shared-project sibling groups
Each sibling group shares the same project scope, primary dataset, and tool family so project-scoped systems cannot separate tasks trivially.
| Project | Cases | Shared dataset | Shared tool |
|---|---|---|---|
| `phase5_auth_reliability` | `jwt_timezone_fix`, `jwt_refresh_grace_window` | `auth_logs@2026-03-21` | `cargo-test` |
| `phase5_regulator_screening` | `mmseqs_marker_search`, `mmseqs_sigma_factor_search` | `screen_counts@v3` | `mmseqs` |
| `phase5_event_bus_selection` | `kafka_queue_selection`, `nats_jetstream_selection` | `platform_requirements@adr-input-v2` | `benchmark-runner` |
| `phase5_repo_indexing` | `codebase_indexing`, `frontend_route_indexing` | `repository_snapshot@HEAD` | `index-codebase.sh` |
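
To make the sibling-group design concrete, the sketch below encodes the table as a plain Python dictionary and shows why a project-only filter cannot isolate a single task. The dictionary layout and the `project_scope_candidates` helper are illustrative assumptions; they are not the schema of `task_memory_benchmark_corpus.json`.

```python
# Sibling groups copied from the table above.
# The dict layout is an assumption for illustration, not the corpus file schema.
SIBLING_GROUPS = {
    "phase5_auth_reliability": {
        "cases": ["jwt_timezone_fix", "jwt_refresh_grace_window"],
        "dataset": "auth_logs@2026-03-21",
        "tool": "cargo-test",
    },
    "phase5_regulator_screening": {
        "cases": ["mmseqs_marker_search", "mmseqs_sigma_factor_search"],
        "dataset": "screen_counts@v3",
        "tool": "mmseqs",
    },
    "phase5_event_bus_selection": {
        "cases": ["kafka_queue_selection", "nats_jetstream_selection"],
        "dataset": "platform_requirements@adr-input-v2",
        "tool": "benchmark-runner",
    },
    "phase5_repo_indexing": {
        "cases": ["codebase_indexing", "frontend_route_indexing"],
        "dataset": "repository_snapshot@HEAD",
        "tool": "index-codebase.sh",
    },
}


def project_scope_candidates(project):
    """Hypothetical helper: cases that a project-only filter would return."""
    return list(SIBLING_GROUPS[project]["cases"])


total_cases = sum(len(group["cases"]) for group in SIBLING_GROUPS.values())

# Every project scope still contains more than one candidate task, so a
# system must use task-level signals to pick the right one.
ambiguous_projects = [
    project
    for project in SIBLING_GROUPS
    if len(project_scope_candidates(project)) > 1
]
```

With the table's data, `total_cases` is 8 and all 4 projects appear in `ambiguous_projects`, matching the counts listed above.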
The query set mixes:
- task lifecycle queries
- failed-attempt queries
- parameter lookup queries
- evidence lookup queries
- why-chosen queries
- paraphrased prompts that avoid copying exact benchmark wording
Metrics
The Phase 5 report includes:
- hit@3
- MRR
- average search latency
- p95 search latency
- freshness
- concurrency success rate
- concurrency operations per second
- seed time per benchmark mode
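
The two latency figures in this list can be sketched as follows. This is a minimal illustration that assumes per-query latencies are collected in milliseconds and that p95 uses the nearest-rank method; the percentile method the benchmark tool actually uses is not stated here, and the function names are hypothetical.

```python
def mean_latency(samples_ms):
    """Arithmetic mean of per-query latencies."""
    if not samples_ms:
        raise ValueError("no latency samples")
    return sum(samples_ms) / len(samples_ms)


def p95_latency(samples_ms):
    """95th percentile by nearest rank (an assumed method, see lead-in)."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Twenty synthetic samples: 1.0 ms through 20.0 ms.
samples = [float(i) for i in range(1, 21)]
average = mean_latency(samples)
p95 = p95_latency(samples)
```

Reporting both values is useful because a single slow outlier moves the average only slightly but can dominate the tail that p95 describes.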
For external systems with flatter schemas, the benchmark currently scores task-level recovery rather than memd-style facet-level recovery. That difference is important when interpreting quality numbers.
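
As a reference for reading the quality numbers, here is a minimal sketch of hit@3 and MRR at task level, where a query counts as recovered when the expected task appears in the ranked results. The function names and the `(ranked, relevant)` input shape are assumptions for illustration; the benchmark's own scoring code may differ, for example by truncating the ranking before computing MRR.

```python
def hit_at_k(ranked_ids, relevant_ids, k=3):
    """True if any relevant id appears in the top k results."""
    return any(item in relevant_ids for item in ranked_ids[:k])


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / position
    return 0.0


def summarize(results, k=3):
    """Aggregate hit@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(results)
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in results)
    mrr = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in results) / n
    return {"hit@k": hits / n, "mrr": mrr}


# Synthetic rankings using case names from the corpus table.
results = [
    (["jwt_refresh_grace_window", "jwt_timezone_fix", "codebase_indexing"],
     {"jwt_timezone_fix"}),          # sibling ranked first, target at rank 2
    (["kafka_queue_selection", "nats_jetstream_selection"],
     {"kafka_queue_selection"}),     # target at rank 1
    (["codebase_indexing", "mmseqs_marker_search",
      "kafka_queue_selection", "frontend_route_indexing"],
     {"frontend_route_indexing"}),   # target at rank 4, outside the top 3
]
summary = summarize(results)
```

The first query shows the effect the sibling groups are designed to provoke: the wrong sibling outranks the target, which still counts as a hit@3 but halves that query's reciprocal rank.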
memd-native Modes
The generated system names combine an execution lane with a schema mode, for
example memd_cli_warm_task_memory or memd_cli_batch_chunk_baseline.
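
The naming scheme can be sketched as the product of lanes and schema modes. The `memd_<lane>_<mode>` pattern below is inferred from the two example names above, and `system_name` is a hypothetical helper rather than a function from the benchmark tool.

```python
from itertools import product

# Lane and mode identifiers as named in this document.
LANES = ["cli_cold", "cli_warm", "cli_batch"]
SCHEMA_MODES = ["chunk_baseline", "task_memory"]


def system_name(lane, mode):
    """Build a system name; the pattern is inferred from the documented examples."""
    return f"memd_{lane}_{mode}"


# One generated system per (lane, mode) pair: 3 lanes x 2 modes.
system_names = [system_name(lane, mode) for lane, mode in product(LANES, SCHEMA_MODES)]
```

Under this assumption a run with all three lanes yields six memd-native systems, including `memd_cli_warm_task_memory` and `memd_cli_batch_chunk_baseline`.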
The chunk-baseline and task-memory modes use the same underlying task knowledge, but they seed and query it differently.
memd_chunk_baseline
- flattens each task into generic memory chunks
- writes fewer benchmark documents overall
- queries with `memory.search`
- does not preserve task lifecycle structure as first-class searchable artifacts
memd_task_memory
- seeds the real `task.*` lifecycle (`task.start`, `task.progress`, `task.run_start`, `task.run_finish`, `task.add_evidence`, `task.finish`)
- writes more projection artifacts because each lifecycle event becomes searchable task memory
- queries with `task.search`
- uses task-aware exact filters plus candidate reranking
The relative quality and latency should be read from the current generated report. The task-memory mode usually writes more projections and is expected to cost more at seed time, while current retrieval quality depends on both the structured task.search filters and the strength of the generic memory.search baseline.
How To Run
You can also run the Python tool directly:
python3 evals/bench/tools/task_memory_benchmark.py \
--memd-path target/release/memd \
--corpus docs/scientific-task-memory/benchmark-results/task_memory_benchmark_corpus.json \
--memd-lanes cli_cold cli_warm cli_batch \
--workers 1 \
--ops-per-worker 1
For a full side-by-side local run, make sure these are available:
- `target/debug/memd`
- a nearby GenesisM workspace containing `gpt54`, `claude`, `geminipro`, and `geminiultra`, or explicit `--*-root` arguments
- Python with `duckdb` installed for the external tool adapters
The runner now prints explicit stage progress and applies external CLI timeouts so long live-system legs are observable and bounded during reproduction.
Why This Benchmark Matters
The benchmark is meant to answer one question:
Does the task-oriented knowledge artifact schema materially help later agents recover the information that generic chunk search tends to lose?
Specifically:
- what worked
- what failed
- which parameters were used
- why a method was chosen
- what evidence supported the conclusion
That is the practical success criterion for Phase 5.