Scientific Task Memory Benchmark Results¶

This directory contains the tracked corpus and summary outputs for Phase 5 benchmarking of memd's task-oriented knowledge artifact layer.

Files¶

task_memory_benchmark_corpus.json
structured benchmark cases and labeled queries
task_memory_benchmark_results.json
machine-readable Phase 5 benchmark report
task_memory_benchmark_results.md
human-readable summary report

Intentionally not tracked:

task_memory_benchmark_data/
large generated benchmark databases, sparse indexes, and warm-index artifacts
task_memory_benchmark.log
local benchmark log output

The results files are generated by the benchmark runner script, which also regenerates the omitted large artifacts locally:

./evals/bench/scripts/run_task_memory_benchmark.sh

What The Benchmark Measures¶

The benchmark compares two memd-native schema modes on the same knowledge:

chunk_baseline
each case is flattened into generic memory chunks
retrieval uses memory.search
task_memory
each case is seeded through the real task.* lifecycle
retrieval uses task.search

It runs those modes across three local CLI execution lanes:

cli_cold
one executable process per operation
cli_warm
memd warm start, then --warm required on each call, so calls reuse the private CLI worker
cli_batch
memd batch --jsonl - --stream, so scripted calls run inside one loaded process

The warm worker is a private Unix-socket CLI acceleration layer. It is not HTTP and is not an agent-visible integration surface.
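As a rough sketch, the three lanes differ mainly in how the memd process is launched. The helper below only assembles argument lists from the flags named above. The position of `--warm required` relative to the operation arguments, the `lane_commands` helper, and the placeholder operation arguments are assumptions made for illustration, not the benchmark's actual code.

```python
from typing import Dict, List, Optional

MEMD = "target/release/memd"  # binary path as used in the direct Python invocation


def lane_commands(lane: str, op_args: List[str]) -> Dict[str, Optional[List[str]]]:
    """Return an optional one-time setup command and the per-call argv for a lane.

    Only the flags come from the lane descriptions; their placement is assumed.
    """
    if lane == "cli_cold":
        # A fresh executable process for every operation, no setup step.
        return {"setup": None, "argv": [MEMD, *op_args]}
    if lane == "cli_warm":
        # Start the private worker once, then require it on every call.
        return {
            "setup": [MEMD, "warm", "start"],
            "argv": [MEMD, "--warm", "required", *op_args],
        }
    if lane == "cli_batch":
        # One loaded process; scripted calls are fed as JSONL on stdin.
        return {"setup": None, "argv": [MEMD, "batch", "--jsonl", "-", "--stream"]}
    raise ValueError(f"unknown lane: {lane}")


# Placeholder operation arguments; the real memd call syntax is not shown here.
example = lane_commands("cli_warm", ["<operation>", "<args>"])
```

In the batch lane the operation arguments are not part of the argv at all, which is why that branch ignores `op_args`.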

It can also run live external comparisons when compatible checkouts are available:

gpt54_live_cli
gpt54_live_daemon
gpt54_live_tantivy
claude_live
geminipro_live
geminiultra_live

External live systems are discovered relative to the repo, or can be supplied explicitly with:

--genesism-root
--gpt54-root
--claude-root
--geminipro-root
--geminiultra-root

Optional reference numbers from GenesisM's earlier unified benchmark are imported when:

--genesism-reference-json is provided, or
unified_benchmark_results.json is found under the discovered GenesisM root

Those reference numbers are included for continuity, but they are not directly comparable because the earlier GenesisM memd benchmark predated memd's real task lifecycle.
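The lookup order described above can be sketched as follows. This is a minimal illustration that assumes the explicit flag wins over discovery and that "found under" means a recursive search. The function name `resolve_reference_json` and its arguments are hypothetical.

```python
from pathlib import Path
from typing import Optional

REFERENCE_FILENAME = "unified_benchmark_results.json"


def resolve_reference_json(
    explicit: Optional[str], genesism_root: Optional[str]
) -> Optional[Path]:
    """Pick the reference JSON: explicit path first, then discovery under the root."""
    if explicit:
        # Value passed via --genesism-reference-json.
        return Path(explicit)
    if genesism_root:
        root = Path(genesism_root)
        if root.is_dir():
            # Assumption: search recursively and take the first match in sorted order.
            matches = sorted(root.rglob(REFERENCE_FILENAME))
            if matches:
                return matches[0]
    # No reference numbers are imported.
    return None
```

When the function returns `None`, the report would simply omit the GenesisM reference numbers.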

Corpus Design¶

The current 2026-03-21.v2 corpus is intentionally harder than the initial Phase 5 draft.

It contains:

8 task cases
23 labeled queries
4 shared-project sibling groups

Each sibling group shares the same project scope, primary dataset, and tool family so project-scoped systems cannot separate tasks trivially.

| Project | Cases | Shared dataset | Shared tool |
| --- | --- | --- | --- |
| `phase5_auth_reliability` | `jwt_timezone_fix`, `jwt_refresh_grace_window` | `auth_logs@2026-03-21` | `cargo-test` |
| `phase5_regulator_screening` | `mmseqs_marker_search`, `mmseqs_sigma_factor_search` | `screen_counts@v3` | `mmseqs` |
| `phase5_event_bus_selection` | `kafka_queue_selection`, `nats_jetstream_selection` | `platform_requirements@adr-input-v2` | `benchmark-runner` |
| `phase5_repo_indexing` | `codebase_indexing`, `frontend_route_indexing` | `repository_snapshot@HEAD` | `index-codebase.sh` |
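The sibling-group property can be checked mechanically. The sketch below copies the rows from the table into plain dictionaries and verifies that every project has two cases sharing one dataset and one tool. The dictionary keys are illustrative and are not claimed to match the schema of task_memory_benchmark_corpus.json.

```python
from collections import defaultdict

# (case, project, dataset, tool) rows copied from the table above.
ROWS = [
    ("jwt_timezone_fix", "phase5_auth_reliability", "auth_logs@2026-03-21", "cargo-test"),
    ("jwt_refresh_grace_window", "phase5_auth_reliability", "auth_logs@2026-03-21", "cargo-test"),
    ("mmseqs_marker_search", "phase5_regulator_screening", "screen_counts@v3", "mmseqs"),
    ("mmseqs_sigma_factor_search", "phase5_regulator_screening", "screen_counts@v3", "mmseqs"),
    ("kafka_queue_selection", "phase5_event_bus_selection", "platform_requirements@adr-input-v2", "benchmark-runner"),
    ("nats_jetstream_selection", "phase5_event_bus_selection", "platform_requirements@adr-input-v2", "benchmark-runner"),
    ("codebase_indexing", "phase5_repo_indexing", "repository_snapshot@HEAD", "index-codebase.sh"),
    ("frontend_route_indexing", "phase5_repo_indexing", "repository_snapshot@HEAD", "index-codebase.sh"),
]
CASES = [dict(zip(("case", "project", "dataset", "tool"), row)) for row in ROWS]


def sibling_groups(cases):
    """Group cases by project scope."""
    groups = defaultdict(list)
    for case in cases:
        groups[case["project"]].append(case)
    return dict(groups)


def shares_scope(group):
    """True when every case in the group has the same dataset and tool."""
    return len({(c["dataset"], c["tool"]) for c in group}) == 1


groups = sibling_groups(CASES)
all_shared = all(shares_scope(g) for g in groups.values())
```

Because project, dataset, and tool are identical within a group, a retriever has to rely on task-specific content to tell the two siblings apart.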

The query set mixes:

task lifecycle queries
failed-attempt queries
parameter lookup queries
evidence lookup queries
why-chosen queries
paraphrased prompts that avoid copying exact benchmark wording

Metrics¶

The Phase 5 report includes:

hit@3
MRR
average search latency
p95 search latency
freshness
concurrency success rate
concurrency operations per second
seed time per benchmark mode

For external systems with flatter schemas, the benchmark currently scores task-level recovery rather than memd-style facet-level recovery. That difference is important when interpreting quality numbers.
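For readers unfamiliar with the retrieval metrics, here is a minimal sketch of hit@3, MRR, and p95 latency using their standard textbook definitions. It assumes each query is scored by the rank of its first relevant result and that p95 uses the nearest-rank method. The benchmark's own implementation may differ in these details.

```python
from typing import Iterable, List, Optional, Sequence


def first_hit_rank(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> Optional[int]:
    """1-based rank of the first relevant result, or None if nothing relevant was returned."""
    relevant = set(relevant_ids)
    for rank, result_id in enumerate(ranked_ids, start=1):
        if result_id in relevant:
            return rank
    return None


def hit_at_k(ranks: List[Optional[int]], k: int = 3) -> float:
    """Fraction of queries whose first relevant result is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: List[Optional[int]]) -> float:
    """Mean reciprocal rank; queries with no relevant result contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


# Four queries: hits at rank 1, 2, and 4, plus one miss.
ranks = [1, 2, None, 4]
print(hit_at_k(ranks, k=3))  # 0.5
print(mrr(ranks))            # 0.4375
```

With only a handful of latency samples, nearest-rank p95 is just the slowest sample, so small runs such as `--workers 1 --ops-per-worker 1` give a coarse tail estimate.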

memd-native Modes¶

The generated system names combine an execution lane with a schema mode, for example memd_cli_warm_task_memory or memd_cli_batch_chunk_baseline.

The chunk-baseline and task-memory modes use the same underlying task knowledge, but they seed and query it differently.
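The naming scheme can be illustrated by crossing the three lanes with the two schema modes. The `memd_{lane}_{mode}` pattern is inferred from the two example names above. Whether the report emits every combination is an assumption.

```python
from itertools import product

LANES = ["cli_cold", "cli_warm", "cli_batch"]
SCHEMA_MODES = ["chunk_baseline", "task_memory"]


def system_names(lanes=LANES, modes=SCHEMA_MODES):
    """Build memd-native system names as memd_<lane>_<schema mode>."""
    return [f"memd_{lane}_{mode}" for lane, mode in product(lanes, modes)]


names = system_names()
```

Under that assumption, a run with all three lanes yields six memd-native systems, including `memd_cli_warm_task_memory` and `memd_cli_batch_chunk_baseline`.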

`memd_chunk_baseline`¶

flattens each task into generic memory chunks
writes fewer benchmark documents overall
queries with memory.search
does not preserve task lifecycle structure as first-class searchable artifacts

`memd_task_memory`¶

seeds the real task.* lifecycle (task.start, task.progress, task.run_start, task.run_finish, task.add_evidence, task.finish)
writes more projection artifacts because each lifecycle event becomes searchable task memory
queries with task.search
uses task-aware exact filters plus candidate reranking

The relative quality and latency should be read from the current generated report. The task-memory mode usually writes more projections and is expected to cost more at seed time, while current retrieval quality depends on both the structured task.search filters and the strength of the generic memory.search baseline.

How To Run¶

./evals/bench/scripts/run_task_memory_benchmark.sh

You can also run the Python tool directly:

python3 evals/bench/tools/task_memory_benchmark.py \
  --memd-path target/release/memd \
  --corpus docs/scientific-task-memory/benchmark-results/task_memory_benchmark_corpus.json \
  --memd-lanes cli_cold cli_warm cli_batch \
  --workers 1 \
  --ops-per-worker 1

For a full side-by-side local run, make sure these are available:

target/debug/memd
a nearby GenesisM workspace containing gpt54, claude, geminipro, and geminiultra, or explicit --*-root arguments
Python with duckdb installed for the external tool adapters

The runner now prints explicit stage progress and applies external CLI timeouts so long live-system legs are observable and bounded during reproduction.

Why This Benchmark Matters¶

The benchmark is meant to answer one question:

Does the task-oriented knowledge artifact schema materially help later agents recover the information that generic chunk search tends to lose?

Specifically:

what worked
what failed
which parameters were used
why a method was chosen
what evidence supported the conclusion

That is the practical success criterion for Phase 5.