Scientific Task Memory Benchmark Results

This directory contains the tracked corpus and summary outputs for Phase 5 benchmarking of memd's task-oriented knowledge artifact layer.

Files

  • task_memory_benchmark_corpus.json: structured benchmark cases and labeled queries
  • task_memory_benchmark_results.json: machine-readable Phase 5 benchmark report
  • task_memory_benchmark_results.md: human-readable summary report

Intentionally not tracked:

  • task_memory_benchmark_data/: large generated benchmark databases, sparse indexes, and warm-index artifacts
  • task_memory_benchmark.log: local benchmark log output

The results files are generated by the script below, which also regenerates the omitted large artifacts when run locally:

./evals/bench/scripts/run_task_memory_benchmark.sh

What The Benchmark Measures

The benchmark compares two memd-native schema modes on the same knowledge:

  1. chunk_baseline
     • each case is flattened into generic memory chunks
     • retrieval uses memory.search

  2. task_memory
     • each case is seeded through the real task.* lifecycle
     • retrieval uses task.search

It now runs those modes across three local CLI execution lanes:

  • cli_cold: one executable process per operation
  • cli_warm: memd warm start plus --warm required so calls reuse the private CLI worker
  • cli_batch: memd batch --jsonl - --stream so scripted calls run inside one loaded process
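As a rough illustration of how the three lanes differ, the sketch below builds one argv per lane. Only the lane names and the flags (`--warm`, `batch --jsonl - --stream`) come from this README. The helper name `lane_argv`, the position of `--warm`, and the placeholder operation arguments are assumptions, not the runner's actual code.

```python
import shlex

def lane_argv(lane, memd_path, op_args):
    """Return a hypothetical argv for one operation in the given lane."""
    if lane == "cli_cold":
        # One fresh executable process per operation.
        return [memd_path, *op_args]
    if lane == "cli_warm":
        # Assumes `memd warm start` has already been run; the position of
        # --warm on the command line is a guess.
        return [memd_path, "--warm", *op_args]
    if lane == "cli_batch":
        # One loaded process; operations are streamed to it on stdin,
        # so op_args are not part of the argv here.
        return [memd_path, "batch", "--jsonl", "-", "--stream"]
    raise ValueError(f"unknown lane: {lane}")

# Placeholder operation arguments; the real CLI syntax is not shown in this README.
placeholder_op = ["<operation>", "<args>"]
for lane in ("cli_cold", "cli_warm", "cli_batch"):
    print(lane, "->", shlex.join(lane_argv(lane, "target/release/memd", placeholder_op)))
```

The point of the comparison is process lifetime: cli_cold pays startup cost on every call, while cli_warm and cli_batch amortize it across calls.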

The warm worker is a private Unix-socket CLI acceleration layer. It is not HTTP and is not an agent-visible integration surface.

It can also run live external comparisons when compatible checkouts are available:

  • gpt54_live_cli
  • gpt54_live_daemon
  • gpt54_live_tantivy
  • claude_live
  • geminipro_live
  • geminiultra_live

External live systems are discovered relative to the repo, or can be supplied explicitly with:

  • --genesism-root
  • --gpt54-root
  • --claude-root
  • --geminipro-root
  • --geminiultra-root
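A minimal sketch of the override-then-discover pattern, assuming explicit flags win over discovery. The flag names are the ones listed above. The fallback layout (a sibling `GenesisM` directory that contains one subdirectory per system) and the function names are assumptions for illustration only.

```python
import argparse
from pathlib import Path

SYSTEMS = ("gpt54", "claude", "geminipro", "geminiultra")

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--genesism-root", type=Path, default=None)
    for name in SYSTEMS:
        parser.add_argument(f"--{name}-root", type=Path, default=None)
    return parser

def resolve_roots(args, repo_root):
    """Prefer explicit --*-root values; otherwise probe an assumed layout."""
    # Assumed discovery rule: GenesisM sits next to the repo checkout.
    genesism = args.genesism_root or repo_root.parent / "GenesisM"
    roots = {"genesism": genesism}
    for name in SYSTEMS:
        explicit = getattr(args, f"{name}_root")
        # Assumed discovery rule: each system lives under the GenesisM root.
        roots[name] = explicit or genesism / name
    # A system whose directory is missing is reported as None (skipped).
    return {name: path if path.is_dir() else None for name, path in roots.items()}
```

Calling `resolve_roots(build_parser().parse_args(), Path.cwd())` would then yield a mapping in which unavailable systems are `None`, which matches the "when compatible checkouts are available" behavior described above.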

Optional reference numbers from GenesisM's earlier unified benchmark are imported when:

  • --genesism-reference-json is provided, or
  • unified_benchmark_results.json is found under the discovered GenesisM root

Those reference numbers are included for continuity, but they are not directly apples-to-apples because the earlier GenesisM memd benchmark predated memd's real task lifecycle.
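The lookup order for the optional reference file could look like the sketch below. The file name and the precedence of `--genesism-reference-json` come from the text above. The recursive search and the function name `find_reference_json` are assumptions.

```python
from pathlib import Path

REFERENCE_NAME = "unified_benchmark_results.json"

def find_reference_json(explicit, genesism_root):
    """Return the reference JSON path, or None when no reference is available."""
    if explicit is not None:
        # An explicit --genesism-reference-json value always wins.
        return explicit
    if genesism_root is None or not genesism_root.is_dir():
        return None
    # Assumption: "found under" means anywhere below the discovered root.
    matches = sorted(genesism_root.rglob(REFERENCE_NAME))
    return matches[0] if matches else None
```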

Corpus Design

The current 2026-03-21.v2 corpus is intentionally harder than the initial Phase 5 draft.

It contains:

  • 8 task cases
  • 23 labeled queries
  • 4 shared-project sibling groups

Each sibling group shares the same project scope, primary dataset, and tool family so project-scoped systems cannot separate tasks trivially.

| Project | Cases | Shared dataset | Shared tool |
| --- | --- | --- | --- |
| phase5_auth_reliability | jwt_timezone_fix, jwt_refresh_grace_window | auth_logs@2026-03-21 | cargo-test |
| phase5_regulator_screening | mmseqs_marker_search, mmseqs_sigma_factor_search | screen_counts@v3 | mmseqs |
| phase5_event_bus_selection | kafka_queue_selection, nats_jetstream_selection | platform_requirements@adr-input-v2 | benchmark-runner |
| phase5_repo_indexing | codebase_indexing, frontend_route_indexing | repository_snapshot@HEAD | index-codebase.sh |
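To make the sibling-group constraint concrete, the sketch below encodes the table above as a plain dictionary and counts how many candidate cases remain after filtering on project alone. The dictionary layout is illustrative and is not the schema of task_memory_benchmark_corpus.json.

```python
SIBLING_GROUPS = {
    "phase5_auth_reliability": {
        "cases": ["jwt_timezone_fix", "jwt_refresh_grace_window"],
        "dataset": "auth_logs@2026-03-21",
        "tool": "cargo-test",
    },
    "phase5_regulator_screening": {
        "cases": ["mmseqs_marker_search", "mmseqs_sigma_factor_search"],
        "dataset": "screen_counts@v3",
        "tool": "mmseqs",
    },
    "phase5_event_bus_selection": {
        "cases": ["kafka_queue_selection", "nats_jetstream_selection"],
        "dataset": "platform_requirements@adr-input-v2",
        "tool": "benchmark-runner",
    },
    "phase5_repo_indexing": {
        "cases": ["codebase_indexing", "frontend_route_indexing"],
        "dataset": "repository_snapshot@HEAD",
        "tool": "index-codebase.sh",
    },
}

def candidates_after_project_filter(groups):
    """Number of cases still indistinguishable when only the project is known."""
    return {project: len(group["cases"]) for project, group in groups.items()}

total_cases = sum(len(group["cases"]) for group in SIBLING_GROUPS.values())
print(len(SIBLING_GROUPS), "groups,", total_cases, "cases")  # 4 groups, 8 cases
print(candidates_after_project_filter(SIBLING_GROUPS))
```

Every project leaves two candidates, so a system that scopes only by project, dataset, or tool still has to tell two sibling tasks apart.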

The query set mixes:

  • task lifecycle queries
  • failed-attempt queries
  • parameter lookup queries
  • evidence lookup queries
  • why-chosen queries
  • paraphrased prompts that avoid copying exact benchmark wording

Metrics

The Phase 5 report includes:

  • hit@3
  • MRR
  • average search latency
  • p95 search latency
  • freshness
  • concurrency success rate
  • concurrency operations per second
  • seed time per benchmark mode

For external systems with flatter schemas, the benchmark currently scores task-level recovery rather than memd-style facet-level recovery. That difference is important when interpreting quality numbers.
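For readers unfamiliar with the ranking metrics, here is a minimal sketch using the standard definitions of hit@k and MRR, plus a nearest-rank p95. The benchmark's own implementation may differ, for example in how it computes percentiles. The function names and the sample runs are made up for illustration.

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k=3):
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if any(rid in relevant_ids for rid in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    index = max(math.ceil(0.95 * len(ordered)) - 1, 0)
    return ordered[index]

def summarize(runs):
    """runs: list of (ranked_ids, relevant_ids, latency_ms) tuples."""
    n = len(runs)
    latencies = [latency for _, _, latency in runs]
    return {
        "hit@3": sum(hit_at_k(r, rel) for r, rel, _ in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel, _ in runs) / n,
        "avg_latency_ms": sum(latencies) / n,
        "p95_latency_ms": p95(latencies),
    }

# Two made-up queries: one hit at rank 1, one hit at rank 4 (outside the top 3).
runs = [
    (["a", "b", "c"], {"a"}, 10.0),
    (["x", "y", "z", "b"], {"b"}, 20.0),
]
print(summarize(runs))
# {'hit@3': 0.5, 'mrr': 0.625, 'avg_latency_ms': 15.0, 'p95_latency_ms': 20.0}
```

The second query shows why hit@3 and MRR are reported together: a result at rank 4 counts as a miss for hit@3 but still contributes 0.25 to MRR.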

memd-native Modes

The generated system names combine an execution lane with a schema mode, for example memd_cli_warm_task_memory or memd_cli_batch_chunk_baseline.
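The naming scheme can be sketched as a cross product of lanes and schema modes. The `memd_<lane>_<mode>` pattern is inferred from the two examples above. The function name is hypothetical.

```python
from itertools import product

LANES = ("cli_cold", "cli_warm", "cli_batch")
SCHEMA_MODES = ("chunk_baseline", "task_memory")

def system_names(lanes=LANES, modes=SCHEMA_MODES):
    """Combine every execution lane with every schema mode."""
    return [f"memd_{lane}_{mode}" for lane, mode in product(lanes, modes)]

for name in system_names():
    print(name)
```

With three lanes and two schema modes this yields six memd-native systems per run.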

The chunk-baseline and task-memory modes use the same underlying task knowledge, but they seed and query it differently.

memd_chunk_baseline

  • flattens each task into generic memory chunks
  • writes fewer benchmark documents overall
  • queries with memory.search
  • does not preserve task lifecycle structure as first-class searchable artifacts

memd_task_memory

  • seeds the real task.* lifecycle (task.start, task.progress, task.run_start, task.run_finish, task.add_evidence, task.finish)
  • writes more projection artifacts because each lifecycle event becomes searchable task memory
  • queries with task.search
  • uses task-aware exact filters plus candidate reranking
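The difference between the two seeding strategies can be sketched as follows. The lifecycle operation names are the ones listed above. The case layout, the payload shape, and both function names are placeholders, since this README does not show the real payloads.

```python
LIFECYCLE_OPS = (
    "task.start",
    "task.progress",
    "task.run_start",
    "task.run_finish",
    "task.add_evidence",
    "task.finish",
)

def chunk_baseline_chunks(case):
    """Flatten a case into generic text chunks; the task structure is lost."""
    return [f"{case['id']}: {key} = {value}" for key, value in case["fields"].items()]

def task_memory_ops(case):
    """Emit one operation per lifecycle stage (payloads are placeholders)."""
    return [(op, {"task": case["id"]}) for op in LIFECYCLE_OPS]

# Hypothetical case with placeholder field values.
case = {"id": "jwt_timezone_fix", "fields": {"outcome": "<...>", "parameters": "<...>"}}
print(len(chunk_baseline_chunks(case)), "chunks vs", len(task_memory_ops(case)), "lifecycle operations")
```

Because every lifecycle event becomes its own searchable artifact, the task-memory mode writes more per case than the flattened baseline.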

The relative quality and latency should be read from the current generated report. The task-memory mode usually writes more projections and is expected to cost more at seed time, while current retrieval quality depends on both the structured task.search filters and the strength of the generic memory.search baseline.

How To Run

./evals/bench/scripts/run_task_memory_benchmark.sh

You can also run the Python tool directly:

python3 evals/bench/tools/task_memory_benchmark.py \
  --memd-path target/release/memd \
  --corpus docs/scientific-task-memory/benchmark-results/task_memory_benchmark_corpus.json \
  --memd-lanes cli_cold cli_warm cli_batch \
  --workers 1 \
  --ops-per-worker 1

For a full side-by-side local run, make sure these are available:

  • target/debug/memd
  • a nearby GenesisM workspace containing gpt54, claude, geminipro, and geminiultra, or explicit --*-root arguments
  • Python with duckdb installed for the external tool adapters

The runner now prints explicit stage progress and applies external CLI timeouts so long live-system legs are observable and bounded during reproduction.
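One common way to get bounded, observable stages is to wrap each external call with `subprocess.run` and a timeout, as in the sketch below. This shows the general pattern rather than the runner's actual code. The function name, log format, and timeout values are assumptions.

```python
import subprocess
import sys
import time

def run_stage(name, argv, timeout_s):
    """Run one external command, printing progress and bounding its runtime."""
    print(f"[stage] {name}: start", flush=True)
    started = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        status = "timeout"
    elapsed = time.monotonic() - started
    print(f"[stage] {name}: {status} after {elapsed:.1f}s", flush=True)
    return status

# Demonstration with the current Python interpreter standing in for a live system.
run_stage("quick", [sys.executable, "-c", "pass"], timeout_s=30)
run_stage("stuck", [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
```

A stage that exceeds its budget is reported as a timeout instead of stalling the remaining legs.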

Why This Benchmark Matters

The benchmark is meant to answer one question:

Does the task-oriented knowledge artifact schema materially help later agents recover the information that generic chunk search tends to lose?

Specifically, a later agent should be able to recover:

  • what worked
  • what failed
  • which parameters were used
  • why a method was chosen
  • what evidence supported the conclusion

That is the practical success criterion for Phase 5.