Scientific Knowledge Artifact Schema
This directory documents the task-oriented knowledge artifact schema used by memd.
Purpose
The task schema exists to make agent reporting consistent across sessions and across different agents using the same tenant.
The system is designed so later agents can reliably recover:
- task goal
- motivation
- hypothesis
- scientific or technical question
- tool choice
- parameters
- inputs and outputs
- what worked
- what failed
- evidence
- validation
- uncertainty
- follow-up actions
That consistency is the main reason the schema exists. It is not just a richer note format.
Canonical Artifact Envelope
The source of truth is the canonical task artifact envelope implemented in:
Current artifact kinds:
- task_start
- task_progress
- run_start
- run_finish
- evidence
- review
- revision
- verification
- decision
- digest
- task_finish
Important canonical fields include:
- artifact_id
- artifact_kind
- task_id
- parent_task_id
- tenant_id
- project_id
- agent_id
- session_id
- status
- artifact_role
- challenge_id
- thread_id
- reply_to_artifact_id
- relation_kind
- goal
- motivation
- hypothesis
- scientific_question
- summary
- blockers
- what_worked
- what_failed
- validation
- uncertainty
- followups
- dataset_refs
- entity_refs
- related_artifact_ids
- contributors
- tool_name
- tool_version
- command
- parameters
- inputs
- outputs
- metrics
- why_chosen
- confidence
- requested_action
- verification_status
- promotion_state
- digest_key
- source_updated_at_ms
- compute_budget
- cost_actual
- data_access_level
- policy_tags
- allowed_tools
- approval_state
- provenance
- timestamp_created
- timestamp_observed
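As an illustration only, a small subset of these fields could be modeled as shown here. The field names and artifact kinds come from this README; the Python types, the defaults, and the TaskArtifact class name are assumptions made for the sketch and do not describe the memd implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Artifact kinds as listed in this README.
ARTIFACT_KINDS = {
    "task_start", "task_progress", "run_start", "run_finish", "evidence",
    "review", "revision", "verification", "decision", "digest", "task_finish",
}


@dataclass
class TaskArtifact:
    """Hypothetical, partial view of the canonical envelope (types are assumed)."""
    artifact_id: str
    artifact_kind: str
    task_id: str
    tenant_id: str
    project_id: Optional[str] = None
    goal: Optional[str] = None
    what_worked: List[str] = field(default_factory=list)
    what_failed: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject kinds that are not in the documented list.
        if self.artifact_kind not in ARTIFACT_KINDS:
            raise ValueError(f"unknown artifact_kind: {self.artifact_kind}")


example = TaskArtifact(
    artifact_id="a-1",
    artifact_kind="task_start",
    task_id="t-1",
    tenant_id="lab",
    goal="benchmark aligner settings",
)
```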
Normalized Metadata Tables
The canonical envelope is projected into normalized SQLite side tables in:
Current tables:
- task_artifacts
- tasks
- task_events
- runs
- evidence
- datasets
- entities
- task_datasets
- task_entities
- artifact_links
- artifact_relations
- artifact_contributors
- challenges
These tables exist to support exact task-aware filters and joins without coupling retrieval to every schema field.
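A minimal sketch of the kind of exact filter and join these side tables make possible, using an in-memory SQLite database. The table names are taken from the list above; every column definition and all sample rows are assumptions for illustration, since the real schema is not shown in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column layouts below are assumed; only the table names come from the README.
conn.executescript(
    """
    CREATE TABLE task_artifacts (
        artifact_id TEXT PRIMARY KEY, task_id TEXT, artifact_kind TEXT, status TEXT
    );
    CREATE TABLE datasets (
        dataset_id INTEGER PRIMARY KEY, dataset_name TEXT, dataset_version TEXT
    );
    CREATE TABLE task_datasets (task_id TEXT, dataset_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO task_artifacts VALUES (?, ?, ?, ?)",
    [
        ("a-1", "t-1", "task_start", "open"),
        ("a-2", "t-1", "evidence", "done"),
        ("a-3", "t-2", "evidence", "done"),
    ],
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    [(1, "pbmc3k", "v1"), (2, "other", "v2")],
)
conn.executemany("INSERT INTO task_datasets VALUES (?, ?)", [("t-1", 1), ("t-2", 2)])

# Exact filter: evidence artifacts whose task used a given dataset.
rows = conn.execute(
    """
    SELECT a.artifact_id
    FROM task_artifacts AS a
    JOIN task_datasets AS td ON td.task_id = a.task_id
    JOIN datasets AS d ON d.dataset_id = td.dataset_id
    WHERE a.artifact_kind = ? AND d.dataset_name = ?
    ORDER BY a.artifact_id
    """,
    ("evidence", "pbmc3k"),
).fetchall()
```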
Retrieval Projection
Canonical artifacts are projected into ordinary retrieval chunks. The projection layer lives in:
Current projection kinds:
- task_goal
- task_summary
- run
- evidence
- decision
- digest
- worked
- failed
- validation
The projection layer is intentionally separate from the canonical envelope:
- canonical artifact = source of truth
- retrieval chunk = search-optimized derived text
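The separation between the two layers can be sketched as a pure function from one canonical artifact to derived chunks. The mapping from envelope fields to projection kinds and the chunk dictionary shape are assumptions for illustration; only the field names and projection kind names appear in this README.

```python
# Assumed mapping from envelope fields to projection kinds named in this README.
FIELD_TO_PROJECTION_KIND = {
    "goal": "task_goal",
    "summary": "task_summary",
    "what_worked": "worked",
    "what_failed": "failed",
    "validation": "validation",
}


def project(artifact: dict) -> list:
    """Derive search-oriented chunks from one canonical artifact (hypothetical rules)."""
    chunks = []
    for field_name, kind in FIELD_TO_PROJECTION_KIND.items():
        value = artifact.get(field_name)
        if not value:
            continue  # nothing to project for this field
        text = "; ".join(value) if isinstance(value, list) else str(value)
        # Each chunk keeps the artifact_id so a hit can be traced to its source.
        chunks.append(
            {"projection_kind": kind, "artifact_id": artifact["artifact_id"], "text": text}
        )
    return chunks


chunks = project(
    {
        "artifact_id": "a-9",
        "goal": "tune k",
        "what_failed": ["k=5 diverged", "k=7 too slow"],
    }
)
```

The canonical artifact is never modified; chunks can be regenerated from it at any time.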
Summary-First Digests
Persisted digest artifacts currently use these artifact_role values:
- project_brief
- task_resume
- failure_library
- decision_library
- evidence_library
- highlight_library
These digests are stored as canonical digest artifacts and projected into ordinary retrieval chunks instead of being kept in a separate store.
The current implementation also tracks:
- promotion_state with values raw, summarized, canonical, and verified
- stable digest identities keyed by role and scope
- source_updated_at_ms so regenerated digests can reflect source freshness
task.resume reuses the real task_id for its digest so resume_task retrieval stays aligned with canonical task filters.
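One way to picture a stable digest identity and a freshness check is sketched here. The key layout, the digest_key and is_stale helper names, and the staleness rule are all assumptions; the README only states that identities are keyed by role and scope and that source_updated_at_ms is tracked.

```python
# promotion_state values as listed in this README.
PROMOTION_STATES = ("raw", "summarized", "canonical", "verified")


def digest_key(artifact_role, tenant_id, project_id=None, task_id=None):
    """Build a deterministic key from role and scope (hypothetical layout)."""
    scope = [tenant_id, project_id or "-", task_id or "-"]
    return ":".join([artifact_role, *scope])


def is_stale(digest_source_updated_at_ms, latest_source_updated_at_ms):
    """Assumed rule: a digest is stale if its sources changed after it was built."""
    return latest_source_updated_at_ms > digest_source_updated_at_ms


key = digest_key("task_resume", "lab", project_id="p1", task_id="t-1")
```

Because the key depends only on role and scope, regenerating a digest replaces the earlier one rather than creating a second record.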
Trust Boundary
memd exposes an explicit trust vocabulary in local operation results:
- semantic_candidate: retrieved by similarity without canonical artifact grounding
- canonical_record: linked to a non-digest canonical artifact
- compiled_digest_hint: linked to a digest artifact that still requires re-grounding
- verified_record: linked to an explicit verification artifact or other verified record
The intended workflow is:
- use semantic search or digest helpers to generate candidates
- use artifact.find_related to surface overlapping canonical artifacts
- use artifact.verification when a distinct agent needs to countersign a claim
- trust the supporting canonical artifact IDs, not digest text on its own
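The trust vocabulary above can be read as a small decision function. This is a sketch under stated assumptions: the hit dictionary shape, the "artifact" key, the boolean "verified" flag, and the order in which the rules are checked are all hypothetical; only the four labels and their meanings come from this README.

```python
def trust_label(hit: dict) -> str:
    """Classify a search hit using the documented trust vocabulary (assumed rules)."""
    artifact = hit.get("artifact")  # hypothetical: linked canonical artifact, or None
    if artifact is None:
        # Retrieved by similarity only, with no canonical grounding.
        return "semantic_candidate"
    if artifact.get("artifact_kind") == "verification" or artifact.get("verified"):
        return "verified_record"
    if artifact.get("artifact_kind") == "digest":
        # Digest text is a hint and still needs re-grounding.
        return "compiled_digest_hint"
    return "canonical_record"


labels = [
    trust_label({"text": "similar chunk"}),
    trust_label({"artifact": {"artifact_kind": "evidence"}}),
    trust_label({"artifact": {"artifact_kind": "digest"}}),
    trust_label({"artifact": {"artifact_kind": "verification"}}),
]
```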
Exact Filters
task.search currently supports exact filters over the normalized side tables for:
- task_id
- artifact_kind
- status
- challenge_id
- thread_id
- reply_to_artifact_id
- artifact_role
- dataset_name
- dataset_version
- entity_name
- entity_type
- tool_name
- project_id
- agent_id
- session_id
- requested_action
- verification_status
- relation_kind
These filters are resolved first, then the candidate set is reranked for retrieval.
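The filter-then-rerank order can be sketched in a few lines. The in-memory record list, the filter_then_rerank and overlap_score names, and the term-overlap scoring are assumptions chosen to keep the example self-contained; memd's actual ranking is not described in this README.

```python
def overlap_score(query: str):
    """Return a toy scoring function based on shared terms (stand-in for real ranking)."""
    terms = set(query.lower().split())

    def score(record: dict) -> int:
        return len(terms & set(record.get("summary", "").lower().split()))

    return score


def filter_then_rerank(records, filters, score):
    # Step 1: exact filters. Every requested field must match exactly.
    candidates = [r for r in records if all(r.get(k) == v for k, v in filters.items())]
    # Step 2: rerank only the surviving candidates.
    return sorted(candidates, key=score, reverse=True)


records = [
    {"artifact_id": "a-1", "artifact_kind": "evidence", "tool_name": "bwa",
     "summary": "alignment rate improved"},
    {"artifact_id": "a-2", "artifact_kind": "evidence", "tool_name": "bwa",
     "summary": "alignment failed on sample"},
    {"artifact_id": "a-3", "artifact_kind": "decision", "tool_name": "bwa",
     "summary": "alignment rate improved a lot"},
]
results = filter_then_rerank(
    records,
    {"artifact_kind": "evidence", "tool_name": "bwa"},
    overlap_score("alignment rate"),
)
```

Note that a-3 never reaches the ranking step even though its text matches well, because it fails the exact artifact_kind filter.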
memory.search, task.search, and artifact.search also accept mode with generic, brief_project, resume_task, find_failures, find_decisions, find_evidence, and find_highlights to bias candidate planning toward the corresponding digests and canonical summaries.
memory.search and artifact.search support compact response shaping with compact: true or token_budget. Compact output uses the response packer after ranking and visibility filtering, preserves identifiers and trust/grounding metadata, and can omit large text, matched_text, and full artifact fields so callers can fetch only selected records.
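A rough sketch of what compact shaping means for a caller: large fields are dropped after ranking while identifiers and trust metadata stay. The hit layout, the "artifact" and "trust" keys, and the pack_compact helper are assumptions for illustration and are not the response packer itself.

```python
# Fields treated as large in this sketch; "artifact" as a key name is assumed.
LARGE_FIELDS = ("text", "matched_text", "artifact")


def pack_compact(ranked_hits):
    """Drop large fields from already-ranked hits, keeping ids and trust metadata."""
    return [
        {key: value for key, value in hit.items() if key not in LARGE_FIELDS}
        for hit in ranked_hits
    ]


full_hits = [
    {
        "id": "c-1",
        "artifact_id": "a-1",
        "trust": "canonical_record",
        "text": "long chunk text ...",
        "matched_text": "chunk text",
        "artifact": {"artifact_id": "a-1", "summary": "..."},
    }
]
compact_hits = pack_compact(full_hits)
```

The caller would then fetch full records only for the identifiers it actually needs.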
memory.compact can explicitly refresh project brief and failure/decision/evidence/highlight library digests through project_id, digest_modes, and force_digest_rebuild.
memory.dream plans tenant/project-scoped retention and compaction work. It defaults to dry_run: true; safe apply mode retires duplicate digest projections through lifecycle metadata, can refresh digests, and records a traceable dream_report artifact. Exact duplicate raw chunks and non-digest task artifacts are reported by memory.health but remain report-only under the current safe strategy. Append-only segment rewrite is reported as blocked until recovery-safe physical rewrite support is implemented.
memory.stats uses aggregate counts and reports active_chunks, deleted_chunks, total_chunks, and active/deleted/all chunk-type maps. memory.health adds a read-only tenant/project report for duplicate canonical text, index coverage, payload-size percentiles, recent latency tails, and explicit alias scope information. include_examples controls whether duplicate previews are returned; duplicate_limit limits only those previews, not the aggregate duplicate group, row, or byte counts.
memory.metrics reports latency, index, tiered-cache, rejection, and estimated payload-size metrics. The token_usage block counts serialized operation request/response bytes and estimates tokens as ceil(serialized_payload_bytes / 4) by operation and in recent calls. This is the part memd can observe directly; full agent token deltas still require paired agent runs that capture provider/API usage or CLI token footers.
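The token estimate is simple arithmetic and can be reproduced as follows. The formula is the one stated above; serializing the payload as compact UTF-8 JSON is an assumption made so the sketch has concrete bytes to count.

```python
import json
import math


def estimate_tokens(payload) -> int:
    """Estimate tokens as ceil(serialized_payload_bytes / 4)."""
    # Assumed serialization: compact JSON encoded as UTF-8.
    serialized = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return math.ceil(len(serialized) / 4)


# '{"a":1}' is 7 bytes, so the estimate is ceil(7 / 4) = 2.
small = estimate_tokens({"a": 1})
```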
context.find_relevant_context can prepend hot-context chunks, but the hot pre-scan is wall-clock bounded so broad lookups on large tenants still continue through normal retrieval. List-style retrieval scans skip stale unreadable segment rows with a warning; strict point reads through memory.get still surface storage errors.
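A wall-clock bounded pre-scan of this kind can be sketched with a monotonic deadline. The function name, the callback parameters, and the default budget are hypothetical; the point is only that normal retrieval always runs, whether or not the hot scan finishes.

```python
import time


def find_context_sketch(hot_chunks, matches, normal_retrieval, budget_s=0.05):
    """Prepend hot-context hits found within a time budget, then continue normally."""
    deadline = time.monotonic() + budget_s
    hot_hits = []
    for chunk in hot_chunks:
        if time.monotonic() >= deadline:
            break  # budget exhausted: stop scanning, do not fail the lookup
        if matches(chunk):
            hot_hits.append(chunk)
    # Normal retrieval is not skipped, even when the pre-scan was cut short.
    return hot_hits + normal_retrieval()


hot = ["alpha note", "beta note"]
with_budget = find_context_sketch(hot, lambda c: "alpha" in c, lambda: ["normal hit"], 5.0)
no_budget = find_context_sketch(hot, lambda c: "alpha" in c, lambda: ["normal hit"], 0.0)
```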
Conversation Event Tags
Raw conversational chunks can use ordinary tags for event binding without a schema migration:
- event:<event_id> groups factual and relational entries from one observed event.
- entry:factual, entry:relational, and entry:synthesis distinguish raw facts, relations, and derived summaries.
- speaker:<id> and turn:<n> are optional caller-owned labels.
memory.search keeps default ranking behavior unchanged. When callers pass
expand_event_siblings: true, each ranked hit that has an event:<id> tag can
include an expanded_siblings array containing bounded same-tenant/same-project
chunks with the same event tag. These siblings are context for the hit, not
additional ranked hits.
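Sibling expansion can be sketched as follows. The chunk dictionary layout, the helper names, and the sibling limit are assumptions; the behavior mirrors the description above: ranking is untouched, siblings share the hit's event tag, tenant, and project, and they are attached to the hit rather than added to the ranked list.

```python
def event_tag(chunk):
    """Return the chunk's event:<id> tag, or None if it has none."""
    return next((tag for tag in chunk["tags"] if tag.startswith("event:")), None)


def expand_siblings(ranked_hits, all_chunks, limit=5):
    """Attach bounded same-event siblings to each hit without changing hit order."""
    expanded = []
    for hit in ranked_hits:
        result = dict(hit)
        tag = event_tag(hit)
        if tag is not None:
            siblings = [
                c for c in all_chunks
                if c["id"] != hit["id"]
                and tag in c["tags"]
                and c["tenant_id"] == hit["tenant_id"]
                and c["project_id"] == hit["project_id"]
            ]
            result["expanded_siblings"] = siblings[:limit]  # bounded
        expanded.append(result)
    return expanded


c1 = {"id": 1, "tags": ["event:e1", "entry:factual"], "tenant_id": "lab", "project_id": "p"}
c2 = {"id": 2, "tags": ["event:e1", "entry:relational"], "tenant_id": "lab", "project_id": "p"}
c3 = {"id": 3, "tags": ["event:e1"], "tenant_id": "other", "project_id": "p"}
c4 = {"id": 4, "tags": [], "tenant_id": "lab", "project_id": "p"}
expanded = expand_siblings([c1, c4], [c1, c2, c3, c4])
```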
Durability
Task artifacts are WAL-backed. The relevant implementation is in:
This means canonical task side tables can be rebuilt during recovery, rather than depending on best-effort metadata writes.
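The idea of rebuilding side tables from a log can be sketched by replaying records into a fresh SQLite table. Everything concrete here is an assumption for illustration: the JSON-lines log format, the three-column table, and the rebuild_side_table helper do not describe memd's actual WAL or recovery code.

```python
import json
import sqlite3


def rebuild_side_table(wal_lines):
    """Replay logged artifact records into a fresh side table (hypothetical format)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_artifacts ("
        "artifact_id TEXT PRIMARY KEY, task_id TEXT, artifact_kind TEXT)"
    )
    for line in wal_lines:
        record = json.loads(line)
        # INSERT OR REPLACE keeps replay idempotent: a repeated record
        # overwrites the earlier row instead of duplicating it.
        conn.execute(
            "INSERT OR REPLACE INTO task_artifacts VALUES (?, ?, ?)",
            (record["artifact_id"], record["task_id"], record["artifact_kind"]),
        )
    conn.commit()
    return conn


wal = [
    json.dumps({"artifact_id": "a-1", "task_id": "t-1", "artifact_kind": "task_start"}),
    json.dumps({"artifact_id": "a-2", "task_id": "t-1", "artifact_kind": "evidence"}),
    json.dumps({"artifact_id": "a-1", "task_id": "t-1", "artifact_kind": "task_start"}),
]
rebuilt = rebuild_side_table(wal)
row_count = rebuilt.execute("SELECT COUNT(*) FROM task_artifacts").fetchone()[0]
```

Because the table is derived entirely from the log, losing or corrupting it costs a replay, not data.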
Agent Guardrails
Agents should follow this contract:
- Search first.
- Use task.start before substantive work.
- Use task.progress only for meaningful checkpoints.
- Use task.run_start/task.run_finish around substantive runs.
- Use task.add_evidence when a concrete result matters.
- Use task.finish to record worked/failed/validation/uncertainty/followups.
- Use artifact.create when the important event is critique, revision, verification, or thread-level coordination rather than a task lifecycle step.
- Use compact memory.search/artifact.search first for broad retrieval, then memory.get or artifact.get for selected full records.
- Use memory.dream in dry-run mode before applying retention or compaction actions.
- Use artifact.search/artifact.list_thread when the artifact itself is the unit of exchange.
This is how memd enforces consistent reporting across agents in the same tenant. For one trusted machine or trust domain, agents should prefer a stable shared tenant_id and use project_id, thread_id, and task_id for narrower scopes. Cross-tenant project recovery is explicit: configure same-project compatibility aliases for known historical scopes and inspect scope_expansion plus per-hit origin metadata in search responses.
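Part of the contract above is an ordering constraint, which a client-side check could approximate as sketched here. The operation names come from this README; the contract_violations helper and the specific rules it encodes are one possible reading of the contract, not something memd is documented to run.

```python
# Operations that, in this reading, only make sense after task.start.
NEEDS_TASK_START = {
    "task.progress", "task.run_start", "task.run_finish",
    "task.add_evidence", "task.finish",
}


def contract_violations(operations):
    """Return human-readable ordering problems for a sequence of operation names."""
    problems = []
    seen = set()
    open_runs = 0
    for op in operations:
        if op in NEEDS_TASK_START and "task.start" not in seen:
            problems.append(f"{op} called before task.start")
        if op == "task.start" and not any(s.endswith(".search") for s in seen):
            problems.append("task.start called without a prior search")
        if op == "task.run_start":
            open_runs += 1
        if op == "task.run_finish":
            if open_runs == 0:
                problems.append("task.run_finish without task.run_start")
            else:
                open_runs -= 1
        seen.add(op)
    return problems


good = contract_violations([
    "memory.search", "task.start", "task.run_start", "task.run_finish",
    "task.add_evidence", "task.finish",
])
bad = contract_violations(["task.start", "task.run_finish"])
```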
Documentation Status
This README documents the implemented schema at a high level. It does not yet provide:
- a formal external schema spec file
- migration version history
- generated schema diagrams
Those can be added later if needed, but the current behavior is documented here and in the source files above.