MEME: Multi-Entity & Evolving Memory Evaluation

Examples of the six MEME task types across three categories: Retrieval (Exact Recall, Aggregation), State Management (Tracking, Deletion), and Dependency Reasoning (Cascade, Absence).
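The grouping in this caption can be written down as a small lookup table. The sketch below is only an illustration: the names `Category`, `TASKS`, and `tasks_in` are hypothetical and not part of MEME itself, and the short keys are the column abbreviations used in the results table.

```python
from enum import Enum


class Category(Enum):
    """The three task categories named in the caption."""
    RETRIEVAL = "Retrieval"
    STATE_MANAGEMENT = "State Management"
    DEPENDENCY_REASONING = "Dependency Reasoning"


# Abbreviation -> (full task name, category). Hypothetical structure,
# filled in only with the names the caption and table header provide.
TASKS = {
    "ER": ("Exact Recall", Category.RETRIEVAL),
    "Agg": ("Aggregation", Category.RETRIEVAL),
    "Tr": ("Tracking", Category.STATE_MANAGEMENT),
    "Del": ("Deletion", Category.STATE_MANAGEMENT),
    "Cas": ("Cascade", Category.DEPENDENCY_REASONING),
    "Abs": ("Absence", Category.DEPENDENCY_REASONING),
}


def tasks_in(category):
    """Return the full task names that belong to one category."""
    return [name for name, cat in TASKS.values() if cat is category]


if __name__ == "__main__":
    for category in Category:
        print(category.value, "->", ", ".join(tasks_in(category)))
```

Each category holds exactly two tasks, so a per-category score can be derived from the per-task columns without any extra bookkeeping.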

MEME defines six memory tasks spanning the full multi-entity × evolving space, including Cascade, Absence, and Deletion: three tasks that no prior benchmark scores.

Across 100 controlled episodes and six memory systems, every system collapses on dependency reasoning under the default configuration (average accuracy: Cascade 3%, Absence 1%), indicating that closing this gap currently depends on configurations that are not practical at scale.

MEME's taxonomy along two dimensions: entity scope (single vs multi-entity) and temporal dynamics (static vs evolving), with six tasks distributed across the four quadrants.

Overall result: per-task accuracy across six memory systems

All systems use gpt-4.1-mini as both internal and answering LLM on 100 episodes (filler ≈ 32K tokens). Best per task within the six main systems is in bold. Cascade, Absence, and Deletion use trivial-pass filtering.

System	ER	Agg	Tr	Del	Cas	Abs	Overall	Ingest $/ep	Inference $/ep
Raw retrieval
BM25	1.00	0.05	0.16	0.27	0.02	0.00	0.25	—	$0.04
text-emb-3-small	0.96	0.33	0.46	0.17	0.04	0.00	0.33	—	$0.03
LLM-processed memory
Mem0	0.67	0.35	0.43	0.21	0.03	0.00	0.28	$0.10	$0.01
Graphiti	0.03	0.01	0.04	0.09	0.02	0.01	0.03	$0.55	$0.00
File-based agents
MD-flat	0.94	0.45	0.77	0.25	0.06	0.05	0.42	$0.04	$0.01
Karpathy Wiki	0.11	0.18	0.27	0.03	0.01	0.02	0.10	$1.17	$0.22
Internal-LLM swap (20-ep subset)
MD-flat × Opus 4.7 (claude-opus-4-7, 20 ep)	0.60	0.80	0.20	0.80	0.32	0.59	0.55	$3.87	$0.66
Average (six main systems)	0.62	0.23	0.35	0.17	0.03	0.01	0.24	—	—
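The Average row can be reproduced from the six main-system rows with a plain column mean. The sketch below copies the accuracy values from the table; the names `MAIN_SYSTEMS`, `REPORTED_AVERAGE`, and `column_means` are illustrative, and the unweighted mean is an assumption about how the row was computed.

```python
from statistics import mean

COLUMNS = ["ER", "Agg", "Tr", "Del", "Cas", "Abs", "Overall"]

# Accuracy values copied from the table (six main systems only;
# the 20-episode Opus swap row is excluded from the average).
MAIN_SYSTEMS = {
    "BM25":             [1.00, 0.05, 0.16, 0.27, 0.02, 0.00, 0.25],
    "text-emb-3-small": [0.96, 0.33, 0.46, 0.17, 0.04, 0.00, 0.33],
    "Mem0":             [0.67, 0.35, 0.43, 0.21, 0.03, 0.00, 0.28],
    "Graphiti":         [0.03, 0.01, 0.04, 0.09, 0.02, 0.01, 0.03],
    "MD-flat":          [0.94, 0.45, 0.77, 0.25, 0.06, 0.05, 0.42],
    "Karpathy Wiki":    [0.11, 0.18, 0.27, 0.03, 0.01, 0.02, 0.10],
}

# The Average row as reported in the table.
REPORTED_AVERAGE = [0.62, 0.23, 0.35, 0.17, 0.03, 0.01, 0.24]


def column_means(rows):
    """Unweighted mean of each column across all systems."""
    return [mean(column) for column in zip(*rows.values())]


averages = dict(zip(COLUMNS, column_means(MAIN_SYSTEMS)))

if __name__ == "__main__":
    for name, reported in zip(COLUMNS, REPORTED_AVERAGE):
        print(f"{name}: recomputed {averages[name]:.3f}, reported {reported:.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed means can differ from the reported row in the last digit; comparing with a small tolerance is safer than expecting exact equality.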

Walkthrough: episode `sw_033` (software project)

17 sessions (5 evidence, 12 filler) · ~35K tokens · cascade root: team_lead

Result grid (6 tasks × 7 systems, after-questions)

Task	BM25	text-emb-3-small	Graphiti	Karpathy Wiki	MD-flat	Mem0	MD-flat × Opus 4.7 (claude-opus-4-7)
ER
Agg
Tr
Del
Cas
Abs
Total	2/6	3/6	0/6	1/6	2/6	2/6	3/6


[Figure: per-system memory timeline for the episode. Axes: memory size, gold-fact %. Legend: filler session, evidence session, ▲ probe (evaluation event).]