MEME Multi-Entity & Evolving Memory Evaluation

Examples of the six MEME task types across three categories: Retrieval (Exact Recall, Aggregation), State Management (Tracking, Deletion), and Dependency Reasoning (Cascade, Absence).

MEME defines six memory tasks spanning the full multi-entity × evolving space, including Cascade, Absence, and Deletion: three tasks that no prior benchmark scores.

Across 100 controlled episodes and six memory systems, all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy), revealing that closure currently depends on configurations that are not practical at scale.

MEME's taxonomy along two dimensions: entity scope (single vs multi-entity) and temporal dynamics (static vs evolving), with six tasks distributed across the four quadrants.

Overall result Per-task accuracy across six memory systems

All systems use gpt-4.1-mini as both internal and answering LLM on 100 episodes (filler ≈ 32K tokens). Best per task within the six main systems is in bold. Cascade, Absence, and Deletion use trivial-pass filtering.

System ERAggTrDelCasAbsOverall Cost $/ep
Ingest Inference
Raw retrieval
BM25 1.00 0.05 0.16 0.27 0.02 0.00 0.25 $0.04
text-emb-3-small 0.96 0.33 0.46 0.17 0.04 0.00 0.33 $0.03
LLM-processed memory
Mem0 0.67 0.35 0.43 0.21 0.03 0.00 0.28 $0.10 $0.01
Graphiti 0.03 0.01 0.04 0.09 0.02 0.01 0.03 $0.55 $0.00
File-based agents
MD-flat 0.94 0.45 0.77 0.25 0.06 0.05 0.42 $0.04 $0.01
Karpathy Wiki 0.11 0.18 0.27 0.03 0.01 0.02 0.10 $1.17 $0.22
Internal-LLM swap (20-ep subset)
MD-flat × Opus 4.7 claude-opus-4-7 · 20 ep 0.60 0.80 0.20 0.80 0.32 0.59 0.55 $3.87 $0.66
Average (six main systems) 0.620.230.350.170.030.010.24

Walkthrough Episode sw_033 — software project

17 sessions (5 evidence, 12 filler) · ~35K tokens · cascade root: team_lead

Result grid (6 tasks × 7 systems, after-questions)

BM25 text-emb-3-small Graphiti Karpathy Wiki MD-flat Mem0 MD-flat × Opus 4.7 claude-opus-4-7
ER
Agg
Tr
Del
Cas
Abs
Total 2/63/60/61/62/62/6

Click a cell to inspect the agent's actual answer for that task–system pair.

Inspect a system

memory size gold-fact %
filler session evidence session probe (evaluation event)