#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Disclosure: first author.

Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.

96.4% at top-50 with Gemini 3 Flash. Comparative reported scores (all Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:

Query decomposition: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
Temporal salience scoring: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
Coherence re-ranking: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.

Category results at top-50: single-session (user) 98.6%, assistant 100%, preferences 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.

Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream.

Paper | Results | Answerer prompt

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.

submitted by /u/j-m-k-s
[link] [comments]

#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Want to read more?

Tagged with