From Machine Learning

#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Disclosure: first author.

Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.

96.4% overall at top-50 with Gemini 3 Flash as the answering model. Reported scores for comparison (all with Gemini 3 Pro): Mem0 94.8%, Honcho 92.6%, HydraDB 90.79%, Supermemory 85.2%.

Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:

  • Query decomposition: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
  • Temporal salience scoring: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
  • Coherence re-ranking: candidates are re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.
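
To make the first two choices concrete, here is a minimal sketch, not our actual implementation. Everything in it is an assumption for illustration: the `Memory` type, the bag-of-words cosine standing in for embedding similarity, the 30-day half-life, and the 0.6/0.3/0.1 weights. Sub-queries are supplied by hand and the passes run sequentially rather than in parallel. Coherence re-ranking is left out because its details are not described in this post.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical memory record; frozen so it can be used as a dict key."""
    text: str
    age_days: float  # time elapsed since the memory was stored


def _tokens(text: str) -> list[str]:
    return text.lower().split()


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def lexical_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the memory."""
    q = set(_tokens(query))
    return len(q & set(_tokens(text))) / len(q) if q else 0.0


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(query: str, memory: Memory, weights=(0.6, 0.3, 0.1)) -> float:
    """Weighted sum of the three signals; the weights are illustrative only."""
    w_sem, w_lex, w_time = weights
    return (
        w_sem * cosine(query, memory.text)
        + w_lex * lexical_overlap(query, memory.text)
        + w_time * recency(memory.age_days)
    )


def retrieve(sub_queries: list[str], memories: list[Memory], top_k: int = 50) -> list[Memory]:
    """One retrieval pass per sub-query; keep each memory's best score."""
    best: dict[Memory, float] = {}
    for q in sub_queries:
        for m in memories:
            s = score(q, m)
            if s > best.get(m, float("-inf")):
                best[m] = s
    return sorted(best, key=best.get, reverse=True)[:top_k]


if __name__ == "__main__":
    memories = [
        Memory("I adopted a dog named Biscuit", 200.0),
        Memory("My dog Biscuit had surgery last week", 3.0),
        Memory("I prefer dark roast coffee", 10.0),
    ]
    # A question like "What happened to the dog I adopted?" split by hand
    # into two narrower information needs.
    for m in retrieve(["dog named", "Biscuit surgery"], memories, top_k=2):
        print(m.text)
```

Taking the maximum score across passes means a fragment only has to match one sub-query well to survive the cut, which is the point of decomposing multi-session questions.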

Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.
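
For readers unfamiliar with the setup, the "single generic prompt" idea can be sketched roughly as below. The template wording, `GENERIC_TEMPLATE`, and `build_prompt` are hypothetical and written for this example; the real answerer prompt is the one linked in this post. The only property the sketch is meant to show is that every question, whatever its category, goes through the same template with the same top-k cutoff.

```python
# Hypothetical template: the wording is illustrative, not the actual answerer prompt.
GENERIC_TEMPLATE = (
    "You are answering a question using retrieved conversation memories.\n"
    "Memories (most relevant first):\n{memories}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str, memories: list[str], top_k: int = 50) -> str:
    """Fill the one shared template with at most top_k ranked memories."""
    kept = memories[:top_k]
    numbered = "\n".join(f"{i}. {m}" for i, m in enumerate(kept, start=1))
    return GENERIC_TEMPLATE.format(memories=numbered, question=question)


if __name__ == "__main__":
    ranked = [f"memory {i}" for i in range(60)]  # placeholder retrieval output
    print(build_prompt("Where did I say I was moving to?", ranked)[:200])
```

With no category-specific branches in the prompt, score differences between question types are more likely to come from retrieval than from prompt tuning.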

Category results at top-50: single-session (user) 98.6%, single-session (assistant) 100%, single-session (preference) 96.7%, knowledge update 97.4%, multi-session 94.0%, temporal reasoning 95.5%.
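
As a quick consistency check, the category percentages can be converted back into counts of correct answers and summed. The per-category question counts below are not stated in this post; they are assumed from the commonly cited breakdown of the 500-question LongMemEval set, so treat this as a sketch under that assumption.

```python
# (assumed question count, reported accuracy in %) per category.
# The counts are an assumption, not taken from this post.
CATEGORIES = {
    "single-session (user)": (70, 98.6),
    "single-session (assistant)": (56, 100.0),
    "single-session (preference)": (30, 96.7),
    "knowledge update": (78, 97.4),
    "multi-session": (133, 94.0),
    "temporal reasoning": (133, 95.5),
}


def implied_correct(n_questions: int, accuracy_pct: float) -> int:
    """Nearest whole number of correct answers implied by a rounded percentage."""
    return round(n_questions * accuracy_pct / 100)


total_questions = sum(n for n, _ in CATEGORIES.values())
total_correct = sum(implied_correct(n, pct) for n, pct in CATEGORIES.values())
overall_pct = 100 * total_correct / total_questions

if __name__ == "__main__":
    print(total_questions, total_correct, round(overall_pct, 1))  # 500 482 96.4
```

Under those counts the categories imply 482 of 500 correct, which matches the 96.4% headline figure.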

Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.

Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, and dataset inconsistencies. We identified some benchmark errors and reported them upstream.

Paper | Results | Answerer prompt

Curious if others have explored similar cognitive-science-informed retrieval architectures for conversational memory.

submitted by /u/j-m-k-s