Evaluating long-term memory limits in stateless LLM chatbots

Hi all,

I’m working on a research project exploring how stateless LLM-based chatbots handle long conversations and whether important earlier information is still reliably retained over time.

My idea is to:

Run a chatbot using an LLM API without any external memory system
Introduce key facts early in a long conversation
Continue with many unrelated messages (hundreds of turns)
Later test whether the model can still correctly recall those facts at different intervals

I’m planning to measure recall accuracy and how it changes as the conversation grows.

Before I go deeper, I’d really appreciate feedback on:

Is this a valid way to evaluate long-context memory limits?
Are there better benchmarks or methods already used for this?
What metrics would make this more rigorous and convincing?

Any suggestions or criticism are welcome. I’m trying to make the evaluation as solid as possible before building it out.

Thanks!

submitted by /u/QuietAccountant4237
[link] [comments]

Evaluating long-term memory limits in stateless LLM chatbots — feedback needed [D]

Want to read more?

Tagged with