Cleo: trying to fit full analyst behavior in a 2B model [P]

Hello all!

Half of all industrial "chatbots" are just text-to-SQL models in a trenchcoat (and the other half RAG!). I wanted to explore just how small you could make these models if you trained, evaluated, and ran inference in the exact same structured harness, leading to Cleo: a Qwen3.5-2B-Base finetune.

Currently, some features of Cleo that are only possible/useful in a unified harness are:

- Training on the exact same gather, repair, and answer contract it uses at inference time
- Searching over candidate queries with live execution evidence, not just model likelihood
- Co-designing the model contract, SQL safety layer, dialect handling, timeouts, and clarification behavior as one system
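
To make the "live execution evidence" point concrete, here is a toy sketch of reranking candidate queries by actually running them. This is not Cleo's code. `Candidate`, `is_read_only`, `execute_with_evidence`, `pick_candidate`, the `+1.0` evidence bonus, and the `orders` table are all hypothetical names and values chosen for illustration, and it assumes SQLite via Python's standard `sqlite3` module.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Candidate:
    sql: str
    logprob: float  # hypothetical model likelihood score for this query


def is_read_only(sql: str) -> bool:
    """Crude safety gate (assumption): allow one SELECT/WITH statement only."""
    stripped = sql.strip().rstrip(";")
    if not stripped or ";" in stripped:
        return False
    return stripped.split(None, 1)[0].lower() in ("select", "with")


def execute_with_evidence(conn, sql, max_rows=50):
    """Run a candidate and return (rows, error); exactly one of them is None."""
    if not is_read_only(sql):
        return None, "rejected: not a single read-only statement"
    try:
        return conn.execute(sql).fetchmany(max_rows), None
    except sqlite3.Error as exc:
        return None, str(exc)


def pick_candidate(conn, candidates):
    """Pick the best candidate that executes, using likelihood plus evidence."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        rows, error = execute_with_evidence(conn, cand.sql)
        if error is not None:
            continue  # failed or rejected queries are dropped, however likely
        # Illustrative scoring: small bonus when the query returns any rows.
        score = cand.logprob + (1.0 if rows else 0.0)
        if score > best_score:
            best, best_score = cand, score
    return best


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 10.0), (2, "east", 25.0)],
)

candidates = [
    Candidate("DROP TABLE orders", logprob=-0.05),             # unsafe
    Candidate("SELECT SUM(totl) FROM orders", logprob=-0.1),   # bad column
    Candidate("SELECT total FROM orders WHERE region = 'north'", logprob=-0.5),
    Candidate("SELECT SUM(total) FROM orders", logprob=-0.9),
]
best = pick_candidate(conn, candidates)
print(best.sql)
```

In this toy setup the two highest-likelihood candidates never win: one is blocked by the safety gate and the other fails at execution, so the search falls through to a query that actually runs and returns data. Timeouts, dialect handling, and the repair/clarification steps are left out entirely.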

Everything is completely open-source, including the harness, model, and datasets.

GitHub: https://github.com/Dreeseaw/cleo

Hugging Face model: https://huggingface.co/dreeseaw/cleo

PS: If you're also resource-constrained and trying to do RL like me, I would highly recommend experimenting with ECHO: https://arxiv.org/abs/2605.24517

submitted by /u/Dreeseaw