Cleo: trying to fit full analyst behavior in a 2B model [P]
Hello all!
Half of all industrial "chatbots" are just text-to-SQL models in a trenchcoat (and the other half are RAG!). I wanted to explore just how small you could make these models if you trained, evaluated, and ran inference in the exact same structured harness. The result is Cleo: a Qwen3.5-2B-Base fine-tune.
Currently, some features of Cleo that are only possible or useful in a unified harness are:
- Training on the exact same gather, repair, and answer contract it uses at inference time
- Searching over candidate queries with live execution evidence, not just model likelihood
- Co-designing the model contract, SQL safety layer, dialect handling, timeouts, and clarification behavior as one system
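To make the second and third bullets concrete, here is a minimal sketch of execution-guided candidate selection behind a crude read-only guard. This is not Cleo's actual code: `is_read_only`, `score_candidate`, `pick_best`, the toy `orders` table, and the log-probability values are all hypothetical, and SQLite stands in for whatever dialects the real harness supports. Timeouts and the repair step are left out.

```python
import sqlite3


def is_read_only(sql: str) -> bool:
    """Crude safety check (assumption): allow one SELECT/WITH statement only."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject anything that looks like stacked statements
        return False
    first = stripped.split(None, 1)[0].lower() if stripped else ""
    return first in ("select", "with")


def score_candidate(conn: sqlite3.Connection, sql: str, logprob: float):
    """Score as (executed_ok, returned_rows, logprob).

    Tuples compare left to right, so execution evidence outranks
    model likelihood; likelihood only breaks ties.
    """
    if not is_read_only(sql):
        return (False, False, logprob)
    try:
        rows = conn.execute(sql).fetchmany(5)
    except sqlite3.Error:
        return (False, False, logprob)
    return (True, len(rows) > 0, logprob)


def pick_best(conn: sqlite3.Connection, candidates):
    """candidates: list of (sql, logprob) pairs; returns the winning SQL."""
    return max(candidates, key=lambda c: score_candidate(conn, c[0], c[1]))[0]


def demo() -> str:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EU", 10.0), (2, "US", 25.0)],
    )
    candidates = [
        # Highest likelihood, but references a column that does not exist.
        ("SELECT SUM(amount) FROM orders", -0.1),
        # Blocked by the read-only guard before it ever reaches the database.
        ("DROP TABLE orders", -0.2),
        # Lowest likelihood, but it actually runs and returns a row.
        ("SELECT SUM(total) FROM orders WHERE region = 'EU'", -0.9),
    ]
    return pick_best(conn, candidates)
```

In this toy setup the least likely candidate wins because it is the only one that both passes the guard and executes, which is the point of ranking on execution evidence instead of likelihood alone. A real safety layer would need proper SQL parsing; a keyword check like this one is only for illustration.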
Everything is completely open-source, including the harness, model, and datasets.
GitHub: https://github.com/Dreeseaw/cleo
Hugging Face model: https://huggingface.co/dreeseaw/cleo
PS: If you're also resource-constrained and trying to do RL like me, I would highly recommend experimenting with ECHO: https://arxiv.org/abs/2605.24517