Built an LLM training framework that actually runs on older GPUs without crashing [P]

Hey guys,

I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level (flash-attn, triton, functorch, etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just crashes on import.

So I wrote Picotron (https://github.com/Syntropy-AI-Labs/picotron) to solve this. It's a clean-room rewrite that gets rid of all mandatory GPU-specific dependencies.

It runs on pretty much any GPU that PyTorch supports. It defaults to FP16 on cards below compute capability 8.0 and to BF16 on newer ones. Attention uses standard PyTorch SDPA by default, but it still hooks into FlashAttention-2 at runtime if it detects that you have it installed.
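To make the selection logic concrete, here is a minimal sketch of how that kind of runtime detection can be done. This is not Picotron's actual code: the function names `pick_dtype_name` and `pick_attention_backend` and the returned strings are made up for illustration. The only assumptions are that compute capability arrives as a `(major, minor)` tuple and that FlashAttention-2 is importable as the `flash_attn` package.

```python
import importlib.util


def pick_dtype_name(compute_capability):
    """Pick a training dtype from a (major, minor) compute capability tuple.

    Hypothetical helper: BF16 for compute capability 8.0 and up,
    FP16 for anything older (e.g. T4 = 7.5, V100 = 7.0).
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


def pick_attention_backend():
    """Pick an attention backend without importing anything heavy.

    Hypothetical helper: find_spec only checks whether the package can be
    located, so a missing flash_attn never raises at import time.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn_2"
    return "sdpa"  # i.e. torch.nn.functional.scaled_dot_product_attention
```

With PyTorch installed, the tuple would come from `torch.cuda.get_device_capability()`, and the returned name can be turned into a dtype with `getattr(torch, name)`. Keeping the check inside a function is what keeps the optional dependency out of the module-level imports.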

I used an AI assistant to write a lot of the boilerplate/code modules, but I've got it working locally and just trained a tiny 2M-parameter model on FineWeb-Edu.

Also added configs for:

• GQA / MLA (Multi-head Latent Attention)

• QK-Norm & logit soft-capping (Gemma 2 style)

• Parallel FFN/Attn runs

• ZeRO-1 wrapping on DDP
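As an illustration of the logit soft-capping item in the list above, here is a minimal pure-Python sketch of the usual formulation, `cap * tanh(logits / cap)`. It is not taken from Picotron; the function name `soft_cap` and the cap value of 50.0 are assumptions for the example, and a real implementation would apply the same formula to tensors.

```python
import math


def soft_cap(logits, cap):
    """Squash each logit smoothly into [-cap, cap] via cap * tanh(x / cap).

    Hypothetical helper: small values pass through almost unchanged,
    large values saturate near +/- cap instead of growing without bound.
    """
    return [cap * math.tanh(x / cap) for x in logits]


capped = soft_cap([-1000.0, -1.0, 0.0, 1.0, 1000.0], cap=50.0)
```

Unlike a hard clamp, the tanh keeps the function differentiable everywhere, which is the reason it gets used on attention scores and output logits.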

Roadmap is pretty short right now:

• MoE prep (routing capacity factors and load-balancing loss)

• Making dataset prep easier than streaming manually
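For context on the MoE item, here is a sketch of one common form of load-balancing loss, the Switch Transformer style auxiliary loss with top-1 routing. It says nothing about how Picotron will implement it: the function name `load_balancing_loss`, the `alpha` default, and the list-of-lists input are all assumptions chosen to keep the example dependency-free.

```python
def load_balancing_loss(router_probs, num_experts, alpha=0.01):
    """Auxiliary loss: alpha * num_experts * sum_i(f_i * p_i).

    router_probs: one list of per-expert probabilities for each token.
    f_i: fraction of tokens whose top-1 expert is i.
    p_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    token_counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        # Top-1 routing: each token goes to its highest-probability expert.
        top1 = max(range(num_experts), key=lambda e: probs[e])
        token_counts[top1] += 1
        for e in range(num_experts):
            prob_sums[e] += probs[e]
    f = [count / num_tokens for count in token_counts]
    p = [total / num_tokens for total in prob_sums]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]], num_experts=2)
```

The loss is smallest when tokens are spread evenly across experts, so `collapsed` (every token sent to expert 0) comes out larger than `balanced`.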

Check it out if you've been fighting with CUDA dependency hell: https://github.com/Syntropy-AI-Labs/picotron

submitted by /u/Capital_Savings_9942