
Built an LLM training framework that actually runs on older GPUs without crashing [P]

Hey guys,

I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level (flash-attn, triton, functorch, etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just crashes on import.
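To make the import problem concrete, here is a minimal sketch of the usual workaround: treat the heavy package as optional and probe for it instead of importing it unconditionally at the top of a module. The helper name `optional_import` and the `HAS_FLASH_ATTN` flag are made up for illustration and are not taken from either codebase.

```python
import importlib


def optional_import(name: str):
    """Import `name` if it is installed, otherwise return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Covers ModuleNotFoundError too, since it subclasses ImportError.
        return None


# Hypothetical usage: probe for the optional dependency and remember the result.
flash_attn = optional_import("flash_attn")
HAS_FLASH_ATTN = flash_attn is not None
```

With this pattern a missing package turns into a flag you can branch on later, rather than a crash the moment the training code is imported.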

So I wrote Picotron (https://github.com/Syntropy-AI-Labs/picotron) to solve this. It's a clean-room rewrite that gets rid of all mandatory GPU-specific dependencies.

It runs on pretty much any GPU that supports PyTorch (defaults to FP16 on older cards under compute capability 8.0, and BF16 on newer ones). It falls back to standard PyTorch SDPA by default, but still hooks into FlashAttention-2 at runtime if it detects you have it installed.
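Roughly, the selection logic described above could look like the sketch below. It is written against plain values so it runs without a GPU; the function names and the backend strings are assumptions for illustration, not Picotron's actual API. In a real setup the capability tuple would come from `torch.cuda.get_device_capability()` and the SDPA path would call `torch.nn.functional.scaled_dot_product_attention`.

```python
from typing import Tuple


def pick_dtype_name(capability: Tuple[int, int]) -> str:
    """Pick a training dtype name from a CUDA compute capability (major, minor).

    Follows the rule stated in the post: FP16 below 8.0, BF16 at 8.0 and above.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


def pick_attention_backend(flash_attn_installed: bool) -> str:
    """Use FlashAttention-2 only when it was detected, otherwise PyTorch SDPA."""
    return "flash_attn_2" if flash_attn_installed else "sdpa"


# T4 is compute capability 7.5, V100 is 7.0, A100 is 8.0.
for gpu, cap in [("T4", (7, 5)), ("V100", (7, 0)), ("A100", (8, 0))]:
    print(gpu, pick_dtype_name(cap), pick_attention_backend(False))
```

The returned name can be turned into a real dtype with `getattr(torch, name)` once PyTorch is imported.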

I used an AI assistant to write a lot of the boilerplate/code modules, but I've got it working locally and just trained a tiny 2M model on FineWeb-Edu.

Also added configs for:

• GQA / MLA (Multi-head Latent Attention)

• QK-Norm & logit soft-capping (Gemma 2 style)

• Parallel FFN/Attn runs

• ZeRO-1 wrapping on DDP
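For anyone unfamiliar with two of the items in that list, here is a small pure-Python sketch of the underlying math: logit soft-capping squashes a logit into (-cap, cap) with a scaled tanh, and GQA lets several query heads share one key/value head. The function names are hypothetical and this is the textbook form of both ideas, not code lifted from Picotron.

```python
import math


def soft_cap(logit: float, cap: float) -> float:
    """Gemma 2 style soft-capping: smoothly bound a logit to (-cap, cap)."""
    return cap * math.tanh(logit / cap)


def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Map a query head index to the key/value head it shares under GQA."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Small logits pass through almost unchanged, huge ones saturate near the cap.
print(soft_cap(1.0, 50.0), soft_cap(1000.0, 50.0))

# 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
print([kv_head_for(h, 8, 2) for h in range(8)])
```

In a real model both operations run on tensors (`torch.tanh`, and repeating or broadcasting the KV heads), but the arithmetic is the same.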

Roadmap is pretty short right now:

  1. MoE prep (routing capacity factors and load balancing loss)
  2. Making dataset prep easier than streaming manually
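Since the MoE item is still on the roadmap, the following is only a sketch of what those two terms usually mean, using the Switch Transformer style formulation with top-1 routing (my assumption, the post does not say which variant is planned). All names here are hypothetical.

```python
import math
from typing import List


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Max tokens one expert accepts per batch: even share times a slack factor."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def load_balancing_loss(router_probs: List[List[float]]) -> float:
    """Auxiliary loss N * sum_i(f_i * P_i) for top-1 routing.

    f_i is the fraction of tokens routed to expert i, P_i is the mean router
    probability for expert i. Evenly spread routing gives 1.0.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=probs.__getitem__)
        counts[top1] += 1
        for expert, p in enumerate(probs):
            prob_sums[expert] += p
    return num_experts * sum(
        (counts[e] / num_tokens) * (prob_sums[e] / num_tokens)
        for e in range(num_experts)
    )


balanced = [[0.9, 0.1], [0.1, 0.9]]   # each expert gets one token
collapsed = [[0.9, 0.1], [0.9, 0.1]]  # every token goes to expert 0
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

The collapsed case scores higher than the balanced one, which is what pushes the router toward spreading tokens across experts. In training this term is normally multiplied by a small coefficient before being added to the main loss.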

Check it out if you've been fighting with CUDA dependency hell: https://github.com/Syntropy-AI-Labs/picotron

submitted by /u/Capital_Savings_9942