from r/MachineLearning
Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]
After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs are structurally disadvantaged relative to transformers in a time- and size-constrained regime (10 min training, 16MB artifact, 25M parameters) on 8xH100s: https://mradassaad.github.io/posts/why-ssms-struggle-in-parameter-golf/
Main findings:
- SSM in_proj weights compress up to 3.26x worse than attention QKV under LZMA, directly taxing the compressed parameter budget
- Architectural wins validated at SP4096 flipped sign at SP8192 — two configs that looked like clean wins reversed direction at the target vocabulary
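The compression finding can be checked directly: a minimal sketch of measuring a weight matrix's LZMA compression ratio, the metric behind the 3.26x gap. The tensor here is a synthetic stand-in, not the actual in_proj or QKV weights, and `lzma_ratio` is an illustrative helper rather than code from the post.

```python
import lzma

import numpy as np

def lzma_ratio(arr: np.ndarray) -> float:
    """Raw bytes / LZMA-compressed bytes; higher means more compressible."""
    raw = arr.tobytes()
    return len(raw) / len(lzma.compress(raw, preset=9))

# Synthetic stand-in for a weight matrix; real measurements would use the
# trained in_proj and QKV tensors at the checkpoint's storage dtype.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
print(f"compression ratio: {lzma_ratio(w):.2f}x")
```

Under a compressed-artifact budget, a tensor that compresses worse effectively costs more of the 16MB allowance per parameter, which is why the ratio itself becomes a design metric.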
The write-up also covers three kernel-level experiments on the Mamba-3 Triton kernels: a backward-fusion attempt that was numerically exact but 16% slower due to shared-memory (SMEM) pressure, a torch.compile quantizer bug that cost 5.5 mBPB, and a mixed-precision dynamics-protection scheme that recovered 0.8 mBPB at negligible size cost.
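The post doesn't spell out the dynamics-protection mechanism, but the general technique is to keep the small state-dynamics parameters at full precision while casting everything else down. A sketch under that assumption, with illustrative parameter names and shapes (not taken from the post), showing why the size cost is negligible:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative parameter dict; names/shapes are assumptions, not the post's.
params = {
    "in_proj":  rng.standard_normal((512, 1536)).astype(np.float32),
    "out_proj": rng.standard_normal((768, 512)).astype(np.float32),
    "A_log":    rng.standard_normal((768, 16)).astype(np.float32),  # SSM dynamics
    "dt_bias":  rng.standard_normal(768).astype(np.float32),        # SSM dynamics
}
PROTECT = {"A_log", "dt_bias"}  # keep these in fp32, cast the rest to fp16

def artifact_bytes(protect: set[str]) -> int:
    return sum(
        p.nbytes if name in protect else p.astype(np.float16).nbytes
        for name, p in params.items()
    )

full_fp16 = artifact_bytes(set())
protected = artifact_bytes(PROTECT)
print(f"size overhead from protection: {protected / full_fp16 - 1:.2%}")
```

Because the dynamics parameters are orders of magnitude smaller than the projection matrices, protecting them costs roughly a percent of artifact size, consistent with a quality win at "negligible size cost".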