from r/MachineLearning

ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D]

A few weeks ago I asked about people's experiences with ROCm in a post:

https://www.reddit.com/r/MachineLearning/comments/1t6cng3/rocm_status_in_mid_2026_d/

I actually went and procured an RX 7900 XTX (reference version) to give it a try.

My finding is that it still kind of sucks.

I have a small codebase for training flow matching models that runs fine on my RTX 3090s. But the moment I ported it over to ROCm, it was NaNs absolutely everywhere. The code was kept identical, apart from altering the pip environment to point to torch 2.12 with ROCm 7.2 instead of CUDA.
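For reference, the environment swap amounts to reinstalling the wheels from PyTorch's ROCm index instead of the CUDA one. This is a sketch assuming the usual `download.pytorch.org/whl/rocmX.Y` index naming; the exact version strings for torch 2.12 / ROCm 7.2 may differ:

```shell
# Remove the CUDA build, then install the ROCm build of the same torch version.
# The rocm7.2 index path is an assumption based on how PyTorch names its wheel indexes.
pip uninstall -y torch
pip install torch==2.12 --index-url https://download.pytorch.org/whl/rocm7.2

# Sanity check: a ROCm build reports a HIP version instead of a CUDA version.
python -c "import torch; print(torch.version.hip, torch.cuda.is_available())"
```

On ROCm builds, `torch.cuda.*` calls are transparently routed to HIP, which is why unmodified CUDA code can run at all.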

Everything I tried, from switching between bf16 and fp32 to tweaking various environment variables, yielded nothing.
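One way to at least localize where the NaNs first appear (not something from the original post; a generic debugging sketch) is to register forward hooks on every module and fail loudly at the first non-finite output. The demo below forces NaN weights into one layer on CPU to show the hooks firing; `install_nan_hooks` is a hypothetical helper name:

```python
import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module) -> None:
    """Register forward hooks that raise on the first module emitting NaN/Inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"non-finite output from module '{name}'")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Tiny CPU demo: deliberately corrupt the weights of the last layer.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
install_nan_hooks(model)
with torch.no_grad():
    model[2].weight.fill_(float("nan"))

try:
    model(torch.randn(1, 4))
    culprit = "no NaN detected"
except RuntimeError as e:
    culprit = str(e)  # names the first NaN-producing module
print(culprit)
```

Running the failing training step under `torch.autograd.set_detect_anomaly(True)` is a complementary trick for NaNs that first appear in the backward pass.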

Unless there's some trick I'm missing, I get the feeling that ROCm is still seriously behind.

I tried running the nanoGPT training script, which ran perfectly

My intuition is that the ROCm team has probably tested their stack against well-known, established codebases, but it's still remarkably fragile on even slightly uncommon code.

submitted by /u/QuantumQuokka


Tagged with

#ROCm
#PyTorch
#training flow matching models
#PyTorch Lightning
#NaNs
#RX 7900XTX
#bf16
#fp32
#nanoGPT
#CUDA
#environment variables
#torch2.12
#codebase
#pip environment