1 min read · from Machine Learning

What should a PyTorch training end-of-run performance summary show? [D]


For most slow PyTorch runs, the first question isn't "show me every trace event"; it's simply: where do I even start?

- where did step time go?
- was the run input-bound, compute-bound, or wait-heavy?
- were ranks imbalanced?
- was memory stable or creeping up?

I have been thinking about what a compact end-of-run summary would look like: something lightweight enough to run on every job, not just on dedicated profiling runs.
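As a rough sketch of what could feed such a summary: wrap the training loop, record how long each step spent waiting on the input pipeline versus doing everything else, then aggregate at the end. Everything below is hypothetical — `RunSummary` and its fields are not a real TraceML or PyTorch API, and the 30% input-fraction threshold is an arbitrary illustrative cutoff.

```python
import statistics

class RunSummary:
    """Hypothetical collector: per-step timings in, compact report out."""

    def __init__(self):
        self.data_s = []   # seconds each step spent waiting on the dataloader
        self.step_s = []   # total wall-clock seconds of each step

    def record(self, data_s, step_s):
        self.data_s.append(data_s)
        self.step_s.append(step_s)

    def report(self):
        steps = sorted(self.step_s)
        total = sum(steps)
        input_frac = sum(self.data_s) / total if total else 0.0
        # p95 step time highlights stragglers that a mean would hide
        p95 = steps[max(0, int(0.95 * len(steps)) - 1)]
        return {
            "steps": len(steps),
            "mean_step_s": statistics.mean(steps),
            "p95_step_s": p95,
            "input_frac": input_frac,
            # illustrative threshold: flag input-bound if >30% of step
            # time went to waiting on data
            "verdict": "input-bound" if input_frac > 0.3 else "compute-bound",
        }
```

In a real loop you would bracket the dataloader fetch and the full step with `time.perf_counter()` (calling `torch.cuda.synchronize()` before the final timestamp so GPU work is actually counted) and pass the two durations to `record`. Rank imbalance and memory creep would need extra signals — e.g. gathering per-rank step times and sampling `torch.cuda.max_memory_allocated()` — which this sketch leaves out.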

Here's one example of what that output could look like:

https://preview.redd.it/2q71s9ltkvzg1.png?width=533&format=png&auto=webp&s=cde99ed3224d723bb6dba200b326da826ba4f587

Curious how others are solving this today. What would make something like this useful? What is missing?

submitted by /u/traceml-ai


Tagged with

#PyTorch
#training
#end-of-run
#performance summary
#input-bound
#compute-bound
#step time
#memory stable
#profiling runs
#wait-heavy
#compact summary
#ranks imbalanced
#slow runs