From Machine Learning

Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

We built a monokernel that runs the full decode sequence as one GPU-resident program on AMD MI300X, with some neat optimizations. The die topology is central to the result: we map memory access patterns to the physical layout and group compute units by their associated IOD, so the hardware can run at its full design performance.
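To make the "group compute units by IOD" idea concrete, here is a minimal sketch of the bookkeeping involved. It is not the kernel from this post. The chip figures (8 XCDs, 4 IODs, 38 active CUs per XCD) are the publicly documented MI300X numbers. The round-robin workgroup dispatch and the pairing of adjacent XCD ids onto one IOD are assumptions made for illustration, and all function names are hypothetical.

```python
# Publicly documented MI300X figures.
NUM_XCDS = 8          # accelerator complex dies
NUM_IODS = 4          # I/O dies, each hosting two XCDs
CUS_PER_XCD = 38      # active compute units per XCD (304 in total)
XCDS_PER_IOD = NUM_XCDS // NUM_IODS


def xcd_of_workgroup(wg_id: int) -> int:
    """ASSUMPTION: workgroups are dispatched round-robin across XCDs."""
    return wg_id % NUM_XCDS


def iod_of_xcd(xcd: int) -> int:
    """ASSUMPTION: XCD ids 2k and 2k+1 sit on IOD k (illustrative pairing)."""
    return xcd // XCDS_PER_IOD


def group_workgroups_by_iod(num_workgroups: int) -> dict[int, list[int]]:
    """Bucket workgroup ids by the IOD their XCD is attached to."""
    groups: dict[int, list[int]] = {iod: [] for iod in range(NUM_IODS)}
    for wg in range(num_workgroups):
        groups[iod_of_xcd(xcd_of_workgroup(wg))].append(wg)
    return groups


if __name__ == "__main__":
    for iod, wgs in group_workgroups_by_iod(16).items():
        print(f"IOD {iod}: workgroups {wgs}")
```

With a grouping like this, each bucket of workgroups could be pointed at the data closest to its IOD, which is the general shape of mapping access patterns to the physical layout.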

Up to 3,300 output tokens/s per request, batch size 1, no speculative decoding, no quantization, on 8x MI300X.
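For a sense of scale, the headline number can be turned into a per-token time budget. The sketch below is back-of-the-envelope arithmetic, not a measurement from this project. The 16-bit weight width, the even split of weights across the 8 GPUs, and the use of the 5.3 TB/s peak HBM figure from the MI300X spec sheet are all assumptions, and the function names are hypothetical.

```python
def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available per output token, in microseconds."""
    return 1e6 / tokens_per_s


def weight_stream_floor_us(
    num_params: float,
    bytes_per_param: float,
    num_gpus: int,
    hbm_bytes_per_s: float,
) -> float:
    """Lower bound on the time to read every weight once per token.

    ASSUMPTION: weights are split evenly across GPUs and each GPU
    streams its shard at peak HBM bandwidth.
    """
    bytes_per_gpu = num_params * bytes_per_param / num_gpus
    return bytes_per_gpu / hbm_bytes_per_s * 1e6


if __name__ == "__main__":
    budget = per_token_budget_us(3300)
    # ASSUMPTIONS: 2B parameters at 2 bytes each, 8 GPUs, 5.3 TB/s peak HBM.
    floor = weight_stream_floor_us(2e9, 2, 8, 5.3e12)
    print(f"per-token budget:    {budget:.1f} us")
    print(f"weight-stream floor: {floor:.1f} us")
```

Under these assumptions the budget is about 303 µs per token, so anything beyond reading the weights (launch overhead, synchronization, cross-GPU communication) has to fit in what remains.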

This preview runs a small 2B coding model; we plan to support large frontier MoE models in the future.

Technical deep dive: https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus

Try it: https://playground.kog.ai

submitted by /u/averne_