From Machine Learning
An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]
I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook.
Just wrapped up another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and where the real bottlenecks live. I added Mermaid diagrams for the architecture pieces so the flow is easier to follow than a wall of text.
It's a personal learning project, still growing chapter by chapter. I'd value feedback or corrections from anyone who's run inference in production; where my mental model breaks down is exactly what I want to find. Issues and PRs welcome.
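For a rough feel of the "GPU sits mostly idle" point, here is a back-of-envelope sketch. It is not taken from the handbook: the function name `decode_step_utilization`, the simplified cost model (weights read once per decode step, KV-cache and activation traffic ignored), and all the hardware numbers are assumptions chosen as round figures for illustration.

```python
def decode_step_utilization(params, batch, bytes_per_param, peak_flops, mem_bandwidth):
    """Estimate the fraction of peak compute used in one decode step.

    Simplified model (an assumption, not a measurement): every weight is
    read from GPU memory once per step and used for one multiply-add per
    sequence in the batch. KV-cache reads, activations, and kernel launch
    overheads are ignored.
    """
    flops = 2 * params * batch              # one multiply-add = 2 FLOPs per weight per sequence
    bytes_moved = params * bytes_per_param  # weights streamed from memory once per step
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    step_time = max(compute_time, memory_time)  # the slower side sets the step time
    return compute_time / step_time


# Hypothetical round numbers, not the specs of any particular GPU or model.
PARAMS = 7e9            # 7B-parameter dense model
BYTES_PER_PARAM = 2     # 16-bit weights
PEAK_FLOPS = 300e12     # 300 TFLOP/s peak compute
MEM_BANDWIDTH = 2e12    # 2 TB/s memory bandwidth

for batch in (1, 16, 64, 256):
    util = decode_step_utilization(PARAMS, batch, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"batch={batch:>3}  estimated compute utilization ~ {util:.1%}")
```

Under these assumed numbers, a single-sequence decode step uses well under 1% of peak compute because the step time is set by streaming the weights from memory. Larger batches reuse each weight read across more sequences, which is why batching shows up so prominently in inference serving.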