[P] GPU friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode works for AMD & NVIDIA
Hi everyone, I'm from Australia : ) I just released a new research prototype. It's a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code, while keeping sign + mantissa in exactly 1 byte per element. Byte-aligned split storage means true 12 bits per weight: no 16-bit padding waste and zero HBM read amplification. Yes, 12 bits, not 11 !! The main idea was not just to "compress weights more", but to make the format GPU-friendly enough to use directly during inference.
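To make the idea concrete, here is a minimal NumPy sketch of how such a format *could* work. Everything here is an assumption for illustration (function names, the per-tensor base exponent, and the escape handling are mine, not the repo's code): BF16 exponents in a weight tensor cluster into a narrow band, so a 4-bit offset from a per-tensor base exponent can replace the full 8-bit field, and weights whose exponent falls outside the 16-value window escape to full 16-bit storage. Decoding the exponent is then a single integer ADD.

```python
import numpy as np

# Hypothetical sketch of a 12-bit BF16 split format (not the repo's code).
# BF16 word layout: bit 15 = sign, bits 14-7 = biased exponent, bits 6-0 = mantissa.

def encode(bf16_words, base_exp):
    """Split BF16 words into a 4-bit exponent code and a sign+mantissa byte.

    Words whose exponent doesn't fit in [base_exp, base_exp + 16) are
    returned unmodified as 16-bit "escape" values.
    """
    exp = ((bf16_words >> 7) & 0xFF).astype(np.int16)       # 8-bit biased exponent
    code = exp - base_exp                                   # offset from per-tensor base
    ok = (code >= 0) & (code < 16)                          # fits in 4 bits?
    sm = ((bf16_words >> 8) & 0x80) | (bf16_words & 0x7F)   # sign bit + 7 mantissa bits
    return code[ok].astype(np.uint8), sm[ok].astype(np.uint8), bf16_words[~ok]

def decode(code, sm, base_exp):
    """Reconstruct BF16 words; one integer ADD recovers the exponent."""
    exp = code.astype(np.uint16) + base_exp                 # the single ADD
    sign = (sm.astype(np.uint16) & 0x80) << 8
    mant = sm.astype(np.uint16) & 0x7F
    return sign | (exp << 7) | mant
```

Under this scheme the quoted 0.03% escape rate would simply be the fraction of weights whose exponent misses the 16-value window.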
Some results so far:

- Single-user (B=1), RTX 5070 Ti
- Multi-user (B=256), total tok/s

It also seems surprisingly stable across model types.
So far this is tested on BF16 safetensors only. Repo: https://github.com/cenconq25/Turbo-Lossless

Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026). Happy to hear criticism, edge cases, or reasons this idea won't scale. Thanks for your time : )
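For intuition on the "byte-aligned split storage, no 16-bit padding" claim, here's a small sketch (my own illustration, not the repo's layout) of how the two planes could be stored: the 4-bit codes packed two per byte, alongside one sign+mantissa byte per weight, giving exactly 1.5 bytes = 12 bits per weight.

```python
import numpy as np

# Hypothetical packing of 4-bit exponent codes, two per byte.
# N weights -> N/2 code bytes + N sign+mantissa bytes = 1.5 bytes/weight.

def pack_codes(codes):
    """Pack an even-length array of 4-bit codes two per byte (low nibble first)."""
    lo = codes[0::2] & 0x0F
    hi = codes[1::2] & 0x0F
    return (lo | (hi << 4)).astype(np.uint8)

def unpack_codes(packed):
    """Inverse of pack_codes: recover the original 4-bit code array."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

Because both planes are byte-aligned, a GPU kernel can read each with plain coalesced byte loads, which is presumably what makes the format usable directly during inference.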