from r/MachineLearning

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

DeepSeek dropped the full V4 paper this week. The April preview was 58 pages; this version adds a lot of technical depth.

What stood out for me:

FP4 quantization-aware training. They're running FP4 QAT directly in late-stage training. MoE expert weights, the main GPU memory consumer, are quantized to FP4, and the QK path in the CSA indexer uses FP4 activations. Result: a 2x speedup on the QK selector with 99.7% recall preserved, and inference runs directly on the FP4 weights.
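The paper doesn't reproduce their kernel details, but the core QAT mechanic is standard fake quantization: in the forward pass, weights are snapped to FP4-representable values, while the backward pass uses a straight-through estimator so gradients flow to the full-precision master weights. A minimal sketch, assuming the E2M1 FP4 format with per-group absmax scaling (the group size and scaling scheme here are my assumptions, not DeepSeek's):

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(w: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Quantize-dequantize to FP4 with per-group absmax scaling.
    Straight-through estimator: backward treats this as identity."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    # Scale each group so its largest magnitude maps to the FP4 max (6.0).
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / FP4_E2M1.max()
    x = g / scale
    # Snap each value to the nearest representable FP4 magnitude, keep sign.
    idx = (x.abs().unsqueeze(-1) - FP4_E2M1).abs().argmin(dim=-1)
    dq = (FP4_E2M1[idx] * x.sign() * scale).reshape(orig_shape)
    # STE: forward returns the dequantized value, backward sees identity.
    return w + (dq - w).detach()
```

At inference time the dequantize step disappears: the stored FP4 codes and group scales are consumed directly by the kernel, which is why training this way lets them serve the same weights without a post-hoc PTQ pass.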

Efficiency table is striking:

| Model | 1M-context FLOPs | KV cache |
|---|---|---|
| V3.2 | baseline | baseline |
| V4-Pro | 27% of baseline | 10% of baseline |
| V4-Flash | 10% of baseline | 7% of baseline |

Training stability: two mechanisms.

Trillion-parameter MoE training has the loss-spike problem: divergence and unpredictable failures. They document two fixes.

Anticipatory routing. They deliberately desync main-model and router updates: the current step uses the latest params for features, but routing decisions use cached older params. This breaks the feedback loop where a routing anomaly amplifies itself step over step. It costs ~20% overhead, but only kicks in during loss spikes.
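My reading of the mechanism, as a sketch (class and parameter names are mine; the paper's exact update schedule may differ): expert *selection* comes from a stale snapshot of the gate, while the live gate still receives gradients through the mixing weights, so a sudden shift in router params can't immediately redirect tokens.

```python
import copy
import torch
import torch.nn as nn

class AnticipatoryRouter(nn.Module):
    """Desynced MoE routing: live gate is trained as usual, but the
    top-k expert choice uses a snapshot refreshed every `refresh` steps."""
    def __init__(self, d_model: int, n_experts: int, refresh: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.stale_gate = copy.deepcopy(self.gate)  # frozen snapshot
        for p in self.stale_gate.parameters():
            p.requires_grad_(False)
        self.refresh, self.step = refresh, 0

    def forward(self, x: torch.Tensor, top_k: int = 2):
        live_logits = self.gate(x)             # trained as usual
        with torch.no_grad():
            stale_logits = self.stale_gate(x)  # decides which experts run
        topk = stale_logits.topk(top_k, dim=-1).indices
        # Mixing weights still come from the live logits at the
        # stale-chosen positions, so gradients reach the router.
        weights = live_logits.softmax(-1).gather(-1, topk)
        return topk, weights

    def end_step(self):
        self.step += 1
        if self.step % self.refresh == 0:
            self.stale_gate.load_state_dict(self.gate.state_dict())
```

The design point: an anomalous gradient into `gate` can't reroute tokens until the next snapshot refresh, which gives the optimizer a window to recover before the routing distribution moves.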

SwiGLU clamping. Hard limits on the SwiGLU linear path (clamped to [-10, 10]) and gate path (capped at 10) suppress the extreme activations that would otherwise cascade.
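The clamping is simple to picture in code. A minimal sketch, assuming a standard SwiGLU FFN layout (the module structure and where exactly the clamps sit are my assumptions; the paper only gives the limits):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    """SwiGLU FFN with hard activation limits: linear path clamped to
    [-10, 10], gate path capped at 10 (lower end left open)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_lin = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lin = self.w_lin(x).clamp(-10.0, 10.0)  # two-sided hard limit
        gate = self.w_gate(x).clamp(max=10.0)   # cap only
        return self.w_out(F.silu(gate) * lin)
```

Since SiLU already squashes large negative gate values toward zero, capping only the positive side of the gate is enough to bound the elementwise product at roughly 100 in magnitude before the output projection.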

Generative reward model. Instead of training separate reward models for RLHF, they use the same model to generate and evaluate. Trained on scored data, the model learns to judge its own outputs with reasoning attached. The payoff: minimal human labeling, reasoning-grounded evaluation, and a unified training pipeline.
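The inference-time shape of this is a generate-then-judge loop over one set of weights. A sketch of the scoring half, where `model` is any text-in/text-out callable; the judge prompt and the "Score: X/10" format are my invention, not the paper's template:

```python
import re

def generative_reward(model, prompt: str, response: str) -> float:
    """Ask the same model that wrote `response` to critique it with
    reasoning and end with a parseable score. Returns a reward in [0, 1]."""
    judge_prompt = (
        f"Question:\n{prompt}\n\nAnswer:\n{response}\n\n"
        "Critique the answer step by step, then end with 'Score: X/10'."
    )
    critique = model(judge_prompt)  # same weights as the generator
    m = re.search(r"Score:\s*(\d+(?:\.\d+)?)/10", critique)
    return float(m.group(1)) / 10.0 if m else 0.0
```

The reasoning text is the point: unlike a scalar reward head, the critique itself is supervisable against the scored data, which is where the "reasoning grounded" claim comes from.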

Human eval results. Chinese writing: V4-Pro takes a 62.7% win rate vs Gemini 3.1 Pro, and 77.5% on writing quality specifically. White-collar tasks (30 advanced tasks across 13 industries): V4-Pro-Max gets a 63% non-loss rate vs Opus 4.6 Max. Coding-agent eval: 52% of users said V4-Pro is ready as their default coding model, 39% leaned yes, and under 9% said no. That tracks my own use; I swapped V4-Pro into my Verdent runs last week and haven't noticed a quality hit on day-to-day work.

The headline for me is FP4 QAT with minimal quality degradation. If this generalizes, the cost structure of training and inference shifts a lot, especially in multi-agent setups where one task can spawn 5-10 model calls.

Paper link in comments.

submitted by /u/Dramatic_Spirit_8436


Tagged with

#FP4
#QAT
#V4-Pro
#MoE
#quantization