How we catch silent NPU fallback on Snapdragon in CI [D]

Posting because I've now seen this exact bug at multiple teams shipping ML to Snapdragon, and the pattern is worth writing up.

ONNX Runtime's QNN execution provider (the one that targets Qualcomm's Hexagon NPU on Snapdragon SoCs) will silently route unsupported ops to the CPU. Your accuracy is fine, your eval latency on the dev board looks fine, but production latency mysteriously triples because the input distribution stresses fallback paths differently — and the runtime never raises anything louder than a startup-log line nobody reads.

The default median-of-N latency gate doesn't catch this, because fallback creates a bimodal distribution and the median lands on the fast cluster. Three things end up being necessary:

1. **Run on real hardware**: emulators implement the ISA in software so every op is "supported" (for the wrong reason), and cloud x86 doesn't load the QNN EP at all
2. **Gate on coefficient of variation alongside median**: healthy on-NPU CV is 2–5%, intermittent fallback pushes it above 15%
3. **Parse the ORT profiling JSON and assert NPU FLOP percentage**: the routing info is in there but you have to opt into `profiling_level=detailed` and post-process it; the default warning-level log just says "23 nodes assigned to QNN, 7 to CPU"
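To make the second point concrete, here is a minimal sketch of a gate that checks CV next to the median. The function name `latency_gate`, the 12 ms budget, and the 0.10 CV cutoff are assumptions for illustration (the cutoff simply sits between the 2–5% healthy range and the 15% fallback range quoted above), and the latency samples are synthetic, not measurements.

```python
import statistics

def latency_gate(samples_ms, median_budget_ms, max_cv=0.10):
    """Return (passed, median, cv) for a list of per-inference latencies.

    max_cv=0.10 is an assumed cutoff, not a value from the ORT or QNN docs.
    """
    median = statistics.median(samples_ms)
    mean = statistics.fmean(samples_ms)
    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.pstdev(samples_ms) / mean
    passed = median <= median_budget_ms and cv <= max_cv
    return passed, median, cv

# Synthetic data: every run on the fast path, small jitter around 10 ms.
healthy = [10.0 + 0.2 * (i % 5) for i in range(100)]

# Synthetic data: 70% of runs on the fast path, 30% hitting a 30 ms fallback.
bimodal = [10.0 + 0.2 * (i % 5) for i in range(70)] + [30.0] * 30

for name, samples in (("healthy", healthy), ("bimodal", bimodal)):
    passed, median, cv = latency_gate(samples, median_budget_ms=12.0)
    print(f"{name}: median={median:.1f} ms, cv={cv:.1%}, passed={passed}")
```

In the bimodal case the median still lands on the fast cluster and stays under budget, so a median-only gate would pass; the CV term is what fails the run.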

The third one is the diagnostic that actually identifies which op fell back, so you can either swap it for a supported equivalent, pin the QNN SDK, or escalate to firmware.
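A rough sketch of that post-processing step, under stated assumptions: it treats the profile as a Chrome-trace-style list of events where node events carry `cat == "Node"`, a `dur` in microseconds, and `args.provider` / `args.op_name`. Check those field names against a real trace before relying on them. It also uses share of node time per provider as a stand-in for FLOP percentage, since a FLOP count would need a per-op cost table that isn't shown here. The helper names, the 0.95 threshold, and the sample events are all hypothetical.

```python
from collections import defaultdict

NPU_PROVIDER = "QNNExecutionProvider"

def provider_time_share(events):
    """Return (npu_share, fallback_ops) from a list of profile events.

    fallback_ops maps op name -> total microseconds spent off the NPU.
    """
    time_by_provider = defaultdict(int)
    fallback_ops = defaultdict(int)
    for ev in events:
        args = ev.get("args") or {}
        provider = args.get("provider")
        # Skip session-level events and node events without a provider.
        if ev.get("cat") != "Node" or not provider:
            continue
        dur = ev.get("dur", 0)
        time_by_provider[provider] += dur
        if provider != NPU_PROVIDER:
            fallback_ops[args.get("op_name", "unknown")] += dur
    total = sum(time_by_provider.values())
    npu_share = time_by_provider[NPU_PROVIDER] / total if total else 0.0
    return npu_share, dict(fallback_ops)

def assert_npu_share(events, min_share=0.95):
    """Fail the CI step if too little node time ran on the NPU provider."""
    share, fallback = provider_time_share(events)
    if share < min_share:
        raise AssertionError(
            f"NPU share {share:.1%} below {min_share:.0%}; off-NPU ops: {fallback}"
        )
    return share

# Hypothetical events; a real run would json.load() the profile file instead.
sample = [
    {"cat": "Session", "name": "model_run", "dur": 1300},
    {"cat": "Node", "name": "partition_0_kernel_time", "dur": 900,
     "args": {"provider": NPU_PROVIDER, "op_name": "FusedPartition"}},
    {"cat": "Node", "name": "topk_kernel_time", "dur": 300,
     "args": {"provider": "CPUExecutionProvider", "op_name": "TopK"}},
]

share, fallback = provider_time_share(sample)
print(f"NPU share: {share:.0%}, off-NPU ops: {fallback}")
```

Listing the off-NPU op names in the failure message is the point: the CI log then names the op to swap or escalate, instead of only reporting that something fell back.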

Wrote up the full pattern with the actual Python (CV gating function + ORT profile parser): https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci

Curious if anyone here has hit similar silent-fallback patterns with TensorRT on Jetson or CoreML on iOS — I'd expect the symptom (bimodal latency, silent provider routing) but haven't gone digging. Same with ExecuTorch.

submitted by /u/NoAdministration6906