Making LLMs tell you how confident they really are through probe-targeted fine tuning.[R]

Just wanted to share my research regarding probe-targeted fine-tuning (LoRa) for verbal confidence calibration.,

If you probe the hidden states of an instruct-tuned LLM, it can tell correct from incorrect answers at 0.76–0.88 AUROC. But when you ask it directly it tends to respond with confidence at 99% for everything. The model knows if it actually knows but it won't admit it.

I took the probe's output and used it as fine-tuning targets. This teaches the model to say out loud what it already knows internally. LoRA, few hundred examples, under 10 minutes on an M3 Ultra.

I tested on 8 models across 4 families (7B–70B).

Activation patching shows it's actually causal. Not just a correlation. If you swap hidden states at the confidence position you can watch confidence shift (ρ = 0.976 layer gradient). If swap occurs at a random position then nothing happens.
At 70B, the softmax distribution carries valid metacognitive signal but the argmax text is still stuck at 99% confident. The model learned the routing internally but can't get pass the text bottleneck.
Seed-level replication across 3 models . The discrimination is stable, but the shape of the confidence distribution is seed-sensitive.

I pre-registered this across 2 studies (with noted deviations) and have all my code available (Code: github.com/synthiumjp/metacog-engineering). I tried to make it as rigourous and replicable as possible. The pre-print is here: https://zenodo.org/records/20436841

submitted by /u/Synthium-
[link] [comments]

Want to read more?