[R] I think I found something about embeddings. Polysemy doesn't predict variance, frequency does. Calling it Contextual Promiscuity Index.

I was working on word-sense disambiguation research at home and kind of noticed something. I'm posting to find out if this is already known or actually interesting.

The assumption I started with is that polysemous words have messy embeddings. More dictionary senses, so more geometric fragmentation. Seems obvious, but no.

I measured mean pairwise cosine similarity across 192 words using Qwen2.5-7B, extracting at layer 10 (found via layer sweep). Correlation between WordNet sense count and embedding variance: Spearman rho = -0.057, p = 0.43. Basically nothing.
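For anyone who wants to see the measurement concretely, here is a minimal sketch of mean pairwise cosine similarity for one word. It assumes you already have one vector per context occurrence; the toy 2-d vectors and the function names are made up for illustration and are not the actual extraction code or Qwen outputs.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of context vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two context vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for one word's embeddings in three different contexts.
tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]    # contexts agree
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # contexts disagree

print(mean_pairwise_cosine(tight))
print(mean_pairwise_cosine(spread))
```

A lower mean similarity means the word's context embeddings are more spread out. Whether you report the similarity itself or something like 1 minus it as the "variance" is a convention choice, and it flips the sign of every correlation computed from it.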

What does predict it is frequency: rho = -0.239, p = 0.0008, holding up after controlling for polysemy (partial r = -0.188). This kind of makes sense once you think about it. "Break" has 60 WordNet senses, but most are metaphorical extensions of the core idea. The model treats them as variations on a theme and the embedding stays coherent. Meanwhile "face" gets pulled in multiple directions by its various co-occurrence patterns, even though it has fewer formal senses.
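To make "controlling for polysemy" concrete, here is one common way to do it: a first-order partial correlation computed on Spearman rank correlations. This is a sketch under that assumption, not necessarily how the partial r above was computed, and the three toy lists are invented numbers, not my data.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Rank correlation of x and y with the rank-linear effect of z removed."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented toy values, one entry per word.
frequency = [1, 2, 3, 4, 5]
similarity = [2, 1, 4, 3, 5]
sense_count = [1, 3, 2, 5, 4]

print(spearman(frequency, similarity))
print(partial_spearman(frequency, similarity, sense_count))
```

If the partial value stays close to the raw rho, as it roughly does in the numbers above, the frequency effect is not just polysemy in disguise.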

I'm calling this the Contextual Promiscuity Index (CPI). It's a per-word, per-model, per-knowledge-domain score for how geometrically dispersed a word's embeddings are across contexts. High-frequency words are promiscuous not because they mean more things, but because they show up everywhere.

Possible uses I've been thinking about: flagging unreliable query terms in RAG pipelines, guiding precision allocation in embedding table compression, or identifying noisy tokens during pretraining. I ran some retrieval experiments trying to demonstrate the RAG angle and got results in the right direction, but too weak to be statistically significant. My corpus was probably too small (about 1,000 documents), and I don't have the compute to push it further right now.

I'm sharing the finding while it's still just a finding. Code available if anyone wants it.

Is this already known? And does anyone have a cleaner experiment in mind?

submitted by /u/Intraluminal