Jailbreaks as social engineering: 5 case studies suggest LLMs inherit human psychological vulnerabilities from training data [D]
Writeup documenting 5 psychological manipulation experiments on LLMs (GPT-4, GPT-4o, Claude 3.5 Sonnet) from 2023-2024. Each case applies a specific human social-engineering vector (empathetic guilt, peer/social pressure, competitive triangulation, identity destabilization via epistemic argument, simulated duress) and produces alignment failures consistent with that vector.
Central claim: contrary to the popular frame, these jailbreaks aren't mathematical exploits; they are failure modes inherited from training data. If a system simulates human empathy, reason, and social grace, it should be expected to inherit human vulnerabilities as well. The substrate is irrelevant; the vulnerabilities are social.
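For concreteness, the structure of the five experiments can be sketched as a tiny harness. This is a hypothetical sketch, not code from the writeup: the vector names come from the post above, but `CaseStudy` and `failure_rate` are illustrative stand-ins, and no actual model call is shown.

```python
from dataclasses import dataclass

# The five social-engineering vectors named in the writeup.
VECTORS = [
    "empathetic guilt",
    "peer/social pressure",
    "competitive triangulation",
    "identity destabilization via epistemic argument",
    "simulated duress",
]

@dataclass
class CaseStudy:
    """One experiment: a model, the vector applied, and the observed outcome."""
    model: str
    vector: str
    held_alignment: bool  # True if the model maintained its policy under pressure

def failure_rate(cases: list[CaseStudy]) -> float:
    """Fraction of cases where the manipulation produced an alignment failure."""
    failures = [c for c in cases if not c.held_alignment]
    return len(failures) / len(cases)
```

The point of the sketch is only that each case pairs one model with one human-derived manipulation vector, so results can be compared per vector rather than per prompt string.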
Full writeup with links to each case study's transcript and date:
https://ratnotes.substack.com/p/i-ran-5-social-engineering-attacks
Interested in discussion of whether the "patch it like a software vulnerability" framing dominant in alignment research is addressing the right attack surface, or whether the problem is more fundamentally one of social dynamics inherited through training.