Extracting Books from LLMs.
The arXiv paper “Extracting books from production language models” by Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang is alarming but not in the least surprising. The abstract:
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model’s weights during training, and whether those memorized data can be extracted in the model’s outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure […]. With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer’s Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
Écrasez l’infâme ! And if you’re tired of thinking about the evils of LLMs, I bring you news of An Old Welsh Reader, edited by Simon Rodway:
This reader contains edited texts, with English translations, of all the independent texts extant in manuscripts of the ninth, tenth, and eleventh centuries, with a selection of twelfth-century texts. They are accompanied by extensive notes and glossaries, along with an introduction which considers the prehistory of Welsh and its relationship with other Celtic languages. The volume also contains a comprehensive list of the sources of Old Welsh and an outline grammar: the first specifically dedicated to Old Welsh to appear in English. Appendices contain editions of one of the very few ancient Celtic texts from Britain, the Bath pendant, and the only sizeable text in another early medieval Brittonic language, the Old Cornish portion of the Leiden leechbook.
Now that’s my idea of a good time.