
I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013.

Here's what it ended up being:

  • 103.1 billion tokens (cl100k_base)
  • 408 million posts across 9 newsgroup hierarchies
  • 18,347 newsgroups covered
  • 33 years of continuous coverage

The processing pipeline included:

  • full deduplication
  • binary removal (alt.binaries.* excluded at the hierarchy level before record-level cleaning)
  • quoted-text handling
  • email address redaction via pattern matching, plus SHA-256 hashing of Message-IDs
  • conversion from raw MBOX archives to gzip-compressed JSONL
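The redaction and serialization steps can be sketched roughly like this. This is a minimal illustration, not the actual pipeline; the field names and the redaction pattern are assumptions:

```python
import gzip
import hashlib
import json
import re

# Simple email pattern; a production pipeline would likely use something stricter
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_record(headers, body):
    """Redact emails and hash the Message-ID, returning a JSON-ready dict."""
    msg_id = headers.get("Message-ID", "")
    return {
        # SHA-256 of the Message-ID keeps records linkable without exposing the original
        "id": hashlib.sha256(msg_id.encode("utf-8")).hexdigest(),
        "group": headers.get("Newsgroups", ""),
        "date": headers.get("Date", ""),
        # Pattern-based email redaction in the post body
        "text": EMAIL_RE.sub("[email redacted]", body),
    }

def write_jsonl_gz(records, path):
    """Serialize cleaned records as gzip-compressed JSONL, one object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Hashing the Message-ID rather than dropping it preserves cross-post and thread linkability without keeping an identifier that often embeds the poster's email domain.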

Language detection was run on every record using fastText's lid.176 model (originally from Meta AI). The corpus is 96.6% English, with meaningful representation from 100+ other languages; the soc.culture.* groups in particular have high non-English density.
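Rolling per-record LID output up into corpus-level language shares might look like the sketch below. Here `detect_lang` is a hypothetical stand-in for a call to the fastText lid.176 model, stubbed so the example is self-contained:

```python
from collections import Counter

def detect_lang(text):
    """Placeholder for fastText lid.176 inference; returns a language code
    like 'en'. In the real pipeline this would call the loaded model's
    predict method on the record text."""
    return "en"  # stub for illustration

def language_shares(records):
    """Aggregate per-record detections into fractional shares of the corpus."""
    counts = Counter(detect_lang(rec["text"]) for rec in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}
```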

The thing I find most interesting about this dataset from a training perspective is the temporal arc. Volume is sparse pre-1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet gets displaced by forums and social media. That's a 33-year window of language evolution baked into a single coherent corpus — before SEO, before engagement optimization, before AI-generated content existed.

I've published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013

Happy to answer questions about the processing pipeline or the data itself.

submitted by /u/OwnerByDane


Tagged with

#natural language processing
#large dataset processing
#data cleaning solutions
#Usenet
#corpus