
noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

If you've ever tried to pick an STT vendor for a phone-based voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, Common Voice, LibriSpeech) are mostly clean read speech, which tells you very little about how STT models actually handle your G.711-encoded, noisy phone calls.

Annotating production audio is slow, expensive, and usually a privacy headache. So most teams end up benchmarking on clean data, picking a vendor, then discovering in prod which one actually survives noise.

noisekit fills that gap. Take a clean annotated dataset, apply degradations that approximate your production conditions, and you end up with a noisy annotated corpus you can run WER on across every STT candidate.

    uvx noisekit generate \
      --dataset google/fleurs --config en_us --split test \
      --samples 100 \
      --output ./noisy-fleurs

Feed ./noisy-fleurs through each STT candidate, normalize, and compute WER with the existing transcripts. The output is HuggingFace AudioFolder-compatible, so load_dataset("audiofolder", data_dir="./noisy-fleurs") works.
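For anyone who wants to see what the WER step amounts to, here is a minimal pure-Python sketch. It is not part of noisekit: `word_errors` and `corpus_wer` are hypothetical helper names, and the normalization here is only lowercasing plus whitespace splitting. A real benchmark would also want punctuation and number normalization, or an established WER library.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    # Assumption: lowercase + whitespace split is the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs) -> float:
    """Corpus-level WER: total word errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words if words else 0.0
```

Pooling errors over the whole corpus, rather than averaging per-utterance WER, keeps very short utterances from dominating the score.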

Presets cover the conditions that actually matter for voice products:

  • telecom: G.711 narrowband bandpass + 8-bit BitCrush + 16-32 kbps MP3 (sounds like a real phone call, not a synthetic low-pass filter)
  • noise: real ambient mixed at 5-15 dB SNR (auto-downloads a MUSAN noise-only subset, or bring your own --noise-dir matching your domain: call center, cafe, car, street)
  • reverb: pyroomacoustics far-field at 1-3 m mic distance
  • low_bitrate: wideband MP3 at 16-32 kbps
  • clipping: ADC / mic saturation
  • clean_reference: control / WER floor
  • compound chains stack realistically. noise_telecom = noisy room then phone codec, which is what an actual support call sounds like.
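To make the "mixed at 5-15 dB SNR" idea concrete, here is a rough sketch of how mixing at a target SNR generally works. This is not noisekit's actual code: `mix_at_snr` is a hypothetical function, the signals are plain lists of float samples, and looping the noise to cover the utterance is my own assumption.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Assumption: loop the noise clip so it covers the whole utterance.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise is silent, cannot scale to a target SNR")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values mean louder noise relative to the speech, so 5 dB is a much harsher condition than 15 dB.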

Each output gets PESQ, SNR and NISQA scores in metadata.jsonl alongside the original transcript, so you can correlate WER with measured signal quality after the fact.
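A sketch of what that after-the-fact correlation could look like, under stated assumptions: `file_name` is the column AudioFolder expects in metadata.jsonl, but the quality key name (e.g. `"snr"`) and the `wer_by_file` dict of per-utterance WER are hypothetical, so check the actual metadata.jsonl for the real field names.

```python
import json
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def quality_vs_wer(metadata_lines, wer_by_file, quality_key):
    """Pair each utterance's quality score with its WER and correlate them.

    metadata_lines: iterable of JSON strings, one per line of metadata.jsonl.
    wer_by_file:    hypothetical dict mapping file_name -> per-utterance WER.
    quality_key:    assumed name of the quality field, e.g. "snr".
    """
    scores, wers = [], []
    for line in metadata_lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        name = row.get("file_name")
        if name in wer_by_file and quality_key in row:
            scores.append(float(row[quality_key]))
            wers.append(float(wer_by_file[name]))
    return pearson(scores, wers)
```

You would expect a negative coefficient for SNR or PESQ (higher quality, lower WER); a vendor whose coefficient is close to zero is either very robust or failing for reasons other than signal quality.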

Repo: https://github.com/karamouche/noisekit (MIT, uvx-runnable so zero install)

Genuinely curious to hear from people who've benchmarked STT in production: what degradation conditions am I missing?

submitted by /u/Karamouche