4 min readfrom Machine Learning

We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]

We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]
We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]

We evaluated six models on English subtitle translation into Spanish, Japanese, Korean, Thai, Chinese Simplified, and Chinese Traditional - 167 segments per language pair, scored with two reference-free QE metrics.

Models tested:

  • TranslateGemma-12b
  • claude-sonnet-4-6
  • deepseek-v3.2
  • gemini-3.1-flash-lite-preview
  • gpt-5.4-mini
  • gpt-5.4-nano

Scoring

We used MetricX-24 (lower = better) and COMETKiwi (higher = better) - both reference-free QE metrics. We also developed a combined score:

TQI = COMETKiwi × exp(−MetricX / 10)

The exponential decay term converts MetricX into a multiplicative fidelity penalty. When MetricX is near 0, TQI ≈ COMETKiwi. As MetricX grows, the penalty increases exponentially. TQI is our own metric, not an industry standard.

Top-level results (avg TQI across all 6 languages)

Rank Model Avg TQI
#1 TranslateGemma-12b 0.6335
#2 gemini-3.1-flash-lite-preview 0.5981
#3 deepseek-v3.2 0.5946
#4 claude-sonnet-4-6 0.5811
#5 gpt-5.4-mini 0.5785
#6 gpt-5.4-nano 0.5562

All models sit between 0.75-0.79 on COMETKiwi (fluency). Models diverge significantly on MetricX-24 fidelity scores - that's where the TQI separation comes from.

A few things worth discussing:

1. Metric-model affinity concern One caveat worth noting: MetricX-24 is a Google metric and TranslateGemma is a Google model. COMETKiwi - from Unbabel - shows a noticeably smaller gap between TranslateGemma and the field. The direction of the result holds either way, but the size of the lead may be partially inflated by metric-model affinity.

2. Claude collapses in Japanese claude-sonnet-4-6 ranked last (#6) in Japanese - MetricX 3.90, its worst result across all languages. Its COMETKiwi (0.79) was decent. Classic fluency-fidelity mismatch: output that sounds natural but drifts from source meaning.

3. Gemini Flash Lite outperforms full-sized frontier models A "lite" model consistently ranked #2-3, beating Claude Sonnet and both GPT-5.4 variants across most languages.

4. TranslateGemma ranked #1 - then human QA found something the metrics had missed entirely TranslateGemma topped every language. When our linguists reviewed the Traditional Chinese (zh-TW) output, the model was outputting Simplified Chinese for both zh-CN and zh-TW language codes. We then investigated community reports suggesting zh-Hant as the correct explicit tag for Traditional Chinese and retested with it. Result: 76% of segments still came back Simplified, 14% Traditional, 10% ambiguous (segments too short or script-neutral to classify).

https://preview.redd.it/h6gfrd0ew4vg1.jpg?width=773&format=pjpg&auto=webp&s=fbe0afae3831528440b956167456e94004bcbe09

MetricX-24 and COMETKiwi scored both outputs identically and highly - no indication of a problem from either metric.

As it turns out, this is a confirmed, publicly documented issue caused by training data bias: TranslateGemma's fine-tuning corpus is heavily skewed toward Simplified Chinese. The locale tags are accepted without error but not honored by the model's weights. This affects all model sizes (4B, 12B, 27B) - upgrading to a larger model size won't fix it, since the root cause is training data composition, not capacity. A workaround exists (OpenCC s2twp post-processing), but standard QE metrics will look fine the whole time - that's exactly the problem for any pipeline relying on automated validation.

submitted by /u/ritis88
[link] [comments]

Want to read more?

Check out the full article on the original site

View original article

Tagged with

#natural language processing for spreadsheets
#generative AI for data analysis
#Excel alternatives for data analysis
#natural language processing
#real-time data collaboration
#big data management in spreadsheets
#conversational data analysis
#google sheets
#rows.com
#financial modeling with spreadsheets
#intelligent data visualization
#data visualization tools
#enterprise data management
#big data performance
#data analysis tools
#data cleaning solutions
#enterprise-level spreadsheet solutions
#automated anomaly detection
#large dataset processing
#real-time collaboration
We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]