From Machine Learning

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Autoregressive (AR) LLM world models factorize next-state generation left to right, so they cannot condition on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes), and their rollouts end up prefix-consistent but globally incoherent. Masked diffusion language models (MDLMs) sidestep this with an any-order denoising objective, which learns every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines with up to 4x as many total parameters on BLEU-1, ROUGE-L, and MAUVE across in-domain and out-of-domain splits, with lower Self-BLEU and higher Distinct-N indicating reduced prefix mode collapse. In a zero-shot transfer setting, GRPO training on MDLM-generated rollouts yields up to +15% absolute task-success gains over training on AR-generated rollouts on held-out ScienceWorld, ALFWorld, and AppWorld, across 1.2B–7B backbones (LFM2.5, Qwen3, Mistral).
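To make the conditioning argument concrete, here is a toy sketch of which tokens each factorization can see when predicting one position of a next-state string. It involves no model and is not the post's implementation: the helper names (`ar_visible`, `masked_template`, `mdlm_visible`) and the example state are hypothetical, chosen only to show a trailing status field acting as an anchor.

```python
MASK = "[MASK]"

def ar_visible(tokens, i):
    # Left-to-right factorization: position i can only see the prefix tokens[:i].
    return list(tokens[:i])

def masked_template(tokens, anchor_positions):
    # Keep the anchor tokens visible and mask every other position.
    anchors = set(anchor_positions)
    return [t if j in anchors else MASK for j, t in enumerate(tokens)]

def mdlm_visible(template, i):
    # A masked denoiser predicting position i sees every unmasked token,
    # whether it sits before or after i.
    return [t for j, t in enumerate(template) if j != i and t != MASK]

# Hypothetical next-state string with a trailing status field (assumption).
next_state = ["obs:", "door", "opens", "status:", "success"]
template = masked_template(next_state, anchor_positions=[3, 4])

print(ar_visible(next_state, 1))  # prefix only: the status field is out of reach
print(template)                   # anchors pinned, everything else masked
print(mdlm_visible(template, 1))  # the trailing anchor is available as context
```

In this toy setup the AR view of position 1 contains only `obs:`, while the masked view contains the pinned `status:` and `success` tokens, which is the sense in which an any-order objective can condition on a trailing anchor.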

submitted by /u/MegixistAlt

Tagged with

#natural language processing
#Masked Diffusion Language Models
#Agentic RL
#autoregressive LLM
#fine-tuned MDLMs
#next-state generation
#denoising objective
#task-success gains
#globally interdependent anchors