
Transformers with Selective Access to Early Representations [R]


Hello everyone. I’m excited to share our new paper!

Figure 1: Comparison Across Architectures

A lot of recent Transformer variants try to improve information flow across depth by exposing later layers to earlier representations. You may have recently heard about methods like DenseFormer, MUDDFormer, and HyperConnections, which add more dense or dynamic cross-layer pathways. These are expressive, but they can also come with meaningful throughput and memory costs.

Our question was more specific: Can we improve the efficiency-performance tradeoff at scale by enabling more principled reuse of early representations?

We introduce SATFormer, which keeps the same cheap first-layer value pathway used by value residual learning, but replaces static layer-wise mixing with a per-token, per-head, context-dependent gate. Instead of uniformly copying early features into every later layer, SATFormer learns when and where each head should re-access the first-layer value stream.
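To make the idea concrete, here is a minimal NumPy sketch of a per-token, per-head gate over a first-layer value stream. All names (`gated_value_mix`, `w_gate`) and the specific gate parameterization (a sigmoid of a linear projection of the hidden state) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_value_mix(v_layer, v_first, h, w_gate):
    """Mix current-layer values with first-layer values via a
    per-token, per-head, context-dependent gate (illustrative sketch).

    v_layer: (seq, heads, d_head) values computed at the current layer
    v_first: (seq, heads, d_head) cached first-layer values
    h:       (seq, d_model)       hidden states the gate conditions on
    w_gate:  (d_model, heads)     hypothetical gate projection weights
    """
    gate = sigmoid(h @ w_gate)        # (seq, heads), each entry in (0, 1)
    g = gate[:, :, None]              # broadcast over the head dimension
    # Convex combination: gate decides how much first-layer signal
    # each head re-accesses at each token position.
    return g * v_first + (1.0 - g) * v_layer

# Tiny usage example with random tensors
seq, heads, d_head, d_model = 4, 2, 8, 16
rng = np.random.default_rng(0)
v_layer = rng.normal(size=(seq, heads, d_head))
v_first = rng.normal(size=(seq, heads, d_head))
h = rng.normal(size=(seq, d_model))
w_gate = np.zeros((d_model, heads))   # zero weights -> gate = 0.5 everywhere
mixed = gated_value_mix(v_layer, v_first, h, w_gate)
```

With zero gate weights the sigmoid outputs 0.5 everywhere, so the mix reduces to a plain average of the two value streams; a learned `w_gate` would instead make the mixing sparse and context-dependent, which is the behavior the mechanistic analysis points to.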

Main results:

  • Across 130M–1.3B models, SATFormer improves validation loss over both Transformer and ResFormer baselines.
  • On retrieval-intensive benchmarks, SATFormer gets the best average score among the evaluated architectures, narrowly surpassing MUDDFormer and improving over ResFormer by about 1.5 average points.
  • SATFormer's throughput is close to that of Transformer and ResFormer, which run roughly 1.75×–1.82× faster than HyperConnections and MUDDFormer.
  • Mechanistic analysis suggests the gate is not just acting like a dense residual shortcut: access is sparse, depth-dependent, head-specific, and stronger for specific tokens.

The core framing is that early-representation reuse may be better treated as a retrieval/control problem rather than a connectivity/maximal routing problem. Overall, I am excited to discuss better approaches to improving the Transformer architecture while maintaining high throughput.

Arxiv: https://arxiv.org/pdf/2605.03953

github (still WIP): https://github.com/SkyeGunasekaran/SATFormer

submitted by /u/Skye7821


Tagged with

#SATFormer
#Transformers
#early representations
#efficiency-performance tradeoff
#context-dependent gate
#retrieval-intensive benchmarks
#information flow
#validation loss
#DenseFormer
#MUDDFormer
#value residual learning