From Machine Learning
Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]
For some time now, the research community has offered tokenization schemes for vision that seem more efficient and effective than fixed patches. Is there any hint that non-fixed-patch tokenization is being used in the big players' models?
I imagine not, and I'm trying to think why:
- marginal gains?
- pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or even harder limitations)?
- scaling laws for input-adaptive patching are not well understood, so big players won't bet on it?
Or am I simply wrong, and under the hood all the big players already do dynamic tokenization for vision?
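
To make the "fixed number of tokens upfront" point concrete, here is a toy sketch comparing token counts. It is not any production model's method or a specific published tokenizer: `adaptive_patch_count`, its variance-based quadtree split rule, and the `min_patch`, `max_patch`, and `var_threshold` values are all assumptions I made up for illustration. The only thing it shows is that a fixed-patch ViT has a token count that depends on resolution alone, while an input-adaptive scheme has one that depends on image content.

```python
import random


def fixed_patch_count(height, width, patch=16):
    """Tokens from a fixed-patch ViT: one per non-overlapping patch."""
    return (height // patch) * (width // patch)


def adaptive_patch_count(img, min_patch=16, max_patch=64, var_threshold=1e-3):
    """Toy input-adaptive patching (hypothetical rule, for illustration only).

    Start from max_patch tiles and split a tile into four while its pixel
    variance exceeds var_threshold and it is still larger than min_patch.
    Assumes image sides are multiples of max_patch.
    """
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def count(y, x, size):
        pixels = [v for row in img[y:y + size] for v in row[x:x + size]]
        if size > min_patch and variance(pixels) > var_threshold:
            half = size // 2
            return sum(count(y + dy, x + dx, half)
                       for dy in (0, half) for dx in (0, half))
        return 1  # this tile becomes a single token

    height, width = len(img), len(img[0])
    return sum(count(y, x, max_patch)
               for y in range(0, height, max_patch)
               for x in range(0, width, max_patch))


rng = random.Random(0)
size = 256
flat = [[0.0] * size for _ in range(size)]                        # no detail
noisy = [[rng.random() for _ in range(size)] for _ in range(size)]  # detail everywhere
mixed = [[0.0] * (size // 2) + [rng.random() for _ in range(size // 2)]
         for _ in range(size)]                                    # half and half

print("fixed:", fixed_patch_count(size, size))      # same for every image
print("adaptive, flat:", adaptive_patch_count(flat))
print("adaptive, noisy:", adaptive_patch_count(noisy))
print("adaptive, mixed:", adaptive_patch_count(mixed))
```

With these made-up settings, the three 256x256 images all cost 256 tokens under fixed patching but 16, 256, and 136 tokens under the adaptive rule. That variable length is what makes batching and per-image budgeting harder, which is the second guess in the list above.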