From Machine Learning

Custom image encoder [P]

Hello, I would like to know whether building my own image encoder would be a good idea instead of using models like CLIP, SigLIP/SigLIP2, or DINO.

My use case is video frame classification.

My pipeline is the following: the client sends me a video stream, which I sample at one frame every 1 or 2 seconds and group into segments of 15 frames (30 seconds at the 2-second interval). I compute an embedding for each frame and feed the resulting sequence to a small custom Transformer (1.5M to 9M parameters).
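To make the segmenting step concrete, here is a minimal sketch of how the sampled frames (or their embeddings) could be grouped before they reach the Transformer. The names `iter_segments` and `segment_duration_s` are hypothetical, and dropping a trailing partial segment is an assumption made for the sketch, not something the pipeline above specifies.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_segments(frames: Iterable[T], segment_len: int = 15) -> Iterator[List[T]]:
    """Group a stream of sampled frames (or embeddings) into fixed-size segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    buffer: List[T] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == segment_len:
            yield buffer
            buffer = []
    # Assumption: a trailing partial segment is dropped rather than padded.


def segment_duration_s(segment_len: int, sample_interval_s: float) -> float:
    """Wall-clock time covered by one segment at a given sampling interval."""
    return segment_len * sample_interval_s


if __name__ == "__main__":
    segments = list(iter_segments(range(32)))
    print(len(segments), [len(s) for s in segments])
    print(segment_duration_s(15, 1.0), segment_duration_s(15, 2.0))
```

With 15 frames per segment, the covered duration depends on the sampling interval, so the same Transformer input spans a different amount of video at 1 second versus 2 seconds per frame.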

This works very well on GPU. However, I have two main constraints: processing speed and deployment on small CPU-only devices.

A CLIP-S0 encoder processes around 10 images per second on 4 vCPUs. I would like to replace it with my own encoder, with only a few million parameters, trained on my own dataset (a few million images, around 4 to 5 labels).
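As a rough way to compare encoders on the target device, the sketch below times any encoding callable and converts its throughput into the number of streams it could keep up with. `measure_throughput`, `max_streams`, and the dummy encoder are hypothetical names used only for illustration. The estimate assumes the encoder is the only cost, so it ignores the Transformer, decoding, and I/O.

```python
import math
import time
from typing import Callable, Sequence


def measure_throughput(encode_fn: Callable[[object], object],
                       images: Sequence[object],
                       warmup: int = 2) -> float:
    """Return images per second for encode_fn, timed on the given images."""
    for image in images[:warmup]:
        encode_fn(image)  # warm-up calls are not timed
    start = time.perf_counter()
    for image in images:
        encode_fn(image)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed if elapsed > 0 else float("inf")


def max_streams(encoder_fps: float, sample_interval_s: float) -> int:
    """Streams one encoder can serve if each stream needs one frame per interval."""
    return int(math.floor(encoder_fps * sample_interval_s))


if __name__ == "__main__":
    # Placeholder encoder: stands in for a real model call.
    dummy_encoder = lambda image: sum(image)
    fps = measure_throughput(dummy_encoder, [list(range(1000))] * 50)
    print(f"dummy encoder: {fps:.0f} images/s")
    # Using the 10 images/s figure quoted above:
    print(max_streams(10, 1.0), max_streams(10, 2.0))
```

Under these assumptions, 10 images per second covers about 10 streams at one frame per second and about 20 at one frame every 2 seconds, which gives a baseline to beat when benchmarking a smaller encoder on the same 4 vCPUs.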

My question is whether this is a good approach, and whether it would improve both embedding generation speed and the accuracy of my Transformer model.

submitted by /u/These_Try_656
