Author: @Ghogha Atif
CLIP Paper: Click Here
SigLIP Paper: Click Here
Hi! Hope you’re doing good :)
In this blog post, I will dive deep into SigLIP (Zhai et al.).
Contrastive pre-training in CLIP (Contrastive Language-Image Pre-training) aligns visual and textual representations: the model is trained to pull matching image-text pairs (an image and its caption or description) closer together in a shared embedding space, while pushing non-matching pairs apart.
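To make that concrete, here is a minimal PyTorch sketch of the softmax-based contrastive loss (the batch size, embedding dimension, and temperature are illustrative assumptions, not CLIP's exact training configuration). The diagonal of the image-text similarity matrix holds the matching pairs, and a symmetric cross-entropy pulls them together while pushing the off-diagonal pairs apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: logits[i, j] = sim(image_i, text_j)
    logits = image_emb @ text_emb.t() / temperature

    # The matching text for image i sits at column i (the diagonal)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric softmax cross-entropy: images -> texts and texts -> images
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random "embeddings" for a batch of 8 image-text pairs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Note that every row of the similarity matrix is normalized with a softmax over the whole batch; this global normalization is exactly what SigLIP later replaces with a per-pair sigmoid.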
CLIP itself is a joint image and text embedding model trained on 400 million web-collected image-text pairs, using the captions themselves as supervision rather than manual labels.
It aligns text and images in the same embedding space, meaning that, for instance, an image of Ayanokoji and the phrase “an image of Ayanokoji” will have similar embeddings and be positioned close together in the vector space.
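Here is a short sketch of that alignment in practice, using a pretrained CLIP checkpoint from Hugging Face Transformers (the checkpoint name is a common public one; the local file "ayanokoji.jpg" and the candidate captions are placeholders for illustration).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("ayanokoji.jpg")  # any local RGB image
texts = ["an image of Ayanokoji", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarities between the image and each caption;
# a softmax over them shows which caption the image is closest to
# in the shared embedding space.
print(outputs.logits_per_image.softmax(dim=-1))
```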