Author: @Ghogha Atif

CLIP Paper: https://arxiv.org/abs/2103.00020

SigLIP Paper: https://arxiv.org/abs/2303.15343

Hi! Hope you’re doing well :)

In this blog, I will dive deep into SigLIP (Zhai et al., with Lucas Beyer among the authors).

Focus of the Paper


Understanding Contrastive Pre-training

Contrastive pre-training in CLIP (Contrastive Language-Image Pre-training) aligns visual and textual representations: the model is trained to pull matching image-text pairs (an image and its caption or description) closer together in a shared embedding space, while pushing non-matching pairs apart.
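To make this concrete, here is a minimal sketch of the symmetric softmax-based contrastive loss that CLIP uses. The function name and the fixed temperature value are my own choices for illustration; in the actual paper the temperature is a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched
    image-text pairs. Pairs at the same row index are positives; every
    other combination in the batch acts as a negative."""
    # L2-normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: logits[i, j] = sim(image_i, text_j)
    logits = image_emb @ text_emb.t() / temperature

    # Positives sit on the diagonal, so the target "class" for row i is i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Softmax cross-entropy in both directions: image-to-text and text-to-image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Note how each direction normalizes over the whole batch via the softmax: this batch-wide coupling is exactly what SigLIP's sigmoid loss later removes.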


CLIP: An Idea

CLIP is a joint image and text embedding model trained on 400 million image-text pairs, where the natural pairing of images with their captions provides the supervision signal.

It aligns both text and images in the same embedding space, meaning that, for instance, an image of Ayanokoji and the phrase “an image of Ayanokoji” will have similar embeddings and be positioned close together in the vector space.
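You can see this shared space in action with a few lines of code. Below is a minimal sketch using the Hugging Face transformers CLIP checkpoint; the image path is a placeholder, and the candidate captions are just examples.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("ayanokoji.jpg")  # placeholder path
texts = ["an image of Ayanokoji", "a photo of a dog", "a bowl of ramen"]

# Tokenize the captions and preprocess the image into one batch
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each text embedding in the shared space
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # the matching caption should receive the highest probability
```

Because image and text encoders map into the same space, this similarity-ranking trick is all it takes to do zero-shot classification: just phrase each class as a caption.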