Author: @Ghogha Atif
CLIP Paper: Click Here
SigLIP Paper: Click Here
Hi! Hope you’re doing good :)
In this blog post, I will dive deep into SigLIP (Zhai et al.).
Contrastive pre-training in CLIP (Contrastive Language-Image Pre-training) aligns visual and textual representations: the model is trained to pull matching image-text pairs (an image and its caption or description) closer together in a shared embedding space, while pushing non-matching pairs apart.
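To make that concrete, here is a minimal PyTorch sketch of the softmax-based contrastive loss (the batch size, embedding dimension, and temperature are illustrative assumptions, not CLIP's exact training configuration). The diagonal of the image-text similarity matrix holds the matching pairs, and a symmetric cross-entropy pulls them together while pushing the off-diagonal pairs apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: logits[i, j] = sim(image_i, text_j)
    logits = image_emb @ text_emb.t() / temperature

    # The matching text for image i sits at column i (the diagonal)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric softmax cross-entropy: images -> texts and texts -> images
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random "embeddings" for a batch of 8 image-text pairs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Note that every row of the similarity matrix is normalized with a softmax over the whole batch; this global normalization is exactly what SigLIP later replaces with a per-pair sigmoid.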
CLIP itself is a joint image and text embedding model trained on 400 million web-collected image-text pairs, using the captions themselves as supervision rather than manual labels.
It aligns text and images in the same embedding space, meaning that, for instance, an image of Ayanokoji and the phrase “an image of Ayanokoji” will have similar embeddings and be positioned close together in the vector space.
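Here is a short sketch of that alignment in practice, using a pretrained CLIP checkpoint from Hugging Face Transformers (the checkpoint name is a common public one; the local file "ayanokoji.jpg" and the candidate captions are placeholders for illustration).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("ayanokoji.jpg")  # any local RGB image
texts = ["an image of Ayanokoji", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarities between the image and each caption;
# a softmax over them shows which caption the image is closest to
# in the shared embedding space.
print(outputs.logits_per_image.softmax(dim=-1))
```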