
Question About SigLIP 2’s Performance with Newline-Separated Labels

#2
by zfjerome1 - opened

I’ve been experimenting with SigLIP 1 and SigLIP 2 in the Hugging Face Space and noticed something interesting. When I input labels in two formats, comma-separated (e.g., "photojournalism photography, editorial photography") versus newline-separated (e.g., "photojournalism photography,\neditorial photography,\n..."), SigLIP 1 consistently performs more accurately, while SigLIP 2 seems to perform better when the labels include newlines. I have also tested with a diverse set of images and noticed a similar pattern.

Could you shed light on why SigLIP 2 handles newline-separated labels better? Is this an intentional design choice, like training on noisier text data, or an artifact of the tokenizer?

Comma-separated:
[screenshot: image.png]

Comma-separated + newlines:
[screenshot: image.png]

Thank you!

This is unrelated to the model - you're using the Space incorrectly. Do not use any newlines; just comma-separate the labels, nothing else.
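For reference, the same label-formatting issue applies when calling the model directly: the candidate labels should be a clean list of strings, with no stray newlines, since any leftover whitespace becomes part of the text the model scores. Below is a minimal sketch (not the Space's actual code; the `parse_labels` helper and the checkpoint name are assumptions) of normalizing a comma-separated label string before passing it to the standard transformers zero-shot pipeline:

```python
# Sketch: normalize a user-typed label string into the list of class
# names a zero-shot image-classification pipeline expects. Stray
# newlines that are not stripped end up inside the label text and
# change the prompt the model actually scores.

def parse_labels(raw: str) -> list[str]:
    """Split on commas and strip surrounding whitespace, including newlines."""
    return [label.strip() for label in raw.split(",") if label.strip()]

labels = parse_labels("photojournalism photography,\neditorial photography,\n")
print(labels)  # ['photojournalism photography', 'editorial photography']

# With the labels cleaned, a SigLIP checkpoint can then be used via the
# transformers pipeline (checkpoint name here is an assumption):
#
# from transformers import pipeline
# clf = pipeline("zero-shot-image-classification",
#                model="google/siglip2-base-patch16-224")
# results = clf("photo.jpg", candidate_labels=labels)
```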

zfjerome1 changed discussion status to closed
