visheratin/mexma-siglip2

Model Summary

MEXMA-SigLIP2 combines the MEXMA multilingual text encoder with the image encoder from the SigLIP2 model. The result is a high-performance CLIP model covering 80 languages. MEXMA-SigLIP2 sets a new state of the art on the Crossmodal-3600 dataset, with 62.54% R@1 for image retrieval and 59.99% R@1 for text retrieval.
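
For readers unfamiliar with the R@1 metric, here is a minimal sketch of how recall@k can be computed from a query-by-candidate similarity matrix. It assumes a simplified setup in which query i has exactly one correct candidate, at index i; the actual Crossmodal-3600 evaluation protocol may differ. The function name `recall_at_k` and the toy numbers are illustrative and not part of this model's code.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix with made-up numbers:
# rows are queries, columns are candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(toy_similarity, k=1))
print(recall_at_k(toy_similarity, k=2))
```

In this toy matrix the second query ranks a wrong candidate first, so two of three queries count as hits at k=1, while all three count at k=2.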

How to use

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model in bfloat16 on the GPU; trust_remote_code is needed for the custom model class.
model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")

# Download an example image and convert it to a bfloat16 pixel tensor on the GPU.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    # Candidate captions in Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    # Softmax over the captions turns the image's logits into probabilities.
    probs = image_logits.softmax(dim=-1)
    print(probs)
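
To make the last step easier to interpret, here is a small standalone sketch of how a softmax turns one image's caption logits into probabilities and how the best caption can be picked. It assumes that `image_logits` holds one score per caption for each image, which the `softmax(dim=-1)` call above suggests. The logit values below are hypothetical and are not real model output.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


captions = ["кошка", "a dog", "एफिल टॉवर"]
example_logits = [1.5, 2.0, 9.0]  # hypothetical values, not real model output

probs = softmax(example_logits)
# Index of the caption with the highest probability.
best = max(range(len(captions)), key=lambda i: probs[i])
print(captions[best], round(probs[best], 3))
```

With the real model, the same argmax over the printed `probs` tensor would select the caption that best matches the image.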

Acknowledgements

I thank ML Collective for providing compute resources to train the model.
