Model Summary

MEXMA-SigLIP2 combines the MEXMA multilingual text encoder with the image encoder from the SigLIP2 model. The result is a high-performance CLIP model for 80 languages. MEXMA-SigLIP2 sets a new state of the art on the Crossmodal-3600 dataset, with 62.54% R@1 for image retrieval and 59.99% R@1 for text retrieval.
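
For readers unfamiliar with the metric, R@1 (recall at 1) is the share of queries whose top-ranked candidate is the correct match. The sketch below is only an illustration of that definition, not the evaluation code used for the numbers above. The function name `recall_at_1`, the toy scores, and the assumption of exactly one correct candidate per query are all hypothetical simplifications.

```python
from typing import Sequence


def recall_at_1(similarity: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Fraction of queries whose highest-scoring candidate is the correct one.

    similarity[i][j] is the score between query i and candidate j;
    targets[i] is the index of the single correct candidate for query i
    (an assumption made for this sketch).
    """
    if not targets or len(similarity) != len(targets):
        raise ValueError("need one target per query and at least one query")
    hits = 0
    for scores, target in zip(similarity, targets):
        # Index of the best-scoring candidate for this query.
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == target)
    return hits / len(targets)


# Toy scores: 3 text queries against 3 images (made-up values).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: top candidate is 0 (correct)
    [0.2, 0.4, 0.8],  # query 1: top candidate is 2 (wrong, target is 1)
    [0.1, 0.2, 0.7],  # query 2: top candidate is 2 (correct)
]
toy_targets = [0, 1, 2]
print(recall_at_1(toy_similarity, toy_targets))  # 2 of 3 queries hit
```

With real embeddings, `similarity` would hold text-to-image scores for image retrieval and image-to-text scores for text retrieval.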

How to use

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model in bfloat16 on the GPU. trust_remote_code=True is required
# because the model class is defined in the model repository.
model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")

# Download an example image and preprocess it into a bfloat16 tensor on the GPU.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    # Captions in Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    # Softmax over the last dimension turns the image logits into probabilities.
    probs = image_logits.softmax(dim=-1)
    print(probs)
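
To read the printed probabilities, it can help to see the softmax step in isolation. The following is a minimal sketch in plain Python, assuming each row of `image_logits` holds one score per caption for a single image (the snippet above suggests this, but the model's output shape is not documented here). The helper names `softmax` and `best_label` and the logit values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def best_label(row: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the caption with the highest probability and that probability."""
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Labels mirror the three captions in the snippet above.
labels = ["кошка", "a dog", "एफिल टॉवर"]
# Made-up logits for one image; real values come from model.get_logits.
toy_logits = [1.2, 0.5, 6.0]

label, prob = best_label(toy_logits, labels)
print(label, round(prob, 3))
```

Because softmax only normalizes within a row, the probabilities compare the supplied captions against each other and say nothing about captions that were not passed in.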

Acknowledgements

I thank ML Collective for providing compute resources to train the model.

Model size: 1B params (Safetensors). Tensor type: BF16.

Evaluation results