Fine-Tuned CLIP-GPT2 Model for Image Captioning

This is a fine-tuned version of CLIP-GPT2 for real-time image captioning to aid the visually impaired.

Model Details:

  • Base Model: CLIP ViT-B/32
  • Fine-Tuned On: VizWiz dataset
  • Format: SafeTensors
  • Usage (extracting CLIP image features; see the caption-generation sketch after this block):
    from transformers import CLIPProcessor, CLIPModel
    from PIL import Image

    # Load the fine-tuned model and the matching CLIP processor
    model = CLIPModel.from_pretrained("vidi-deshp/clip-gpt2-finetuned")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Preprocess the input image
    image = Image.open("sample.jpg")
    inputs = processor(images=image, return_tensors="pt")

    # Extract the CLIP image embedding. Calling model(**inputs) with image-only
    # inputs raises an error (CLIP's text tower expects input_ids), so use
    # get_image_features instead.
    image_features = model.get_image_features(**inputs)
    
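The snippet above only produces the CLIP image embedding. Turning that embedding into a caption depends on the GPT-2 decoder used in this project, which the card does not show. The sketch below illustrates one common approach (a ClipCap-style prefix fed to GPT-2); the base "gpt2" checkpoint, the linear projection, and its 512-to-768 dimensions are illustrative assumptions, not components of this repository.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    decoder = GPT2LMHeadModel.from_pretrained("gpt2")

    # Hypothetical projection from CLIP's 512-dim image embedding to GPT-2's
    # 768-dim embedding space (in practice this would be a trained mapping)
    project = torch.nn.Linear(512, 768)

    with torch.no_grad():
        prefix = project(image_features).unsqueeze(1)  # shape (1, 1, 768)
        generated = decoder.generate(
            inputs_embeds=prefix,
            max_new_tokens=30,
            pad_token_id=tokenizer.eos_token_id,
        )

    caption = tokenizer.decode(generated[0], skip_special_tokens=True)
    print(caption)

With an untrained projection the output is meaningless; the point is only the shape of the pipeline: image embedding, projection into the decoder's embedding space, then autoregressive generation.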
Model Stats:

  • Model Size: 151M parameters
  • Tensor Type: F32
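For a quick sanity check of the checkpoint itself, the SafeTensors file can be downloaded and inspected directly. A minimal sketch, assuming the standard model.safetensors filename written by transformers:

    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    # Download the F32 SafeTensors checkpoint (the filename is the
    # transformers default and is an assumption here)
    path = hf_hub_download("vidi-deshp/clip-gpt2-finetuned", "model.safetensors")
    state_dict = load_file(path)

    # Report parameter count and dtype
    total = sum(t.numel() for t in state_dict.values())
    print(f"{total / 1e6:.0f}M params, dtype: {next(iter(state_dict.values())).dtype}")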