BLIP Image Captioning - Arabic (Flickr8k Arabic)

This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for Arabic image captioning using the Flickr8k Arabic dataset. Given an input image, it generates a caption in Arabic describing the image content.

How to Get Started with the Model

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show image
plt.imshow(image)
plt.axis('off')  
plt.title("Input Image")
plt.show()

# Generate an Arabic caption using beam search with repetition controls
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,             # upper bound on caption length, in tokens
        min_length=20,             # avoid overly short captions
        num_beams=5,               # beam search for higher-quality output
        repetition_penalty=1.5,    # penalize repeated tokens
        length_penalty=1.0,        # neutral preference over caption length
        no_repeat_ngram_size=3,    # block repeated 3-grams
        early_stopping=True,       # stop once all beams have finished
    )
    caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
    print(caption)  # prints the Arabic caption

Training Details

Training Data

This model was fine-tuned on the Flickr8k Arabic dataset, which consists of 8,000 images, each with 4 reference Arabic captions. The dataset provides a diverse collection of everyday scenes and activities described in Modern Standard Arabic.

  • Dataset: Flickr8k Arabic
  • Size: 8,000 images with 32,000 captions
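
As an illustration of how image–caption pairs can be prepared for BLIP fine-tuning, here is a minimal preprocessing sketch. The dataset identifier and column names ("image", "caption") are hypothetical placeholders; adjust them to the copy of Flickr8k Arabic you actually use.

from datasets import load_dataset
from transformers import BlipProcessor

# Hypothetical Hub id and column names -- replace with your copy of Flickr8k Arabic
dataset = load_dataset("your-namespace/flickr8k-arabic", split="train")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

def preprocess(example):
    # Encode the image and its Arabic caption into model inputs
    encoding = processor(
        images=example["image"],
        text=example["caption"],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    # Drop the batch dimension added by return_tensors="pt"
    encoding = {k: v.squeeze(0) for k, v in encoding.items()}
    # For caption fine-tuning, the caption token ids also serve as labels
    encoding["labels"] = encoding["input_ids"].clone()
    return encoding

train_dataset = dataset.map(preprocess)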

Training Procedure

The model was fine-tuned from the original BLIP checkpoint on Arabic image–caption pairs, adapting its text decoder to generate Arabic captions.

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • Learning rate: 5e-5
  • Per-device train batch size: 2
  • Per-device eval batch size: 16
  • Gradient accumulation steps: 14
  • Effective train batch size: 28 (2 × 14, single device)
  • Epochs: 5
  • LR scheduler: Cosine with warmup
  • Weight decay: 0.01
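
The exact training script was not published with this card, but a minimal sketch of how these hyperparameters map onto the Hugging Face Trainer is shown below, assuming preprocessed train/eval datasets as in the earlier sketch. The warmup size and output directory are assumptions.

from transformers import BlipForConditionalGeneration, Trainer, TrainingArguments

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

training_args = TrainingArguments(
    output_dir="blip-arabic-flickr8k",   # assumed output directory
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=14,      # effective train batch size of 28
    learning_rate=5e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                    # warmup size is an assumption
    weight_decay=0.01,
    fp16=True,                           # mixed-precision training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,         # preprocessed as sketched above
    eval_dataset=eval_dataset,
)
trainer.train()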

Evaluation

Testing Data & Metrics

Testing Data

The model was evaluated on the Flickr8k Arabic test split, which contains 1,000 images with 4 reference captions each.

Metrics

  • BLEU-1: 65.80
  • BLEU-2: 51.33
  • BLEU-3: 38.72
  • BLEU-4: 28.75
  • METEOR: 46.29
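
Scores of this kind can be computed with the evaluate library, as sketched below. The exact evaluation script and tokenization were not specified in this card, and the standard METEOR implementation targets English, so treat this as an approximation; predictions and references are assumed to hold the generated and reference captions.

import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# predictions: list[str] of generated captions
# references: list[list[str]] with the 4 reference captions per image
for n in range(1, 5):
    result = bleu.compute(predictions=predictions, references=references, max_order=n)
    print(f"BLEU-{n}: {100 * result['bleu']:.2f}")

# Multi-reference METEOR support depends on the evaluate version
result = meteor.compute(predictions=predictions, references=references)
print(f"METEOR: {100 * result['meteor']:.2f}")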

Results

The model performs well on common scenes and activities, generating grammatically correct and contextually appropriate Arabic captions. Performance decreases slightly for unusual scenes or culturally specific contexts not well-represented in the training data.

Bias, Risks, and Limitations

  • The model was trained on Flickr8k Arabic, which may not represent the full diversity of images and linguistic expressions in Arabic-speaking regions
  • May produce stereotypical or culturally insensitive descriptions
  • Performance may vary across different Arabic dialects and regional expressions
  • Limited ability to correctly describe culturally specific items, events, or contexts
  • May struggle with complex scenes or unusual visual elements

Recommendations

  • Users should review generated captions before using them in sensitive contexts
  • Consider post-processing or human review for public-facing applications
  • Test across diverse image types relevant to your use case
  • Be aware that the model may reflect biases present in the training data
  • Consider regional and dialectal differences when evaluating caption quality