BLIP Image Captioning - Arabic (Flickr8k Arabic)

This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for Arabic image captioning using the Flickr8k Arabic dataset. Given an input image, it generates a caption in Arabic describing the image content.

How to Get Started with the Model

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show image
plt.imshow(image)
plt.axis('off')  
plt.title("Input Image")
plt.show()

# Generate an Arabic caption using beam search with repetition controls
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,             # upper bound on caption length, in tokens
        min_length=20,             # avoid overly short captions
        num_beams=5,               # beam search for higher-quality output
        repetition_penalty=1.5,    # penalize repeated tokens
        length_penalty=1.0,        # neutral preference over caption length
        no_repeat_ngram_size=3,    # block repeated 3-grams
        early_stopping=True,       # stop once all beams have finished
    )
    caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
    print(caption)  # prints the Arabic caption

Training Details

Training Data

This model was fine-tuned on the Flickr8k Arabic dataset, which consists of 8,000 images, each with 4 reference Arabic captions. The dataset provides a diverse collection of everyday scenes and activities described in Modern Standard Arabic.

  • Dataset: Flickr8k Arabic
  • Size: 8,000 images with 32,000 captions
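
As an illustration of how image–caption pairs can be prepared for BLIP fine-tuning, here is a minimal preprocessing sketch. The dataset identifier and column names ("image", "caption") are hypothetical placeholders; adjust them to the copy of Flickr8k Arabic you actually use.

from datasets import load_dataset
from transformers import BlipProcessor

# Hypothetical Hub id and column names -- replace with your copy of Flickr8k Arabic
dataset = load_dataset("your-namespace/flickr8k-arabic", split="train")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

def preprocess(example):
    # Encode the image and its Arabic caption into model inputs
    encoding = processor(
        images=example["image"],
        text=example["caption"],
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    # Drop the batch dimension added by return_tensors="pt"
    encoding = {k: v.squeeze(0) for k, v in encoding.items()}
    # For caption fine-tuning, the caption token ids also serve as labels
    encoding["labels"] = encoding["input_ids"].clone()
    return encoding

train_dataset = dataset.map(preprocess)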

Training Procedure

The model was fine-tuned from the original BLIP checkpoint on Arabic image–caption pairs, adapting its text decoder to generate Arabic captions.

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • Learning rate: 5e-5
  • Per-device train batch size: 2
  • Per-device eval batch size: 16
  • Gradient accumulation steps: 14
  • Effective train batch size: 28 (2 × 14, single device)
  • Epochs: 5
  • LR scheduler: Cosine with warmup
  • Weight decay: 0.01
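
The exact training script was not published with this card, but a minimal sketch of how these hyperparameters map onto the Hugging Face Trainer is shown below, assuming preprocessed train/eval datasets as in the earlier sketch. The warmup size and output directory are assumptions.

from transformers import BlipForConditionalGeneration, Trainer, TrainingArguments

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

training_args = TrainingArguments(
    output_dir="blip-arabic-flickr8k",   # assumed output directory
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=14,      # effective train batch size of 28
    learning_rate=5e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                    # warmup size is an assumption
    weight_decay=0.01,
    fp16=True,                           # mixed-precision training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,         # preprocessed as sketched above
    eval_dataset=eval_dataset,
)
trainer.train()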

Evaluation

Testing Data & Metrics

Testing Data

The model was evaluated on the Flickr8k Arabic test split, which contains 1,000 images with 4 reference captions each.

Metrics

  • BLEU-1: 65.80
  • BLEU-2: 51.33
  • BLEU-3: 38.72
  • BLEU-4: 28.75
  • METEOR: 46.29
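
Scores of this kind can be computed with the evaluate library, as sketched below. The exact evaluation script and tokenization were not specified in this card, and the standard METEOR implementation targets English, so treat this as an approximation; predictions and references are assumed to hold the generated and reference captions.

import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# predictions: list[str] of generated captions
# references: list[list[str]] with the 4 reference captions per image
for n in range(1, 5):
    result = bleu.compute(predictions=predictions, references=references, max_order=n)
    print(f"BLEU-{n}: {100 * result['bleu']:.2f}")

# Multi-reference METEOR support depends on the evaluate version
result = meteor.compute(predictions=predictions, references=references)
print(f"METEOR: {100 * result['meteor']:.2f}")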

Results

The model performs well on common scenes and activities, generating grammatically correct and contextually appropriate Arabic captions. Performance decreases slightly for unusual scenes or culturally specific contexts not well-represented in the training data.

Bias, Risks, and Limitations

  • The model was trained on Flickr8k Arabic, which may not represent the full diversity of images and linguistic expressions in Arabic-speaking regions
  • May produce stereotypical or culturally insensitive descriptions
  • Performance may vary across different Arabic dialects and regional expressions
  • Limited ability to correctly describe culturally specific items, events, or contexts
  • May struggle with complex scenes or unusual visual elements

Recommendations

  • Users should review generated captions before using them in sensitive contexts
  • Consider post-processing or human review for public-facing applications
  • Test across diverse image types relevant to your use case
  • Be aware that the model may reflect biases present in the training data
  • Consider regional and dialectal differences when evaluating caption quality