BLIP Image Captioning - Arabic (Flickr8k Arabic)
This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for image captioning in Arabic using the Flickr8k Arabic dataset. It takes an input image and generates a relevant Arabic caption describing the image content.
Model Sources
- Paper: Based on "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation"
How to Get Started with the Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-Arabic-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from a local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show the input image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate an Arabic caption using beam search with repetition controls
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=20,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]

print(caption)  # Prints the generated Arabic caption
```
Training Details
Training Data
This model was fine-tuned on the Flickr8k Arabic dataset, which consists of 8,000 images, each with 4 reference Arabic captions. The dataset provides a diverse collection of everyday scenes and activities described in Modern Standard Arabic.
- Dataset: Flickr8k Arabic
- Size: 8,000 images with 32,000 captions
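As a minimal loading sketch (the `load_dataset` identifier and the `image`/`caption` column names below are assumptions, not taken from this card), the image–caption pairs can be wrapped in a small PyTorch dataset for BLIP fine-tuning:

```python
from datasets import load_dataset
from torch.utils.data import Dataset


class ArabicCaptionDataset(Dataset):
    """Pairs each image with one Arabic reference caption for BLIP fine-tuning."""

    def __init__(self, hf_split, processor, max_length=75):
        self.split = hf_split
        self.processor = processor  # the BlipProcessor loaded in the usage example above
        self.max_length = max_length

    def __len__(self):
        return len(self.split)

    def __getitem__(self, idx):
        item = self.split[idx]
        # "image" and "caption" are assumed column names for the Flickr8k Arabic data
        encoding = self.processor(
            images=item["image"].convert("RGB"),
            text=item["caption"],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        batch = {k: v.squeeze(0) for k, v in encoding.items()}
        batch["labels"] = batch["input_ids"].clone()  # caption tokens double as LM targets
        return batch


# Hypothetical dataset identifier -- replace with the Flickr8k Arabic source you use
raw_train = load_dataset("flickr8k-arabic", split="train")
train_dataset = ArabicCaptionDataset(raw_train, processor)
```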
Training Procedure
The model was fine-tuned from the original BLIP checkpoint by adapting its language generation capabilities to Arabic text; a sketch of an equivalent training setup follows the hyperparameter list below.
Training Hyperparameters
- Training regime: fp16 mixed precision
- Optimizer: AdamW
- Learning rate: 5e-5
- per_device_train_batch_size: 2
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 14
- Total training batch size: 28
- Epochs: 5
- LR scheduler: Cosine with warmup
- Weight decay: 0.01
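A minimal sketch of a matching fine-tuning setup with the Hugging Face `Trainer`, using the hyperparameters listed above (AdamW is the `Trainer` default optimizer); the output path, warmup ratio, and dataset objects are placeholders/assumptions rather than values from this card:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="blip-arabic-flickr8k",  # placeholder output path
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=14,     # total effective batch size of 28
    num_train_epochs=5,
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # warmup amount is an assumption
    fp16=True,
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,                  # BlipForConditionalGeneration loaded above
    args=training_args,
    train_dataset=train_dataset,  # e.g. ArabicCaptionDataset instances
    eval_dataset=eval_dataset,
)
trainer.train()
```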
Evaluation
Testing Data, & Metrics
Testing Data
The model was evaluated on the Flickr8k Arabic test split, which contains 1,000 images with 4 reference captions each.
Metrics
- BLEU-1: 65.80
- BLEU-2: 51.33
- BLEU-3: 38.72
- BLEU-4: 28.75
- METEOR: 46.29
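Scores of this kind can be computed along the following lines with the Hugging Face `evaluate` library; the example captions below are illustrative only, and the exact tokenization behind the reported numbers is not specified in this card:

```python
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# Assumed inputs: one generated caption per test image, with its 4 Arabic references
predictions = ["رجل يركب دراجة في الشارع"]  # illustrative model output
references = [[
    "رجل يقود دراجة هوائية في الطريق",
    "شخص يركب دراجة",
    "رجل على دراجة في شارع",
    "راكب دراجة يسير في الطريق",
]]

# Cumulative BLEU-1 .. BLEU-4
for n in range(1, 5):
    score = bleu.compute(predictions=predictions, references=references, max_order=n)
    print(f"BLEU-{n}: {score['bleu'] * 100:.2f}")

meteor_score = meteor.compute(predictions=predictions, references=references)
print(f"METEOR: {meteor_score['meteor'] * 100:.2f}")
```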
Results
The model performs well on common scenes and activities, generating grammatically correct and contextually appropriate Arabic captions. Performance decreases slightly for unusual scenes or culturally specific contexts not well-represented in the training data.
Bias, Risks, and Limitations
- The model was trained on Flickr8k Arabic, which may not represent the full diversity of images and linguistic expressions in Arabic-speaking regions
- May produce stereotypical or culturally insensitive descriptions
- Performance may vary across different Arabic dialects and regional expressions
- Limited ability to correctly describe culturally specific items, events, or contexts
- May struggle with complex scenes or unusual visual elements
Recommendations
- Users should review generated captions before using them in sensitive contexts
- Consider post-processing or human review for public-facing applications
- Test across diverse image types relevant to your use case
- Be aware that the model may reflect biases present in the training data
- Consider regional and dialectal differences when evaluating caption quality