Behind Maya: Building a Multilingual Vision Language Model
Abstract
Large Vision-Language Models (VLMs) have developed rapidly in recent years and show impressive results on academic benchmarks, primarily in widely spoken languages, but they underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
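The dataset contribution is described as being based on the LLaVA pretraining dataset and covering eight languages, which suggests translating the LLaVA image-caption records into additional languages. The sketch below is a minimal illustration of that shape, not the authors' pipeline: it fans each LLaVA-format record out once per language with a pluggable translation step. The language list, file names, and the `translate` stub are assumptions for illustration only; see the paper and repository for the actual dataset construction.

```python
# A minimal sketch (not the authors' pipeline): fan a LLaVA-style pretraining
# record out into several languages with a pluggable translation step.
import json

# Illustrative eight-language set; consult the paper/repo for the actual list.
LANGUAGES = ["en", "zh", "fr", "es", "ru", "hi", "ja", "ar"]


def translate(text: str, target_lang: str) -> str:
    """Stand-in translator: swap in any MT system or multilingual LLM.
    The <image> placeholder token must survive translation untouched."""
    if target_lang == "en":
        return text
    # Placeholder behaviour so the sketch runs end to end.
    return f"[{target_lang}] {text}"


def expand_record(record: dict) -> list[dict]:
    """Emit one copy of the record per target language, translating only
    the conversation text and keeping the image reference unchanged."""
    expanded = []
    for lang in LANGUAGES:
        expanded.append({
            "id": f"{record['id']}_{lang}",
            "image": record["image"],
            "conversations": [
                {"from": turn["from"], "value": translate(turn["value"], lang)}
                for turn in record["conversations"]
            ],
        })
    return expanded


if __name__ == "__main__":
    # File names are illustrative; the input follows the LLaVA pretraining JSON schema.
    with open("llava_pretrain.json", encoding="utf-8") as f:
        records = json.load(f)
    multilingual = [copy for rec in records for copy in expand_record(rec)]
    with open("maya_pretrain_multilingual.json", "w", encoding="utf-8") as f:
        json.dump(multilingual, f, ensure_ascii=False, indent=2)
```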
Community
Accepted at the VLMs4All Workshop at CVPR 2025
The following similar papers were recommended by the Semantic Scholar API:
- Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization (2025)
- Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders (2025)
- Aya Vision: Advancing the Frontier of Multilingual Multimodality (2025)
- Is LLM the Silver Bullet to Low-Resource Languages Machine Translation? (2025)
- Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation (2025)
- Lugha-Llama: Adapting Large Language Models for African Languages (2025)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)