Behind Maya: Building a Multilingual Vision Language Model
Abstract
Large Vision-Language Models (VLMs) have developed rapidly in recent years and show impressive results on academic benchmarks, primarily in widely spoken languages, but they underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
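The dataset contribution is described as being based on the LLaVA pretraining dataset and covering eight languages, which suggests translating the LLaVA image-caption records into additional languages. The sketch below is a minimal illustration of that shape, not the authors' pipeline: it fans each LLaVA-format record out once per language with a pluggable translation step. The language list, file names, and the `translate` stub are assumptions for illustration only; see the paper and repository for the actual dataset construction.

```python
# A minimal sketch (not the authors' pipeline): fan a LLaVA-style pretraining
# record out into several languages with a pluggable translation step.
import json

# Illustrative eight-language set; consult the paper/repo for the actual list.
LANGUAGES = ["en", "zh", "fr", "es", "ru", "hi", "ja", "ar"]


def translate(text: str, target_lang: str) -> str:
    """Stand-in translator: swap in any MT system or multilingual LLM.
    The <image> placeholder token must survive translation untouched."""
    if target_lang == "en":
        return text
    # Placeholder behaviour so the sketch runs end to end.
    return f"[{target_lang}] {text}"


def expand_record(record: dict) -> list[dict]:
    """Emit one copy of the record per target language, translating only
    the conversation text and keeping the image reference unchanged."""
    expanded = []
    for lang in LANGUAGES:
        expanded.append({
            "id": f"{record['id']}_{lang}",
            "image": record["image"],
            "conversations": [
                {"from": turn["from"], "value": translate(turn["value"], lang)}
                for turn in record["conversations"]
            ],
        })
    return expanded


if __name__ == "__main__":
    # File names are illustrative; the input follows the LLaVA pretraining JSON schema.
    with open("llava_pretrain.json", encoding="utf-8") as f:
        records = json.load(f)
    multilingual = [copy for rec in records for copy in expand_record(rec)]
    with open("maya_pretrain_multilingual.json", "w", encoding="utf-8") as f:
        json.dump(multilingual, f, ensure_ascii=False, indent=2)
```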
Community
Accepted at the VLMs4All Workshop at CVPR 2025
The following similar papers were recommended by the Semantic Scholar API:
- Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization (2025)
- Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders (2025)
- Aya Vision: Advancing the Frontier of Multilingual Multimodality (2025)
- Is LLM the Silver Bullet to Low-Resource Languages Machine Translation? (2025)
- Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation (2025)
- Lugha-Llama: Adapting Large Language Models for African Languages (2025)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)