Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Abstract
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
Community
I noticed you included Bloom Library in your comparison of datasets. Please add the appropriate citation for this work. If you used the prepared datasets, cite https://arxiv.org/abs/2210.14712 . If you used the website directly, cite bloomlibrary.org. Thanks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Diffusion Models Through a Global Lens: Are They Culturally Inclusive? (2025)
- SEA-HELM: Southeast Asian Holistic Evaluation of Language Models (2025)
- Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems (2025)
- PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian (2025)
- Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models (2025)
- Scaling Pre-training to One Hundred Billion Data for Vision Language Models (2025)
- RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation (2025)