MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Abstract
Two new benchmarks, MangaOCR and MangaVQA, and a specialized model, MangaLMM, are introduced to evaluate and advance large multimodal models in understanding manga narratives.
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
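Since MangaLMM is described as a finetune of the open-source Qwen2.5-VL, it should in principle be queryable through the standard Hugging Face `transformers` Qwen2.5-VL interface. The snippet below is a minimal VQA inference sketch under that assumption; the checkpoint id shown is the public base model (a released MangaLMM repo id would be swapped in once known), and the image path and question are illustrative placeholders.

```python
# Minimal sketch: asking a VQA-style question about a manga page
# with a Qwen2.5-VL-compatible checkpoint via Hugging Face transformers.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Base model id; a released MangaLMM checkpoint (hypothetical here) would replace it.
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Placeholder manga page image and question.
image = Image.open("manga_page.png")
question = "Who is speaking in the last panel, and what are they saying?"

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# Build the chat prompt, tokenize text + image, and generate an answer.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated answer.
answer_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```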
Community
Our vision for AGI is unlike the mainstream view in the community.
Yes, we aim to build a super-human AI manga assistant.
As the first step, our team developed MangaLMM, an LMM that can solve both MangaOCR and our newly created MangaVQA tasks!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding (2025)
- LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs (2025)
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding (2025)
- HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation (2025)
- Emerging Properties in Unified Multimodal Pretraining (2025)
- MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation (2025)
- TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs (2025)