Papers
arxiv:2503.23730

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Published on Mar 31
Β· Submitted by lastdefiance20 on Apr 1

Abstract

The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English-language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA.

Community

Paper author and submitter

πŸ” Key Features of KOFFVQA: A Korean Free-form VQA Benchmark

πŸ“Š KOFFVQA enables open-ended evaluation, allowing models to generate free-form answers rather than choosing from predefined options.

πŸ‡°πŸ‡· It focuses exclusively on the Korean language, addressing a critical gap in VLM benchmarks and recognizing that model performance can vary significantly by language.

πŸ–ΌοΈ The benchmark includes 275 carefully curated image-question pairs, each accompanied by grading criteria that evaluate 10 diverse aspects of VLM performance.

βš–οΈ Evaluation is based on a partial scoring approach using human-authored grading criteria, which enhances consistency and reduces subjectivity. This also allows for reliable evaluation using small open-source judge models.

πŸ§ͺ Thanks to this design, KOFFVQA enables the use of LLMs as judges without visual input. While VLM-based judges often hallucinate visual details and misgrade responses, LLM-based judges focus solely on the criteria and align more closely with human judgment.

πŸ’» The evaluation code and dataset are open-source, supporting reproducibility and encouraging further research.

KOFFVQA is a Korean-language, fine-grained, and reliable benchmark for evaluating VLMs, and it highlights the effectiveness of LLM-based evaluation in VQA as a novel and practical alternative.
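The partial-scoring idea described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the criteria, point values, and the `keyword_judge` stub are hypothetical, and in the real setup the judge would be an LLM prompted with each grading rule and the model's response.

```python
# Sketch of rubric-based partial scoring: each grading criterion carries a
# point value, and the final score is the sum of points for every criterion
# the judge deems satisfied.

def grade_response(response: str, criteria: list[dict], judge) -> int:
    """Return the total points for all criteria the judge marks as met."""
    score = 0
    for criterion in criteria:
        if judge(response, criterion["rule"]):
            score += criterion["points"]
    return score

# Placeholder for the LLM judge: a real setup would prompt a (small,
# open-source) language model with the rule and the response; a simple
# substring check stands in for that call here.
def keyword_judge(response: str, rule: str) -> bool:
    return rule.lower() in response.lower()

# Hypothetical criteria for an imagined image of a street scene.
criteria = [
    {"rule": "red traffic light", "points": 5},
    {"rule": "two pedestrians", "points": 3},
    {"rule": "crosswalk", "points": 2},
]

print(grade_response("A red traffic light above a crosswalk.",
                     criteria, keyword_judge))  # 7 of 10 points
```

Because the rules are fixed in advance, two judges grading the same response should agree far more often than free-form LLM grading, which is the reliability property the benchmark emphasizes.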


