Introducing Video-MMLU, a new benchmark for evaluating large multimodal models on classroom-style lectures in math, physics, and chemistry!
Compared to previous benchmarks for video LMMs, Video-MMLU demands stronger reasoning capabilities and broader world knowledge.
Each video comes with two tasks:
Take Notes - detailed captioning of multi-discipline lectures
Do Quiz - open-ended QA to test reasoning over visuals & proofs
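To make the task setup concrete, here's a minimal sketch of what an evaluation loop over the two tasks could look like. The data fields (`video_path`, `quiz`, `reference`), the prompts, and the judge callable are illustrative assumptions, not the official Video-MMLU harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QuizItem:
    question: str
    reference: str  # reference answer used by the judge (assumed field)

@dataclass
class Lecture:
    video_path: str
    quiz: list[QuizItem]

def evaluate(lectures: list[Lecture],
             generate: Callable[[str, str], str],   # (video_path, prompt) -> model output
             grade: Callable[[str, str], float]) -> dict:
    """Run both tasks: detailed note-taking, then open-ended quiz QA."""
    notes, scores = {}, []
    for lec in lectures:
        # Task 1: Take Notes - detailed captioning of the lecture video
        notes[lec.video_path] = generate(lec.video_path,
                                         "Take detailed notes on this lecture.")
        # Task 2: Do Quiz - open-ended QA over visuals and proofs,
        # scored by a caller-supplied judge (e.g. LLM-as-judge)
        for q in lec.quiz:
            pred = generate(lec.video_path, q.question)
            scores.append(grade(q.reference, pred))
    return {"notes": notes, "quiz_accuracy": sum(scores) / max(len(scores), 1)}
```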
We evaluated 90+ models, including vision-blind baselines, open-source models, and proprietary ones. We find that existing models generally perform poorly, with accuracy ranging from only 10% to 50%. We also explore how the number of visual tokens and the choice of base LLM influence performance, offering insights into the interplay between multimodal perception and reasoning in lecture comprehension.