Multimodal Large Language Models for Detailed Localized Image and Video Captioning
Generate descriptions from masked images