Release the UMRB
Great work!
@zyznull
It would be great if you could release the UMRB benchmark. Thanks.
Thanks for your attention. We are in the process of integrating the UMRB benchmark into MTEB. We will notify you as soon as it is ready.
For now, all tasks already exist in the mieb branch: https://github.com/embeddings-benchmark/mteb/tree/mieb
For the task name list, refer to https://github.com/izhx/mteb/blob/mieb/mteb/benchmarks/benchmarks.py#L864
However, the ViDoRe subsets have some inconsistencies, and we are working on fixes: https://github.com/embeddings-benchmark/mteb/pull/1607
MIEB has now been merged into the main branch, so you can evaluate directly with the `mteb` tool. The task list:
```python
umrb_tasks = [
    # Single-modal
    ## Text-to-text
    'ArguAna', 'ClimateFEVER',
    'CQADupstackAndroidRetrieval',
    'CQADupstackEnglishRetrieval',
    'CQADupstackGamingRetrieval',
    'CQADupstackGisRetrieval',
    'CQADupstackMathematicaRetrieval',
    'CQADupstackPhysicsRetrieval',
    'CQADupstackProgrammersRetrieval',
    'CQADupstackStatsRetrieval',
    'CQADupstackTexRetrieval',
    'CQADupstackUnixRetrieval',
    'CQADupstackWebmastersRetrieval',
    'CQADupstackWordpressRetrieval',
    'DBPedia', 'FEVER', 'FiQA2018', 'HotpotQA', 'MSMARCO', 'NFCorpus', 'NQ',
    'QuoraRetrieval', 'SCIDOCS', 'SciFact', 'Touche2020', 'TRECCOVID',
    'WebQAT2TRetrieval',
    ## Image-to-image
    'NIGHTSI2IRetrieval',
    # Cross-modal
    ## Text-to-image
    'VisualNewsT2IRetrieval', 'Fashion200kT2IRetrieval', 'MSCOCOT2IRetrieval', 'Flickr30kT2IRetrieval',
    ## Text-to-visual document
    'VidoreArxivQARetrieval', 'VidoreDocVQARetrieval', 'VidoreInfoVQARetrieval',
    'VidoreTabfquadRetrieval', 'VidoreTatdqaRetrieval', 'VidoreShiftProjectRetrieval',
    'VidoreSyntheticDocQAAIRetrieval', 'VidoreSyntheticDocQAEnergyRetrieval',
    'VidoreSyntheticDocQAGovernmentReportsRetrieval', 'VidoreSyntheticDocQAHealthcareIndustryRetrieval',
    ## Image-to-text
    'VisualNewsI2TRetrieval', 'Fashion200kI2TRetrieval', 'MSCOCOI2TRetrieval', 'Flickr30kI2TRetrieval',
    # Fused-modal
    ## Text-to-(image, text)
    'WebQAT2ITRetrieval', 'EDIST2ITRetrieval',
    ## (Image, text)-to-text
    'OVENIT2TRetrieval', 'InfoSeekIT2TRetrieval',
    'ReMuQIT2TRetrieval', 'OKVQAIT2TRetrieval', 'LLaVAIT2TRetrieval',
    ## (Image, text)-to-image
    'FashionIQIT2IRetrieval', 'CIRRIT2IRetrieval',
    ## (Text, image)-to-(text, image)
    'OVENIT2ITRetrieval', 'InfoSeekIT2ITRetrieval', 'EncyclopediaVQAIT2ITRetrieval',
]
```
## Evaluation script
```python
import mteb

# Model to evaluate (GME 2B checkpoint on Hugging Face).
model_name = "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"
model = mteb.get_model(model_name)

# Load the UMRB task objects and run the evaluation.
tasks = mteb.get_tasks(tasks=umrb_tasks)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```
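If it helps, here is a minimal sketch for printing the per-task scores returned by `evaluation.run`. It assumes a recent mteb version where each result object exposes `task_name` and a `scores` dict keyed by split; the exact attribute names may differ across versions.

```python
# Minimal sketch: summarize the metrics mteb produced for each task.
# Assumes each item in `results` has `task_name` and `scores` (split -> list of
# score dicts); these attribute names may vary between mteb versions.
for task_result in results:
    print(task_result.task_name)
    for split, score_dicts in task_result.scores.items():
        for scores in score_dicts:
            # Show every numeric metric, e.g. ndcg_at_10, recall_at_1, cv_recall_at_1, ...
            reported = {k: v for k, v in scores.items() if isinstance(v, (int, float))}
            print(f"  {split}: {reported}")
```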
Thank you so much for your wonderful work and for providing such a great codebase!
I have a question about the evaluation metrics used in your paper. In Table 7, you describe the metrics for all tasks. However, when I evaluate the models using the `mteb` tool, I get different results compared to those in Table 7. I also noticed another metric called `cv_recall`.
Could you clarify whether the `recall` in Table 7 actually refers to `cv_recall`? Any insight you can provide would be greatly appreciated.
Thank you again for your time and help!
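For reference, one way to see which recall variants the tool actually reports is to list the metric keys in the saved result JSON files, as sketched below. The `results/` layout mirrors the `output_folder` used in the script above; the exact nesting and file structure are assumptions and may differ in your setup.

```python
# Sketch: list the recall-style metric keys written to the result files,
# to see whether the tool reports recall_at_k, cv_recall_at_k, or both.
# The results/ path and the "scores" key layout are assumptions based on the
# output_folder used above; adjust to your local setup.
import glob
import json

for path in glob.glob("results/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    scores = data.get("scores", {})
    keys = {
        key
        for split_scores in scores.values()
        for entry in split_scores
        for key in entry
        if "recall" in key
    }
    if keys:
        print(path, sorted(keys))
```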