Release the UMRB
Great work!
@zyznull
It would be great if you could release the UMRB benchmark. Thanks.
Thanks for your attention. We are in the process of integrating the UMRB benchmark into MTEB. We will notify you as soon as it is ready.
For now, all tasks already exist in the mieb branch: https://github.com/embeddings-benchmark/mteb/tree/mieb
For the task name list, refer to https://github.com/izhx/mteb/blob/mieb/mteb/benchmarks/benchmarks.py#L864
However, the ViDoRe subsets have some inconsistencies, and we are working on fixes: https://github.com/embeddings-benchmark/mteb/pull/1607
MIEB has now been merged into the main branch, so you can evaluate directly with the `mteb` tool. The task list:
```python
umrb_tasks = [
    # Single-modal
    ## Text-to-text
    'ArguAna', 'ClimateFEVER',
    'CQADupstackAndroidRetrieval',
    'CQADupstackEnglishRetrieval',
    'CQADupstackGamingRetrieval',
    'CQADupstackGisRetrieval',
    'CQADupstackMathematicaRetrieval',
    'CQADupstackPhysicsRetrieval',
    'CQADupstackProgrammersRetrieval',
    'CQADupstackStatsRetrieval',
    'CQADupstackTexRetrieval',
    'CQADupstackUnixRetrieval',
    'CQADupstackWebmastersRetrieval',
    'CQADupstackWordpressRetrieval',
    'DBPedia', 'FEVER', 'FiQA2018', 'HotpotQA', 'MSMARCO', 'NFCorpus', 'NQ',
    'QuoraRetrieval', 'SCIDOCS', 'SciFact', 'Touche2020', 'TRECCOVID',
    'WebQAT2TRetrieval',
    ## Image-to-image
    'NIGHTSI2IRetrieval',
    # Cross-modal
    ## Text-to-image
    'VisualNewsT2IRetrieval', 'Fashion200kT2IRetrieval', 'MSCOCOT2IRetrieval', 'Flickr30kT2IRetrieval',
    ## Text-to-visual document
    'VidoreArxivQARetrieval', 'VidoreDocVQARetrieval', 'VidoreInfoVQARetrieval',
    'VidoreTabfquadRetrieval', 'VidoreTatdqaRetrieval', 'VidoreShiftProjectRetrieval',
    'VidoreSyntheticDocQAAIRetrieval', 'VidoreSyntheticDocQAEnergyRetrieval',
    'VidoreSyntheticDocQAGovernmentReportsRetrieval', 'VidoreSyntheticDocQAHealthcareIndustryRetrieval',
    ## Image-to-text
    'VisualNewsI2TRetrieval', 'Fashion200kI2TRetrieval', 'MSCOCOI2TRetrieval', 'Flickr30kI2TRetrieval',
    # Fused-modal
    ## Text-to-(image, text)
    'WebQAT2ITRetrieval', 'EDIST2ITRetrieval',
    ## (Image, text)-to-text
    'OVENIT2TRetrieval', 'InfoSeekIT2TRetrieval',
    'ReMuQIT2TRetrieval', 'OKVQAIT2TRetrieval', 'LLaVAIT2TRetrieval',
    ## (Image, text)-to-image
    'FashionIQIT2IRetrieval', 'CIRRIT2IRetrieval',
    ## (Text, image)-to-(text, image)
    'OVENIT2ITRetrieval', 'InfoSeekIT2ITRetrieval', 'EncyclopediaVQAIT2ITRetrieval',
]
```
## Evaluation script
```python
import mteb

# Model to evaluate (GME 2B checkpoint on Hugging Face).
model_name = "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"
model = mteb.get_model(model_name)

# Load the UMRB task objects and run the evaluation.
tasks = mteb.get_tasks(tasks=umrb_tasks)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```
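If it helps, here is a minimal sketch for printing the per-task scores returned by `evaluation.run`. It assumes a recent mteb version where each result object exposes `task_name` and a `scores` dict keyed by split; the exact attribute names may differ across versions.

```python
# Minimal sketch: summarize the metrics mteb produced for each task.
# Assumes each item in `results` has `task_name` and `scores` (split -> list of
# score dicts); these attribute names may vary between mteb versions.
for task_result in results:
    print(task_result.task_name)
    for split, score_dicts in task_result.scores.items():
        for scores in score_dicts:
            # Show every numeric metric, e.g. ndcg_at_10, recall_at_1, cv_recall_at_1, ...
            reported = {k: v for k, v in scores.items() if isinstance(v, (int, float))}
            print(f"  {split}: {reported}")
```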
Thank you so much for your wonderful work and for providing such a great codebase!
I have a question about the evaluation metrics used in your paper. In Table 7, you describe the metrics for all tasks. However, when I evaluate the models using the `mteb` tool, I get different results compared to those in Table 7. I also noticed another metric called `cv_recall`.
Could you clarify whether the `recall` in Table 7 actually refers to `cv_recall`? Any insight you can provide would be greatly appreciated.
Thank you again for your time and help!
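For reference, one way to see which recall variants the tool actually reports is to list the metric keys in the saved result JSON files, as sketched below. The `results/` layout mirrors the `output_folder` used in the script above; the exact nesting and file structure are assumptions and may differ in your setup.

```python
# Sketch: list the recall-style metric keys written to the result files,
# to see whether the tool reports recall_at_k, cv_recall_at_k, or both.
# The results/ path and the "scores" key layout are assumptions based on the
# output_folder used above; adjust to your local setup.
import glob
import json

for path in glob.glob("results/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    scores = data.get("scores", {})
    keys = {
        key
        for split_scores in scores.values()
        for entry in split_scores
        for key in entry
        if "recall" in key
    }
    if keys:
        print(path, sorted(keys))
```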