Training data

#2 opened by kardosdrur

Hi @Gameselo
I'm Márton, maintainer of MTEB. I'm writing to you because we have been collecting metadata on models to give our users a realistic estimate of how indicative models' MTEB scores are of their generalized performance (models trained on MTEB data obviously score better).
We are still missing annotations for your model, and I couldn't find information about what it was trained on.
Could you please tell us which MTEB datasets, in particular those in the multilingual benchmark, were or were not used to train this model?
Thanks in advance, Márton

Hi Márton, I hope you're doing well.
Sorry for the late reply; here is the information.
My model was trained on my whole dataset (https://huggingface.co/datasets/Gameselo/monolingual-wideNLI); you'll find all the specs on its page. I used all the data that wasn't in the MTEB evaluation datasets.
I also used the MTEB training splits to train my model, but not the other splits (the dev splits were used for validation and the test splits for evaluation).
Hope this helps.
Léo

Hi again, no worries, and thanks for getting back to me.
When you say you trained on the MTEB training sets, do you mean all of them or just the ones that are in your dataset (monolingual-wideNLI)?

All of them!

Thanks! I'm adding the annotations!

For clarity, I used the training splits from the May-June 2024 version of the MTEB leaderboard. Datasets added to the MTEB leaderboard since then weren't used at all.
