---
pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
- setfit
- e5
license: mit
datasets:
- KnutJaegersberg/wikipedia_categories
- KnutJaegersberg/wikipedia_categories_labels
---

This English model predicts the top 2 levels of the Wikipedia category hierarchy (roughly 1,100 labels). It was trained in a few-shot setting on the concatenated headlines of the articles in lower-level categories (i.e. 8 subcategories, each with its headline concatenation, per level 2 category).

Accuracy on the test split is 73% for the higher category level (37 labels) and 60% for level 2. Note that these numbers only indicate that training worked. Performance will differ in production settings, which is why this classifier is meant for corpus exploration.

Use the `wikipedia_categories_labels` dataset as the key to the predicted labels.

```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("KnutJaegersberg/wikipedia_categories_setfit")

# Run inference
preds = model([
    "Rachel Dolezal Faces Felony Charges For Welfare Fraud",
    "Elon Musk just got lucky",
    "The hype on AI is different from the hype on other tech topics",
])
```
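
For corpus exploration, it can help to translate the raw predictions into category names and count how often each one occurs. The following is a minimal sketch, not part of the model's API: `summarize_predictions` and `example_key` are hypothetical names, the key is assumed to be a plain dict from integer label IDs to category names, and the predictions are assumed to be integer-like. How such a dict is built from the `wikipedia_categories_labels` dataset depends on that dataset's columns, which are not described here.

```python
from collections import Counter


def summarize_predictions(preds, label_key):
    """Map predicted label IDs to names and count them, most frequent first.

    preds:     iterable of integer-like label IDs (assumption).
    label_key: dict from label ID to category name (hypothetical format).
    """
    names = [label_key.get(int(p), f"unknown({int(p)})") for p in preds]
    return Counter(names).most_common()


# Hypothetical key and predictions, for illustration only.
example_key = {0: "Category A", 1: "Category B"}
example_preds = [1, 0, 1, 7]

summary = summarize_predictions(example_preds, example_key)
for name, count in summary:
    print(f"{name}: {count}")
```

Label IDs missing from the key are reported as `unknown(<id>)` rather than raising an error, so a partial key still gives a usable overview of the corpus.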