Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru Paper • 2503.07587 • Published 6 days ago • 10
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published 5 days ago • 90
JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments Paper • 2503.08379 • Published 5 days ago • 2
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published 9 days ago • 72
Article Introducing EuroBERT: A High-Performance Multilingual Encoder Model By EuroBERT and 3 others • 6 days ago • 121
Article HuggingFace, IISc partner to supercharge model building on India's diverse languages 17 days ago • 14
rank1 Collection rank1 is the first test-time compute reasoning model in IR • 15 items • Updated 17 days ago • 3
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16 kHz sampling rate. • 7 items • Updated 6 days ago • 4
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models Paper • 2502.15964 • Published 23 days ago • 1
"Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts Paper • 2502.16839 • Published 20 days ago • 1
Slam Collection All resources for SpeechLMs from "Slamming: Training a Speech Language Model on One GPU in a Day". We provide the tokeniser, LM, and datasets. • 6 items • Updated 19 days ago • 13
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models Paper • 2502.17387 • Published 20 days ago • 5
KB-Whisper Collection Whisper models trained on over 50,000 hours of Swedish speech data. • 5 items • Updated 30 days ago • 5