FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
Abstract
FAMA, an open-science family of speech foundation models, provides transparency and competitive performance by leveraging open-source training data and code.
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major challenges for reproducibility and fair evaluation. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open-science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
Community
New tech report out! Meet FAMA, a new open-science speech foundation model family for both Automatic Speech Recognition (ASR) and Speech Translation (ST) in English and Italian.
The models are live and ready to try here on Hugging Face.
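For anyone who wants to try the models programmatically, here is a minimal sketch using the transformers ASR pipeline. The checkpoint name FBK-MT/fama-small is an assumption for illustration; check the FAMA collection on the Hub for the exact model IDs and any usage notes on the model cards.

```python
# Minimal sketch: transcribing an audio file with a FAMA checkpoint through the
# Hugging Face transformers ASR pipeline.
# NOTE: the model ID "FBK-MT/fama-small" is an assumption; verify the actual
# checkpoint names on the Hub before running.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="FBK-MT/fama-small",  # assumed model ID
    trust_remote_code=True,     # custom architectures may require this
)

# Transcribe a local English or Italian audio file (16 kHz mono WAV is a safe choice).
result = asr("sample.wav")
print(result["text"])
```

The same pipeline interface should also accept raw audio arrays or URLs, so it can be dropped into existing transcription scripts with minimal changes.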
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- Granary: Speech Recognition and Translation Dataset in 25 European Languages (2025)
- Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities (2025)
- From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition (2025)
- Speechless: Speech Instruction Training Without Speech for Low Resource Languages (2025)
- Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget (2025)
- GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task (2025)
- LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors (2025)