lbourdois committed on

Commit a0dc294 · verified · 1 Parent(s): 2a83e72

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +98 -86
README.md CHANGED
@@ -1,87 +1,99 @@
- ---
- library_name: transformers
- license: mit
- datasets:
- - slprl/sTinyStories
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-7B
- pipeline_tag: audio-to-audio
- ---
-
- # Scaling Analysis of Interleaved Speech-Text Language Models
-
- The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).
-
- # Paper abstract
- Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data
- than text models, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from
- pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question: _do interleaved SLMs scale more efficiently than textless SLMs?_
- In this paper we answer with a resounding _yes!_ We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the
- scaling trends. We find that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the
- scaling dynamics differ significantly from those of textless SLMs, suggesting one should allocate notably more of the compute budget to
- increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential.
- Results suggest that our scaled-up model achieves performance comparable to leading models on speech semantic metrics while using less
- compute and data than other approaches.
-
- # Model Card
- This is a Speech Language Model (SLM) trained to generate speech or text continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.
-
-
- ## Model Details
-
- ### Model Description
- This Speech Language Model, introduced in ["Scaling Analysis of Interleaved Speech-Text Language Models"](https://arxiv.org/abs/2504.02398), focuses on scaling analysis of interleaved speech-text SLMs.
- It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from
- the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
-
- - **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- - **Model type:** SpeechLM
- - **License:** MIT
- - **Finetuned from model:** [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
-
- ### Model Sources
-
- - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- - **Paper:** [https://arxiv.org/abs/2504.02398](https://arxiv.org/abs/2504.02398)
- - **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)
-
- ## Uses
- This base SpeechLM can be used to generate continuations for speech segments, or cross-modally, e.g. to generate a text continuation for a speech prompt, or as a base for further tuning. See the _SlamKit_
- [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for generation examples.
-
- ### Out-of-Scope Use
- This model was trained on diverse speech datasets; as such, its outputs should not be treated as factual in any way.
-
-
- ## How to Get Started with the Model
- We refer users to the official repository for full usage instructions: [GitHub](https://github.com/slp-rl/slamkit).
-
-
- ## Training Details
- We highly encourage users to read the full [paper](https://arxiv.org/abs/2504.02398) for training details.
-
-
- ### Compute Infrastructure
- #### Hardware
- This model was trained using 8 NVIDIA H100 GPUs.
-
- #### Software
- The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
- easy and efficient training of Speech Language Models.
-
- ## Citation
-
- **BibTeX:**
- ```
- @misc{maimon2025scaling,
- title={Scaling Analysis of Interleaved Speech-Text Language Models},
- author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
- year={2025},
- eprint={2504.02398},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2504.02398},
- }
  ```
 
+ ---
+ library_name: transformers
+ license: mit
+ datasets:
+ - slprl/sTinyStories
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-7B
+ pipeline_tag: audio-to-audio
+ ---
+
+ # Scaling Analysis of Interleaved Speech-Text Language Models
+
+ The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).
+
+ # Paper abstract
+ Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data
+ than text models, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from
+ pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question: _do interleaved SLMs scale more efficiently than textless SLMs?_
+ In this paper we answer with a resounding _yes!_ We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the
+ scaling trends. We find that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the
+ scaling dynamics differ significantly from those of textless SLMs, suggesting one should allocate notably more of the compute budget to
+ increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential.
+ Results suggest that our scaled-up model achieves performance comparable to leading models on speech semantic metrics while using less
+ compute and data than other approaches.
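The speech-text interleaving mentioned in the abstract can be pictured as flattening alternating text and speech spans into one token stream that a TextLM models jointly. A minimal sketch follows; the `<unit_N>` string format and the segment structure are illustrative assumptions, not SlamKit's actual representation.

```python
# Minimal sketch of speech-text interleaving (illustrative only).
# Speech is represented as discrete unit IDs (e.g. from a HuBERT-style
# tokenizer); the "<unit_N>" format here is hypothetical, not the exact
# token strings used by SlamKit.

def interleave(segments):
    """Flatten alternating (modality, content) segments into one stream.

    Text segments contribute their words; speech segments contribute
    discrete-unit placeholder strings, so both modalities share one
    sequence that a single language model can be trained on.
    """
    stream = []
    for modality, content in segments:
        if modality == "text":
            stream.extend(content.split())
        elif modality == "speech":
            stream.extend(f"<unit_{u}>" for u in content)  # unit IDs
        else:
            raise ValueError(f"unknown modality: {modality}")
    return stream

example = [
    ("text", "the cat sat"),
    ("speech", [17, 42, 42, 3]),  # e.g. four discrete speech units
    ("text", "on the mat"),
]
tokens = interleave(example)
print(tokens)
```

A real pipeline would map these strings to IDs in the extended vocabulary; the point is only that text and speech end up in a single ordered sequence.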
+
+ # Model Card
+ This is a Speech Language Model (SLM) trained to generate speech or text continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.
+
+
+ ## Model Details
+
+ ### Model Description
+ This Speech Language Model, introduced in ["Scaling Analysis of Interleaved Speech-Text Language Models"](https://arxiv.org/abs/2504.02398), focuses on scaling analysis of interleaved speech-text SLMs.
+ It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from
+ the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
+
+ - **Developed by:** [SLP-RL](https://huggingface.co/slprl)
+ - **Model type:** SpeechLM
+ - **License:** MIT
+ - **Finetuned from model:** [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)
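Some back-of-the-envelope arithmetic may help place the numbers above: a 25 Hz unit tokenizer (implied by the mhubert-25hz name) turns each second of audio into 25 discrete tokens, and the 500 speech units are appended to the text vocabulary. The base vocabulary size below is a placeholder, not Qwen2.5-7B's actual figure.

```python
# Rough arithmetic for the speech-token setup described above.
# Assumptions: 25 units/second (from the tokenizer's "25hz" name);
# base_vocab is a placeholder, NOT Qwen2.5-7B's real vocabulary size.

UNIT_RATE_HZ = 25        # discrete speech units per second of audio
N_SPEECH_TOKENS = 500    # new tokens appended to the text vocabulary

def units_for_audio(seconds):
    """Number of discrete speech tokens a clip of this length becomes."""
    return int(seconds * UNIT_RATE_HZ)

base_vocab = 150_000     # placeholder for the TextLM's original vocab size
extended_vocab = base_vocab + N_SPEECH_TOKENS

print(units_for_audio(10))   # tokens for a 10-second clip
print(extended_vocab)
```

At 25 Hz, speech sequences stay far shorter than raw waveforms, which is part of what makes training such models on a TextLM backbone practical.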
+
+ ### Model Sources
+
+ - **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
+ - **Paper:** [https://arxiv.org/abs/2504.02398](https://arxiv.org/abs/2504.02398)
+ - **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)
+
+ ## Uses
+ This base SpeechLM can be used to generate continuations for speech segments, or cross-modally, e.g. to generate a text continuation for a speech prompt, or as a base for further tuning. See the _SlamKit_
+ [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for generation examples.
+
+ ### Out-of-Scope Use
+ This model was trained on diverse speech datasets; as such, its outputs should not be treated as factual in any way.
+
+
+ ## How to Get Started with the Model
+ We refer users to the official repository for full usage instructions: [GitHub](https://github.com/slp-rl/slamkit).
+
+
+ ## Training Details
+ We highly encourage users to read the full [paper](https://arxiv.org/abs/2504.02398) for training details.
+
+
+ ### Compute Infrastructure
+ #### Hardware
+ This model was trained using 8 NVIDIA H100 GPUs.
+
+ #### Software
+ The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
+ easy and efficient training of Speech Language Models.
+
+ ## Citation
+
+ **BibTeX:**
+ ```
+ @misc{maimon2025scaling,
+ title={Scaling Analysis of Interleaved Speech-Text Language Models},
+ author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
+ year={2025},
+ eprint={2504.02398},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2504.02398},
+ }
  ```