---
language:
- en
library_name: nemo
datasets:
- librispeech_asr
- fisher_corpus
- mozilla-foundation/common_voice_11_0
- National-Singapore-Corpus-Part-1
- vctk
- spgi
- VoxPopuli-(EN)
- Europarl-ASR-(EN)
- Multilingual-LibriSpeech-(2000-hours)
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- CTC
- Transformer
- pytorch
- NeMo
- hf-asr-leaderboard
license: cc-by-4.0
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: stt_en_fastconformer_hybrid_large_pc
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.03
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.07
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech
      type: facebook/multilingual_librispeech
      config: english
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.53
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 8.23
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: National Singapore Corpus
      type: nsc_part_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.6
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fisher
      type: fisher
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.34
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: voxpopuli
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.54
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 7.35
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 9.16
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech
      type: facebook/multilingual_librispeech
      config: english
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 12.65
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 10.1
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: National Singapore Corpus
      type: nsc_part_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 7.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fisher
      type: fisher
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 19.02
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: voxpopuli
      split: test
      args:
        language: en
    metrics:
    - name: Test WER P&C
      type: wer
      value: 6.73
---

# NVIDIA FastConformer-Hybrid Large (en)

<style>
img {
 display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-en-lightgrey#model-badge)](#datasets)


This model transcribes speech into upper- and lower-case English text, along with spaces, periods, commas, and question marks.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters).
See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after installing the latest PyTorch version.
```
pip install nemo_toolkit['all']
```

## How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc")
```

### Transcribing using Python
First, let's get a sample:
```
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```
asr_model.transcribe(['2086-149220-0033.wav'])
```
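
The exact return shape of `transcribe()` can vary across NeMo versions (hybrid models may return a pair of hypothesis lists rather than a plain list of strings). A small, defensive sketch for printing the text, under that assumption:

```python
# Hedged sketch: handle both return shapes seen across NeMo versions.
result = asr_model.transcribe(['2086-149220-0033.wav'])
texts = result[0] if isinstance(result, tuple) else result  # list of transcriptions
for text in texts:
    print(text)
```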

### Transcribing many audio files

Using Transducer mode inference:
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_en_fastconformer_hybrid_large_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```

Using CTC mode inference:
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_en_fastconformer_hybrid_large_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 decoder_type="ctc"
```

### Input

This model accepts 16000 Hz (16 kHz) mono-channel audio (wav files) as input.
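
If your recordings are not already 16 kHz mono, they can be converted first. A minimal sketch, assuming `librosa` and `soundfile` are installed; the file names are placeholders:

```python
import librosa
import soundfile as sf

# Load an audio file, resampling to 16 kHz and downmixing to mono.
audio, sr = librosa.load("recording_44k.wav", sr=16000, mono=True)

# Write a 16 kHz mono wav file that the model can consume directly.
sf.write("recording_16k_mono.wav", audio, sr)
```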

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

FastConformer is an optimized version of the Conformer model [1] with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) and about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
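
Because the hybrid model carries both decoders, you can also switch between them directly in Python rather than through `transcribe_speech.py`. A minimal sketch, assuming your NeMo version exposes `change_decoding_strategy` with a `decoder_type` argument on hybrid models:

```python
# Hedged sketch: switch the already-loaded hybrid model to its CTC decoder.
asr_model.change_decoding_strategy(decoder_type="ctc")
print(asr_model.transcribe(['2086-149220-0033.wav']))

# Switch back to the default Transducer (RNNT) decoder.
asr_model.change_decoding_strategy(decoder_type="rnnt")
```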

## Training

The NeMo toolkit [3] was used to train the models for several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe.yaml).

The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
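
If you prefer to fine-tune from Python rather than via the example script above, a rough sketch is shown below. The manifest paths, batch sizes, and epoch count are placeholders, and the exact dataloader and optimizer settings required can differ between NeMo versions:

```python
# Hedged fine-tuning sketch (not the official recipe); all values are illustrative.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)

# NeMo manifests are JSON-lines files; each line holds "audio_filepath",
# "duration", and "text" for one utterance.
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # hypothetical path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",  # hypothetical path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})
asr_model.setup_training_data(train_cfg)
asr_model.setup_validation_data(val_cfg)

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=5)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```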

### Datasets

The model in this collection is trained on a composite dataset (NeMo ASRSet En PC) comprising several thousand hours of English speech:

- LibriSpeech (874 hrs)
- Fisher (998 hrs)
- MCV11 (1474 hrs)
- NSC1 (1381 hrs)
- VCTK (82 hrs)
- VoxPopuli (353 hrs)
- Europarl-ASR (763 hrs)
- MLS (1860 hrs)
- SPGI (795 hrs)


## Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER). Because this model is trained on multiple domains and a larger corpus than single-domain models, it generally performs well across a wide range of audio.
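
For reference, WER on your own data can be computed with NeMo's metric utility. A minimal sketch; the reference and hypothesis strings below are placeholders for your own transcripts and model outputs:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder reference transcripts and model outputs (replace with your own data).
references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumps over a lazy dog"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
print(f"WER: {wer:.2%}")
```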

The following tables summarize the performance of the available models in this collection with the Transducer decoder. Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.


a) On data without Punctuation and Capitalization with Transducer decoder

|**Version** | **Tokenizer** | **Vocabulary Size** | **MCV11 DEV** | **MCV11 TEST** | **MLS DEV** | **MLS TEST** | **VOXPOPULI DEV** | **VOXPOPULI TEST** | **EUROPARL DEV** | **EUROPARL TEST** | **FISHER DEV** | **FISHER TEST** | **SPGI DEV** | **SPGI TEST** | **LIBRISPEECH DEV CLEAN** | **LIBRISPEECH TEST CLEAN** | **LIBRISPEECH DEV OTHER** | **LIBRISPEECH TEST OTHER** | **NSC DEV** | **NSC TEST** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1.18.0 | SentencePiece Unigram | 1024 | 7.39 | 8.23 | 4.48 | 4.53 | 4.22 | 4.54 | 9.69 | 8.02 | 10.53 | 10.34 | 2.32 | 2.26 | 1.74 | 2.03 | 4.02 | 4.07 | 4.71 | 4.6 |

b) On data with Punctuation and Capitalization with Transducer decoder

|**Version** | **Tokenizer** | **Vocabulary Size** | **MCV11 DEV** | **MCV11 TEST** | **MLS DEV** | **MLS TEST** | **VOXPOPULI DEV** | **VOXPOPULI TEST** | **EUROPARL DEV** | **EUROPARL TEST** | **FISHER DEV** | **FISHER TEST** | **SPGI DEV** | **SPGI TEST** | **LIBRISPEECH DEV CLEAN** | **LIBRISPEECH TEST CLEAN** | **LIBRISPEECH DEV OTHER** | **LIBRISPEECH TEST OTHER** | **NSC DEV** | **NSC TEST** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1.18.0 | SentencePiece Unigram | 1024 | 9.32 | 10.1 | 9.73 | 12.65 | 6.72 | 6.73 | 14.55 | 12.52 | 19.14 | 19.02 | 5.25 | 5.06 | 6.74 | 7.35 | 8.98 | 9.16 | 9.77 | 7.19 |

## Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech. The model only outputs the punctuation marks ```'.', ',', '?' ``` and hence might not do well in scenarios where other punctuation is also expected.

## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:

* World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
* Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
* Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Although this model isn't supported yet by Riva, the [list of supported models is here](https://huggingface.co/models?other=Riva).
Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).

## References

[1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.