---
license: cc-by-4.0
language:
- en
- de
- fr
- es
library_name: nemo
datasets:
- rulibrispeech
- common_voice_21_ru
tags:
- automatic-speech-recognition
- automatic-speech-translation
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
---
# Canary 180M Flash
<style>
img {
display: inline;
}
</style>
## Description:
NVIDIA NeMo Canary Flash [1] is a family of multilingual multi-tasking models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and an inference speed of more than 1200 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).
Additionally, canary-180m-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish.
This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
## Model Architecture:
Canary is an encoder-decoder model with FastConformer [3] Encoder and Transformer Decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer Decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] from individual SentencePiece [6] tokenizers of each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, leading to a total of 182M parameters. For more details about the architecture, please refer to [1].
## NVIDIA NeMo
To train, fine-tune or transcribe with canary-180m-flash, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
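NeMo is typically installed from PyPI; the command below is one common way to get the toolkit with its ASR dependencies (check the NeMo repository for the currently recommended installation instructions and version constraints):
```bash
# Install the NeMo toolkit with ASR dependencies (verify the exact command and
# version pin against the NVIDIA NeMo repository before use).
pip install -U "nemo_toolkit[asr]"
```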
## How to Use this Model
The model is available for use in the NeMo framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.
A few inference examples are listed below:
### Loading the Model
```python
from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-180m-flash')
# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
```
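As a quick sanity check after loading, you can confirm the parameter count mentioned above (~182M). This is a minimal sketch relying only on standard PyTorch attribute access:
```python
# Count the parameters of the loaded model; expected to be roughly 182M.
num_params = sum(p.numel() for p in canary_model.parameters())
print(f"Total parameters: {num_params / 1e6:.1f}M")
```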
## Input:
**Input Type(s):** Audio <br>
**Input Format(s):** .wav or .flac files<br>
**Input Parameter(s):** 1D <br>
**Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>
Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.
### Inference with canary-180m-flash:
If the input is a list of paths, canary-180m-flash assumes the audio is English and transcribes it; that is, the model's default behavior is English ASR.
```python
output = canary_model.transcribe(
    ['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
    pnc='True',  # generate output with Punctuation and Capitalization
)
predicted_text = output[0].text
```
canary-180m-flash can also predict word-level and segment-level timestamps:
```python
output = canary_model.transcribe(
    ['filepath.wav'],
    timestamps=True,  # generate output with timestamps
)
predicted_text = output[0].text
word_level_timestamps = output[0].timestamp['word']
segment_level_timestamps = output[0].timestamp['segment']
```
To predict timestamps for audio files longer than 10 seconds, we recommend using the longform inference script (explained in the next section) with `chunk_len_in_secs=10.0`.
To use canary-180m-flash for transcribing other supported languages, performing speech-to-text translation, or producing word-level timestamps, specify the input as a JSONL manifest file, where each line in the file is a dictionary containing the following fields:
```yaml
# Example of a line in input_manifest.json
{
"audio_filepath": "/path/to/audio.wav", # path to the audio file
"source_lang": "en", # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
"target_lang": "en", # language of the text output, choices=['en','de','es','fr']
"pnc": "yes", # whether to have PnC output, choices=['yes', 'no']
"timestamp": "yes", # whether to output word-level timestamps, choices=['yes', 'no']
}
```
and then use:
```python
output = canary_model.transcribe(
"<path to input manifest file>",
batch_size=16, # batch size to run the inference with
)
```
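For reference, such a manifest can be written with the Python standard library, one JSON object per line. The sketch below uses placeholder paths and the field values documented above:
```python
import json

# Hypothetical entries; adjust paths, languages, and options for your data.
entries = [
    {
        "audio_filepath": "/path/to/audio.wav",
        "source_lang": "en",   # language spoken in the audio
        "target_lang": "de",   # same as source_lang for ASR, different for translation
        "pnc": "yes",          # punctuation and capitalization in the output
        "timestamp": "no",     # word-level timestamps
    },
]

with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```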
### Longform inference with canary-180m-flash:
Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which chunks the audio, runs inference on the chunks, and stitches the transcripts together.
The script performs inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.
```bash
python scripts/speech_to_text_aed_chunked_infer.py \
pretrained_name="nvidia/canary-180m-flash" \
audio_dir=$audio_dir \
output_filename=$output_json_path \
chunk_len_in_secs=40.0 \
batch_size=1 \
decoding.beam.beam_size=1 \
timestamps=False
```
**Note** that for longform inference with timestamps, it is recommended to use `chunk_len_in_secs` of 10 seconds.
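For example, a longform run with timestamps enabled could look like the following (identical to the command above except for `chunk_len_in_secs` and `timestamps`):
```bash
python scripts/speech_to_text_aed_chunked_infer.py \
pretrained_name="nvidia/canary-180m-flash" \
audio_dir=$audio_dir \
output_filename=$output_json_path \
chunk_len_in_secs=10.0 \
batch_size=1 \
decoding.beam.beam_size=1 \
timestamps=True
```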
## Output:
**Output Type(s):** Text <br>
**Output Format:** Text output as a string (with or without timestamps), depending on the task chosen for decoding <br>
**Output Parameters:** 1-Dimensional text string <br>
**Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters <br>
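If downstream use requires formatted numerals, dates, or currencies, the transcripts can be post-processed with an inverse text normalization (ITN) tool. Below is a hedged sketch using the separate `nemo_text_processing` package (its own install, with a pynini dependency; the exact API may vary between versions):
```python
# Inverse text normalization sketch; requires the nemo_text_processing package.
# Class and method names reflect one released version and may differ in yours.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

itn = InverseNormalizer(lang="en")
print(itn.inverse_normalize("it costs twenty five dollars", verbose=False))
# e.g. "it costs $25" (exact output depends on the grammar version)
```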
## License/Terms of Use:
canary-180m-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>