---
license: cc-by-4.0
language:
- ru
library_name: nemo
datasets:
- rulibrispeech
- common_voice_21_ru
tags:
- automatic-speech-recognition
- automatic-speech-translation
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
---

# Canary 180M Flash

<style>
img {
 display: inline;
}
</style>

## Description:

NVIDIA NeMo Canary Flash [1] is a family of multilingual, multitask models based on the Canary architecture [2] that achieve state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and an inference speed of more than 1200 RTFx (on the open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC).

Additionally, canary-180m-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish.

This model is released under the permissive CC-BY-4.0 license and is available for commercial use.

## Model Architecture:

Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer decoder [4]. With audio features extracted from the encoder, task tokens such as \<target language\>, \<task\>, \<toggle timestamps\> and \<toggle PnC\> are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from individual SentencePiece [6] tokenizers for each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, for a total of 182M parameters. For more details about the architecture, please refer to [1].

## NVIDIA NeMo

To train, fine-tune, or transcribe with canary-180m-flash, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

## How to Use this Model

The model is available for use in the NeMo framework [7] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Please refer to [our tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Canary_Multitask_Speech_Model.ipynb) for more details.

A few inference examples are listed below:

### Loading the Model

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-180m-flash')

# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)
```

## Input:

**Input Type(s):** Audio <br>
**Input Format(s):** .wav or .flac files<br>
**Input Parameter(s):** 1D <br>
**Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>

Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.

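Because the model expects 16000 Hz mono-channel audio, it can be useful to check files before passing them in. The following is a minimal sketch using Python's standard `wave` module; the helper name `check_wav_format` is hypothetical, and the check only covers PCM `.wav` files (not `.flac`).

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file.

    Hypothetical helper, not part of NeMo: it only reads the WAV header
    and compares it with the 16 kHz mono format listed above.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

A file for which `ok` is `False` would need to be resampled or downmixed with a tool of your choice before inference.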

### Inference with canary-180m-flash:

If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it. That is, the default behavior of canary-180m-flash is English ASR.

```python
output = canary_model.transcribe(
    ['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
    pnc='True',  # generate output with Punctuation and Capitalization
)

predicted_text = output[0].text
```

canary-180m-flash can also predict word-level and segment-level timestamps:

```python
output = canary_model.transcribe(
    ['filepath.wav'],
    timestamps=True,  # generate output with timestamps
)

predicted_text = output[0].text
word_level_timestamps = output[0].timestamp['word']
segment_level_timestamps = output[0].timestamp['segment']
```

To predict timestamps for audio files longer than 10 seconds, we recommend using the longform inference script (explained in the next section) with `chunk_len_in_secs=10.0`.

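As an illustration of how the timestamp output might be consumed, here is a small sketch that formats entries as readable lines. It assumes each entry is a dictionary with `start` and `end` times in seconds plus a text field named `word` (or `segment`); these key names are an assumption, not confirmed by this card, so inspect the actual output before relying on them. The sample data is made up.

```python
def format_timestamps(entries, text_key="word"):
    """Format timestamp entries as 'start - end : text' lines.

    Assumes each entry is a dict with 'start', 'end' (seconds) and
    `text_key`; these key names are an assumption about the output shape.
    """
    lines = []
    for entry in entries:
        lines.append(f"{entry['start']:.2f}s - {entry['end']:.2f}s : {entry[text_key]}")
    return lines


# Hypothetical sample in the assumed shape; real entries would come
# from output[0].timestamp['word'].
sample = [
    {"word": "hello", "start": 0.0, "end": 0.32},
    {"word": "world", "start": 0.4, "end": 0.8},
]
for line in format_timestamps(sample):
    print(line)
```

For segment-level entries, the same helper could be called with `text_key="segment"`, under the same assumption about key names.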

To use canary-180m-flash to transcribe other supported languages, perform speech-to-text translation, or produce word-level timestamps, specify the input as a jsonl manifest file, where each line in the file is a dictionary containing the following fields:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
    "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "timestamp": "yes",  # whether to output word-level timestamps, choices=['yes', 'no']
}
```

and then use:

```python
output = canary_model.transcribe(
    "<path to input manifest file>",
    batch_size=16,  # batch size to run the inference with
)
```
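
The manifest described above can also be generated programmatically. The sketch below writes one JSON object per line using the field names from the example; the helper name `write_manifest` and its default values are hypothetical and chosen only for illustration.

```python
import json


def write_manifest(path, audio_paths, source_lang="en", target_lang="en",
                   pnc="yes", timestamp="no"):
    """Write a jsonl manifest with one entry per audio file.

    Hypothetical helper: the field names follow the manifest example
    above; the defaults are illustrative.
    """
    with open(path, "w", encoding="utf-8") as f:
        for audio_path in audio_paths:
            entry = {
                "audio_filepath": audio_path,
                "source_lang": source_lang,  # language of the audio input
                "target_lang": target_lang,  # same as source_lang for ASR
                "pnc": pnc,                  # 'yes' or 'no'
                "timestamp": timestamp,      # 'yes' or 'no'
            }
            # Each entry must occupy exactly one line in a jsonl file.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path
```

For example, `write_manifest("input_manifest.json", ["/path/to/audio.wav"], source_lang="de", target_lang="en")` would describe a German-to-English translation request, and the resulting path can then be passed to `canary_model.transcribe`.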

### Longform inference with canary-180m-flash:

Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the [speech_to_text_aed_chunked_infer.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py) script, which handles chunking, performs inference on the chunked files, and stitches the transcripts.

The script will perform inference on all `.wav` files in `audio_dir`. Alternatively, you can pass a path to a manifest file as shown above. The decoded output will be saved at `output_json_path`.

```
python scripts/speech_to_text_aed_chunked_infer.py \
    pretrained_name="nvidia/canary-180m-flash" \
    audio_dir=$audio_dir \
    output_filename=$output_json_path \
    chunk_len_in_secs=40.0 \
    batch_size=1 \
    decoding.beam.beam_size=1 \
    timestamps=False
```

**Note** that for longform inference with timestamps, it is recommended to use `chunk_len_in_secs` of 10 seconds.

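To give a feel for what `chunk_len_in_secs` controls, the sketch below splits a recording's duration into consecutive, non-overlapping windows. This is only an illustration of the idea: the function name `chunk_boundaries` is hypothetical, and the actual NeMo script may handle chunk boundaries and transcript stitching differently.

```python
def chunk_boundaries(duration_secs, chunk_len_in_secs=40.0):
    """Return (start, end) pairs covering [0, duration_secs].

    Illustrative only: simple non-overlapping windows, which is an
    assumption and not necessarily how the NeMo script chunks audio.
    """
    if chunk_len_in_secs <= 0:
        raise ValueError("chunk_len_in_secs must be positive")
    boundaries = []
    start = 0.0
    while start < duration_secs:
        end = min(start + chunk_len_in_secs, duration_secs)
        boundaries.append((start, end))
        start = end
    return boundaries


# A 95-second file with 40-second chunks yields three windows.
print(chunk_boundaries(95.0, 40.0))
```

With the 10-second setting recommended for timestamps, the same 95-second file would be split into ten windows instead of three.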

## Output:

**Output Type(s):** Text <br>
**Output Format:** Text output as a string (w/ timestamps) depending on the task chosen for decoding <br>
**Output Parameters:** 1-Dimensional text string <br>
**Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters <br>

## License/Terms of Use:

canary-180m-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>