Canary 180M Flash

Description:

NVIDIA NeMo Canary Flash [1] is a family of multilingual, multitask models based on the Canary architecture [2] that achieves state-of-the-art performance on multiple speech benchmarks. With 182 million parameters and an inference speed of more than 1200 RTFx (on open-asr-leaderboard sets), canary-180m-flash supports automatic speech-to-text recognition (ASR) in four languages (English, German, French, Spanish) and translation from English to German/French/Spanish and from German/French/Spanish to English, with or without punctuation and capitalization (PnC). Additionally, canary-180m-flash offers an experimental feature for word-level and segment-level timestamps in English, German, French, and Spanish. This model is released under the permissive CC-BY-4.0 license and is available for commercial use.
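
To put the throughput figure in perspective: RTFx is the inverse real-time factor, i.e., seconds of audio processed per second of compute. The snippet below is plain arithmetic on that definition. The helper name processing_seconds is hypothetical, and real-world speed depends on hardware and batch settings.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Time needed to process `audio_seconds` of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the 1200 RTFx quoted above.
one_hour = 3600.0
estimate = processing_seconds(one_hour, 1200.0)
print(f"{estimate:.1f} s")
```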

Model Architecture:

Canary is an encoder-decoder model with a FastConformer [3] encoder and a Transformer decoder [4]. With audio features extracted from the encoder, task tokens such as <target language>, <task>, <toggle timestamps> and <toggle PnC> are fed into the Transformer decoder to trigger the text generation process. Canary uses a concatenated tokenizer [5] built from individual SentencePiece [6] tokenizers for each language, which makes it easy to scale up to more languages. The canary-180m-flash model has 17 encoder layers and 4 decoder layers, leading to a total of 182M parameters. For more details about the architecture, please refer to [1].

NVIDIA NeMo

To train, fine-tune or transcribe with canary-180m-flash, you will need to install NVIDIA NeMo.

How to Use this Model

The model is available for use in the NeMo framework [7], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Please refer to our tutorial for more details.

A few inference examples are listed below:

Loading the Model

from nemo.collections.asr.models import EncDecMultiTaskModel
# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-180m-flash')
# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

Input:

Input Type(s): Audio
Input Format(s): .wav or .flac files
Input Parameter(s): 1D
Other Properties Related to Input: 16000 Hz Mono-channel Audio, Pre-Processing Not Needed
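
Since the model expects 16000 Hz mono audio, it can help to verify files before transcription. The following is a minimal sketch using only Python's standard wave module. The helper name is_16k_mono_wav is hypothetical, and the check covers PCM .wav files only (not .flac).

```python
import wave


def is_16k_mono_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is mono and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
```

For example, is_16k_mono_wav('path1.wav') returns False for a stereo or 44.1 kHz recording, which would then need resampling or downmixing with a tool of your choice.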

Input to canary-180m-flash can be either a list of paths to audio files or a jsonl manifest file.

Inference with canary-180m-flash:

If the input is a list of paths, canary-180m-flash assumes that the audio is English and transcribes it; that is, its default behavior is English ASR.

output = canary_model.transcribe(
    ['path1.wav', 'path2.wav'],
    batch_size=16,  # batch size to run the inference with
    pnc='True',  # generate output with punctuation and capitalization (PnC)
)

predicted_text = output[0].text

canary-180m-flash can also predict word-level and segment-level timestamps:

output = canary_model.transcribe(
  ['filepath.wav'],
  timestamps=True,  # generate output with timestamps
)

predicted_text = output[0].text
word_level_timestamps = output[0].timestamp['word']
segment_level_timestamps = output[0].timestamp['segment']
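
The timestamp lists can then be rendered for inspection. The sketch below assumes each entry is a dictionary with 'start' and 'end' times in seconds plus a text field named 'word' or 'segment'. That layout is an assumption, not something this card confirms, so check the keys of your own output first. The helper format_stamps and the sample data are hypothetical.

```python
def format_stamps(stamps, text_key):
    """Render timestamp entries as 'start - end : text' lines.

    Assumes each entry has 'start', 'end' (seconds) and `text_key` fields.
    """
    return [f"{s['start']:.2f}s - {s['end']:.2f}s : {s[text_key]}" for s in stamps]


# Illustrative data only, not real model output.
example_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.48, "end": 0.96},
]
for line in format_stamps(example_words, "word"):
    print(line)
```

With real output, the same helper would be called as format_stamps(word_level_timestamps, 'word'), provided the assumed keys are present.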

To predict timestamps for audio files longer than 10 seconds, we recommend using the longform inference script (explained in the next section) with chunk_len_in_secs=10.0.

To use canary-180m-flash to transcribe other supported languages, perform speech-to-text translation, or produce word-level timestamps, specify the input as a jsonl manifest file, where each line in the file is a dictionary containing the following fields:

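# Example of a line in input_manifest.json (shown across several lines with comments for readability; in the actual file each entry is plain JSON on a single line)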
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "source_lang": "en",  # language of the audio input, set `source_lang`==`target_lang` for ASR, choices=['en','de','es','fr']
    "target_lang": "en",  # language of the text output, choices=['en','de','es','fr']
    "pnc": "yes",  # whether to have PnC output, choices=['yes', 'no']
    "timestamp": "yes", # whether to output word-level timestamps, choices=['yes', 'no']
}

and then use:

output = canary_model.transcribe(
    "<path to input manifest file>",
    batch_size=16,  # batch size to run the inference with
)
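
A manifest in the format described above can be generated with the standard json module. This is a minimal sketch: the helper names to_jsonl and write_manifest are hypothetical, and the entry shown (German audio translated to English) uses a placeholder path.

```python
import json


def to_jsonl(entries):
    """Serialize entries as jsonl: one JSON object per line."""
    return "".join(json.dumps(entry) + "\n" for entry in entries)


def write_manifest(entries, manifest_path):
    """Write the entries to `manifest_path` in jsonl form."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(entries))


entries = [
    {
        "audio_filepath": "/path/to/audio.wav",  # placeholder path
        "source_lang": "de",  # language spoken in the audio
        "target_lang": "en",  # differs from source_lang, so this is translation
        "pnc": "yes",
        "timestamp": "no",
    },
]
print(to_jsonl(entries), end="")
```

Calling write_manifest(entries, 'input_manifest.json') produces a file whose path can be passed to canary_model.transcribe as shown above.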

Longform inference with canary-180m-flash:

Canary models are designed to handle input audio shorter than 40 seconds. To handle longer audio, NeMo includes the speech_to_text_aed_chunked_infer.py script, which chunks the audio, performs inference on the chunks, and stitches the transcripts together.

The script performs inference on all .wav files in audio_dir. Alternatively, you can pass a path to a manifest file as shown above. The decoded output is saved at output_json_path.

python scripts/speech_to_text_aed_chunked_infer.py \
    pretrained_name="nvidia/canary-180m-flash" \
    audio_dir=$audio_dir \
    output_filename=$output_json_path \
    chunk_len_in_secs=40.0 \
    batch_size=1 \
    decoding.beam.beam_size=1 \
    timestamps=False

Note that for longform inference with timestamps, it is recommended to use chunk_len_in_secs of 10 seconds.
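
To get a feel for how chunk_len_in_secs affects the number of chunks, the sketch below splits a recording into consecutive fixed-length windows. It is an illustration only and not the logic of speech_to_text_aed_chunked_infer.py, which may treat chunk boundaries differently. The helper chunk_bounds is hypothetical.

```python
def chunk_bounds(total_secs: float, chunk_len_in_secs: float):
    """Split [0, total_secs) into consecutive windows of at most chunk_len_in_secs."""
    if chunk_len_in_secs <= 0:
        raise ValueError("chunk_len_in_secs must be positive")
    bounds = []
    start = 0.0
    while start < total_secs:
        end = min(start + chunk_len_in_secs, total_secs)
        bounds.append((start, end))
        start = end
    return bounds


# A 95-second recording with 40-second chunks versus 10-second chunks.
print(chunk_bounds(95.0, 40.0))
print(len(chunk_bounds(95.0, 10.0)))
```

Under this simple scheme, the 10-second setting recommended for timestamps yields roughly four times as many chunks as the 40-second setting for the same audio.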

Output:

Output Type(s): Text
Output Format: Text output as a string (w/ timestamps) depending on the task chosen for decoding
Output Parameters: 1-Dimensional text string
Other Properties Related to Output: May Need Inverse Text Normalization; Does Not Handle Special Characters

License/Terms of Use:

canary-180m-flash is released under the CC-BY-4.0 license. By using this model, you are agreeing to the terms and conditions of the license.