---
license: mit
pipeline_tag: automatic-speech-recognition
library_name: nemo
---

## MahaDhwani Pretrained Conformer

This is a self-supervised, pretrained Conformer encoder model trained on the MahaDhwani dataset.

### Language

The training data covers the 22 scheduled languages of India.

### Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.

### Output

For a given audio sample, this model outputs Conformer encoder embeddings.

## Model Architecture

The encoder is a Conformer-Large model with 120M parameters. It has 17 Conformer blocks and a model dimension of 512.

## AI4Bharat NeMo

To load, train, fine-tune, or experiment with the model, you need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the command below:

```
git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
```

## Usage

Download and load the model from Hugging Face.

```
import pydub
import numpy as np
import torch
import nemo.collections.asr as nemo_asr

# Download the checkpoint from Hugging Face and instantiate the model
model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.freeze()  # inference mode
model = model.to(device)  # transfer model to device
```

Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz, mono.

```
ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
```
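
To confirm that a converted file matches the expected input format, a small check like the one below can help. This is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of NeMo, and it assumes a PCM-encoded WAV file (the default for ffmpeg's `.wav` output).

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file has the expected sample rate and channel count.

    Hypothetical helper for illustration; only PCM WAV files are supported
    by the standard-library `wave` module.
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return rate == expected_rate and channels == expected_channels
```

For example, `check_wav_format("sample_audio_infer_ready.wav")` should return `True` for a file produced by the ffmpeg command above.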

### Inference

```
wavpath = 'sample_audio_infer_ready.wav'  # file produced by the ffmpeg command above
# Load the audio, forcing 16 kHz mono in case the file was not converted
wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
sarray = wav.get_array_of_samples()
fp_arr = np.array(sarray).astype(np.float64)
fp_arr = fp_arr.reshape((1, -1))  # shape: (batch=1, num_samples)
# Use the device selected when the model was loaded instead of hard-coding 'cuda'
feature = torch.from_numpy(fp_arr).float().to(device)
length = torch.tensor([fp_arr.shape[1]]).to(device)

spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)
```
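
If a single fixed-size vector per utterance is needed, the frame-level embeddings can be averaged over time. The sketch below is one possible approach, not something this model card prescribes. The helper name `masked_mean_pool` is hypothetical, and the code assumes `encoded` is laid out as `(batch, dim, time)` with `encoded_len` holding the number of valid frames per item; check `encoded.shape` before relying on that layout.

```python
import numpy as np


def masked_mean_pool(encoded, encoded_len):
    """Average frame embeddings over time, ignoring padded frames.

    Assumes `encoded` has shape (batch, dim, time) and `encoded_len`
    has shape (batch,). Hypothetical helper for illustration.
    """
    encoded = np.asarray(encoded, dtype=np.float32)
    encoded_len = np.asarray(encoded_len)
    num_frames = encoded.shape[2]
    # mask[b, t] is True for valid (non-padded) frames of item b
    mask = np.arange(num_frames)[None, :] < encoded_len[:, None]
    # Zero out padded frames, then sum over the time axis
    summed = (encoded * mask[:, None, :]).sum(axis=2)
    # Guard against division by zero for empty sequences
    counts = np.maximum(encoded_len, 1)[:, None]
    return summed / counts
```

With the outputs from the inference step, this could be called as `masked_mean_pool(encoded.detach().cpu().numpy(), encoded_len.cpu().numpy())`, which returns one vector per audio sample under the assumed layout.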