---
license: mit
pipeline_tag: automatic-speech-recognition
library_name: nemo
---
## MahaDhwani Pretrained Conformer
This is a self-supervised, pre-trained Conformer encoder model trained on the MahaDhwani dataset.
### Language
The training data covers the 22 scheduled languages of India.
### Input
This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
### Output
This model provides conformer encoder embeddings as the output for a given audio sample.
## Model Architecture
The encoder is a Conformer-Large model with 120M parameters. It has 17 conformer blocks with
a model dimension of 512.
## AI4Bharat NeMo
To load, train, fine-tune, or experiment with the model, you need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the following command:
```
git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
```
## Usage
Download and load the model from Hugging Face.
```
import pydub
import numpy as np
import torch
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.freeze() # inference mode
model = model.to(device) # transfer model to device
```
Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz, single-channel (mono) format.
```
ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
```
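To confirm that a converted file matches the expected input format, a small check along these lines may help. This is a minimal sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of NeMo, and it assumes the file is an uncompressed PCM WAV.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file.

    Hypothetical helper: `ok` is True only when the file has the
    expected sample rate and channel count.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    ok = (rate == expected_rate) and (channels == expected_channels)
    return ok, rate, channels
```

For example, `check_wav_format("sample_audio_infer_ready.wav")` should return `True` as its first element if the ffmpeg conversion succeeded.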
### Inference
```
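wavpath = 'sample_audio_infer_ready.wav'  # file produced by the ffmpeg command above
# Load the audio, resampling to 16 kHz mono in case it is not already in that format
wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
sarray = wav.get_array_of_samples()
# Shape the samples as a (1, num_samples) float array, i.e. a batch of one
fp_arr = np.array(sarray).astype(np.float32).reshape((1, -1))
# Use the device selected earlier instead of hard-coding 'cuda'
feature = torch.from_numpy(fp_arr).to(device)
length = torch.tensor([fp_arr.shape[1]]).to(device)
# `encoded` holds the conformer encoder embeddings, `encoded_len` their valid lengths
spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)
```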
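The `encoded` output contains one embedding per frame. If a downstream task needs a single vector per utterance, one common option is masked mean pooling over the time axis. The sketch below illustrates this under two assumptions that you should verify against your installation: `encoded` has shape `(batch, feature_dim, time)`, and `encoded_len` gives the number of valid frames per sample. The helper `mean_pool_embeddings` is hypothetical, not part of NeMo. It operates on NumPy arrays, so tensors would first need converting, for example with `encoded.detach().cpu().numpy()`.

```python
import numpy as np


def mean_pool_embeddings(encoded, encoded_len):
    """Average frame embeddings over time, ignoring padded frames.

    Assumes `encoded` has shape (batch, feature_dim, time) and
    `encoded_len` has shape (batch,). Returns (batch, feature_dim).
    """
    encoded = np.asarray(encoded, dtype=np.float32)
    encoded_len = np.asarray(encoded_len)
    num_frames = encoded.shape[2]
    # mask[b, t] is True where frame t is a valid (non-padded) frame of sample b
    mask = np.arange(num_frames)[None, :] < encoded_len[:, None]
    # Zero out padded frames, then sum over the time axis
    summed = (encoded * mask[:, None, :]).sum(axis=2)
    # Divide by the valid frame count, guarding against zero-length inputs
    return summed / np.maximum(encoded_len, 1)[:, None]
```

Mean pooling is only one choice; depending on the task, the frame-level embeddings may be more useful left unpooled.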