	---
	license: mit
	pipeline_tag: automatic-speech-recognition
	library_name: nemo
	---
	## MahaDhwani Pretrained Conformer

	This is a self-supervised, pre-trained Conformer encoder model trained on the MahaDhwani dataset.

	### Language

	The training data covers the 22 scheduled languages of India.

	### Input

	This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.

	### Output

	This model provides conformer encoder embeddings as the output for a given audio sample.

	## Model Architecture

	The encoder is a Conformer-Large model with 120M parameters. It has 17 Conformer blocks with
	a model dimension of 512.


	## AI4Bharat NeMo

	To load, train, fine-tune, or experiment with the model, you need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the following command:
	```
	git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
	```

	## Usage
	Download and load the model from Hugging Face.
	```
	import pydub
	import numpy as np
	import torch
	import nemo.collections.asr as nemo_asr

	model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.freeze() # inference mode
	model = model.to(device) # transfer model to device
	```
	Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz with a single (mono) channel.
	```
	ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
	```
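Before running inference, it can help to confirm that the converted file really is 16000 Hz mono. The following is a minimal sketch using only Python's standard `wave` module; the helper name `is_model_ready` is hypothetical and not part of NeMo, and the check only works for PCM WAV files. The demo writes one second of silence to a temporary file so the snippet runs on its own.

```python
import os
import tempfile
import wave


def is_model_ready(path, rate=16000, channels=1):
    """Return True if the PCM WAV file at `path` has the expected
    sample rate and channel count (hypothetical helper, not a NeMo API)."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate() == rate and wf.getnchannels() == channels


# Demo: write one second of 16-bit silence at 16000 Hz mono, then check it.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "silence.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)   # 16000 Hz
    wf.writeframes(b"\x00\x00" * 16000)

print(is_model_ready(demo_path))  # True
```

In practice you would call `is_model_ready("sample_audio_infer_ready.wav")` on the output of the ffmpeg command.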

	### Inference
	```
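	wavpath = 'sample_audio_infer_ready.wav'  # output of the ffmpeg command above
	# Load the audio; resampling to 16000 Hz mono here acts as a safeguard
	wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
	sarray = wav.get_array_of_samples()
	fp_arr = np.array(sarray).astype(np.float64)
	fp_arr = fp_arr.reshape((1, -1))  # shape: (batch=1, num_samples)
	# Use the `device` selected when loading the model instead of hard-coding 'cuda'
	feature = torch.from_numpy(fp_arr).float().to(device)
	length = torch.tensor([fp_arr.shape[1]]).to(device)

	# Forward pass; `encoded` holds the encoder embeddings
	spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)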
	```
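
The `encoded` output is a sequence of frame-level embeddings. If a single fixed-size vector per clip is needed, one common option is to average over the valid frames. The sketch below illustrates this with NumPy on dummy data. It assumes `encoded` has shape `(batch, dim, time)`, which is the convention in upstream NeMo's Conformer encoder but is not confirmed here for this checkpoint, and that `encoded_len` gives the number of valid frames per clip. The function name `mean_pool` is hypothetical.

```python
import numpy as np


def mean_pool(encoded, encoded_len):
    """Average frame embeddings over time, ignoring padded frames.

    Assumes `encoded` has shape (batch, dim, time) and `encoded_len`
    has shape (batch,). Hypothetical helper, not a NeMo API.
    """
    time = encoded.shape[2]
    # mask[b, t] is True for valid (non-padded) frames of clip b
    mask = np.arange(time)[None, :] < encoded_len[:, None]
    summed = (encoded * mask[:, None, :]).sum(axis=2)
    # Guard against division by zero for empty clips
    return summed / np.maximum(encoded_len, 1)[:, None]


# Dummy data standing in for real encoder output: 2 clips, 512 dims, 10 frames
rng = np.random.default_rng(0)
dummy_encoded = rng.standard_normal((2, 512, 10)).astype(np.float32)
dummy_len = np.array([10, 6])  # the second clip has 4 padded frames

pooled = mean_pool(dummy_encoded, dummy_len)
print(pooled.shape)  # (2, 512)
```

With real model output, the tensors would first be moved to NumPy, for example `encoded.cpu().numpy()` and `encoded_len.cpu().numpy()`.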