---
license: mit
pipeline_tag: automatic-speech-recognition
library_name: nemo
---
  ## MahaDhwani Pretrained Conformer

  This is a self-supervised, pretrained Conformer encoder trained on the MahaDhwani dataset.

  ### Language

  The training data covers the 22 scheduled languages of India.

  ### Input

  This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.

  ### Output

  This model provides conformer encoder embeddings as the output for a given audio sample.

  ## Model Architecture

  The encoder is a Conformer-Large model with 120M parameters. It has 17 Conformer blocks with a model dimension of 512.
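
As a rough sanity check on these figures, the sketch below estimates the parameter count of the Conformer blocks alone. Only the 17 blocks and the model dimension of 512 come from this card. The feed-forward expansion factor of 4, the convolution kernel size of 31, and the block layout (two feed-forward modules, one self-attention module, one convolution module) are assumptions based on a typical Conformer configuration, not values confirmed for this model.

```python
def conformer_block_params(d_model=512, ff_expansion=4, conv_kernel=31):
    """Approximate parameter count of one Conformer block.

    ff_expansion and conv_kernel are assumed values, not taken from the model card.
    Layer norms, batch norm and relative positional parameters are ignored.
    """
    d_ff = d_model * ff_expansion
    # Two feed-forward modules, each with two linear layers (weights + biases)
    ffn = 2 * ((d_model * d_ff + d_ff) + (d_ff * d_model + d_model))
    # Self-attention: query, key, value and output projections
    attn = 4 * (d_model * d_model + d_model)
    # Convolution module: pointwise (to 2*d_model for GLU), depthwise, pointwise
    conv = (
        (d_model * 2 * d_model + 2 * d_model)
        + (d_model * conv_kernel + d_model)
        + (d_model * d_model + d_model)
    )
    return ffn + attn + conv


def encoder_block_params(n_blocks=17, **kwargs):
    """Total parameters across all Conformer blocks (front end not included)."""
    return n_blocks * conformer_block_params(**kwargs)
```

Under these assumptions, `encoder_block_params()` gives about 103M parameters for the 17 blocks. Components not modeled here, such as the input subsampling layers and normalization, would presumably account for the rest of the stated 120M.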


  ## AI4Bharat NeMo

  To load, train, fine-tune, or experiment with the model, you need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the command below:
  ```
  git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
  ```

  ## Usage
  Download and load the model from Hugging Face.
  ```
  import pydub
  import numpy as np
  import torch
  import nemo.collections.asr as nemo_asr

  model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.freeze() # inference mode
  model = model.to(device) # transfer model to device
  ```
  Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz mono.
  ```
  ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
  ```
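
To confirm that a converted file matches the expected input format, a minimal sketch using Python's standard `wave` module is shown here. The helper name `check_wav_format` is hypothetical, and the sketch assumes the file is an uncompressed PCM WAV, which is the only kind `wave` can read.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file.

    ok is True only when the file has the expected sample rate and channel count.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

Calling `check_wav_format("sample_audio_infer_ready.wav")` on the converted file should return `True` as its first element if the conversion produced 16000 Hz mono audio.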

  ### Inference
  ```
  wavpath = 'sample_audio_infer_ready.wav'  # file produced by the ffmpeg command above
  # Load the audio and enforce 16000 Hz mono
  wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
  sarray = wav.get_array_of_samples()
  fp_arr = np.array(sarray).T.astype(np.float64)
  fp_arr = fp_arr.reshape((1,-1))  # shape: (batch=1, num_samples)
  # Use the device selected when loading the model instead of hard-coding 'cuda'
  feature = torch.from_numpy(fp_arr).float().to(device)
  length = torch.tensor([fp_arr.shape[1]]).to(device)

  spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)
  ```
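
The `encoded` output contains one embedding per encoder frame. If a single vector per utterance is needed, one option is to average the valid frames. The sketch below is not part of the model's API: `masked_mean_pool` is a hypothetical helper, and it assumes `encoded` has shape `(batch, dim, time)` and `encoded_len` has shape `(batch,)`. Check `encoded.shape` before relying on that layout.

```python
import torch


def masked_mean_pool(encoded, encoded_len):
    """Average encoder frames over time, ignoring padded frames.

    Assumes encoded is (batch, dim, time) and encoded_len is (batch,).
    Returns a tensor of shape (batch, dim).
    """
    time_steps = encoded.shape[-1]
    positions = torch.arange(time_steps, device=encoded.device)
    # mask[b, t] is True where frame t is within the valid length of item b
    mask = positions[None, :] < encoded_len.to(encoded.device)[:, None]
    mask = mask.unsqueeze(1).to(encoded.dtype)   # (batch, 1, time)
    summed = (encoded * mask).sum(dim=-1)        # (batch, dim)
    counts = mask.sum(dim=-1).clamp(min=1)       # (batch, 1), avoids division by zero
    return summed / counts
```

With the variables from the inference snippet, `utterance_embedding = masked_mean_pool(encoded, encoded_len)` would give one vector per audio sample. Mean pooling is only one simple choice; the frame-level embeddings can also be used directly.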