Model Card for MultiBridge/wav2vec-LnNor-IPA-ft
This model is built for phoneme recognition tasks. It was developed by fine-tuning the wav2vec2 base model on the TIMIT and LnNor datasets. Its predictions are in IPA.
Model Details
Model Description
- Developed by: MultiBridge
- Funded by [optional]: EEA Financial Mechanism and Norwegian Financial Mechanism
- Shared by [optional]: MultiBridge
- Model type: Transformer
- Language(s) (NLP): English
- License: cc-by-4.0
- Finetuned from model [optional]: facebook/wav2vec2-base
Uses
Automatic phonetic transcription: Converting raw speech into phoneme sequences.
Speech processing applications: Serving as a component in speech processing pipelines or prototyping.
Bias, Risks, and Limitations
Data specificity: By excluding recordings shorter than 2 seconds or longer than 30 seconds, and labels with fewer than 5 phonemes, some natural speech variation is ignored, which may affect the model's performance in real-world applications. The model's performance is also shaped by the characteristics of the TIMIT and LnNor datasets, which can introduce biases, especially when the target application involves speakers or dialects not well represented in these datasets. In particular, LnNor contains non-native speech with automatically generated annotations that reflect canonical rather than actual pronunciation, so the model may fail to accurately transcribe non-native speech.
Frozen encoder: Freezing the encoder retains useful pre-learned features but also prevents the model from fully adapting to the new datasets.
Recommendations
Evaluate the model's performance for your specific use case.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch
# load model and processor
processor = Wav2Vec2Processor.from_pretrained("MultiBridge/wav2vec-LnNor-IPA-ft")
model = Wav2Vec2ForCTC.from_pretrained("MultiBridge/wav2vec-LnNor-IPA-ft")
# load dummy dataset and read soundfiles
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", split="validation")
# preprocess (the model expects 16 kHz audio)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values
# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
# => should give ['mɪstɝkwɪltɝɪzðəəpɑslʌvðəmɪdəlklæsəzændwiɑəɡlædtəwɛlkəmhɪzɡɑspəl'] for MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
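The model expects 16 kHz mono input. If your audio was recorded at a different sampling rate, resample it first. A minimal sketch using torchaudio (an assumption; any resampling library will do, and the file name below is hypothetical):

import torchaudio

# load a local recording and resample to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("speech.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)
input_values = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt").input_values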
Training Details
Training Data
The training data comes from two key sources:
TIMIT: A widely-used dataset for phonetic transcription, providing a standard benchmark in speech research.
LnNor: A multilingual dataset of high-quality speech recordings in Norwegian, English, and Polish, compiled from non-native speakers with varying levels of language proficiency. The phoneme annotations in LnNor were generated with the WebMAUS tool, so they represent canonical phonemes rather than the actual pronunciations typical of spontaneous speech or non-native accents.
Training Procedure
The original, pre-trained encoder representations were preserved: the encoder was kept frozen during fine-tuning to minimize training time and resource consumption. The model was trained with CTC loss and the AdamW optimizer, with no learning rate scheduler.
Preprocessing [optional]
The training dataset was filtered. Recordings shorter than 2 seconds or longer than 30 seconds were removed. Any labels consisting of fewer than 5 phonemes were discarded.
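A minimal sketch of this filtering with the datasets library, assuming the data is loaded as a datasets.Dataset with "audio" and "phonemes" columns (these column names are assumptions, not the actual training schema):

def keep_example(example):
    # duration in seconds, derived from the raw waveform
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    # keep 2-30 s recordings whose label has at least 5 phonemes
    return 2.0 <= duration <= 30.0 and len(example["phonemes"]) >= 5

dataset = dataset.filter(keep_example)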
Training Hyperparameters
Training regime:
- learning rate: 1e-5
- optimizer: AdamW
- batch size: 64
- weight decay: 0.001
- epochs: 40
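The exact training loop is not published; the following sketch only mirrors the listed setup (frozen encoder, AdamW, no scheduler). Here freeze_feature_encoder() freezes the convolutional feature encoder, which is one reading of "frozen encoder" above:

import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
model.freeze_feature_encoder()  # keep the pre-trained encoder representations fixed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.001)
# training then runs for 40 epochs with CTC loss, batch size 64, and no LR scheduler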
Speeds, Sizes, Times [optional]
- Avg epoch training time: 650s
- Number of updates: ~25k
- Final training loss: 0.09713
- Final validation loss: 0.2142
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on TIMIT's test split.
Metrics
PER (Phoneme Error Rate), computed as the character error rate (CER) over IPA transcriptions.
Results
PER (Phoneme Error Rate) on TIMIT's test split: 0.0416
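PER can be reproduced as a character error rate over IPA strings. A minimal sketch with the Hugging Face evaluate library (assumed installed; the toy strings below are placeholders for the decoded model output and the gold transcriptions):

import evaluate

cer = evaluate.load("cer")
per = cer.compute(predictions=["həloʊ wɝld"], references=["hɛloʊ wɝld"])
print(per)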
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA A100 (80 GB)
- Hours used: [More Information Needed]
- Cloud Provider: Poznan University of Technology
- Compute Region: Poland
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
Transformer-based wav2vec2 model trained with a CTC objective.
Compute Infrastructure
Hardware
2 x NVIDIA A100 (80 GB)
Software
python 3.12
transformers 4.50.0
torch 2.6.0
Citation [optional]
BibTeX:
If you use the LnNor dataset for research, cite these papers:
@article{magdalena2024lnnor,
title={The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish (Part 1)},
author={Wrembel, Magdalena and Hwaszcz, Krzysztof and Pludra, Agnieszka and Ska{\l}ba, Anna and Weckwerth, Jaros{\l}aw and Walczak, Angelika and Sypia{\'n}ska, Jolanta and {\.Z}ychli{\'n}ski, Sylwiusz and Malarski, Kamil and K{\k{e}}dzierska, Hanna and others},
year={2024},
publisher={Adam Mickiewicz University}
}
@article{wrembel2024lnnor,
title={The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish--Part 2},
author={Wrembel, Magdalena and Hwaszcz, Krzysztof and Pludra, Agnieszka and Ska{\l}ba, Anna and Weckwerth, Jaros{\l}aw and Malarski, Kamil and Cal, Zuzanna Ewa and K{\k{e}}dzierska, Hanna and Czarnecki-Verner, Tristan and Balas, Anna and others},
year={2024},
publisher={Adam Mickiewicz University}
}
Model Card Authors [optional]
Agnieszka Pludra
Izabela Krysińska
Piotr Kabaciński
Model Card Contact
[More Information Needed]