accent-id-distilhubert-finetuned-l2-arctic2

This model is a fine-tuned version of ntu-spml/distilhubert, trained on 50% of the L2-ARCTIC dataset (https://psi.engr.tamu.edu/l2-arctic-corpus/). It achieves the following results on the evaluation set:

  • Loss: 0.0004
  • Accuracy: 1.0

Model description

The goal of this project is to build an accent classifier for people who learned English as a second language. A pretrained speech model was fine-tuned to classify the accents of 24 English speakers whose first language is Hindi, Korean, Arabic, Vietnamese, Spanish, or Mandarin.

How to use this model on an audio file

from huggingface_hub import notebook_login
notebook_login()  # log in to the Hugging Face Hub (needed in a notebook for authenticated access)

from transformers import pipeline
pipe = pipeline("audio-classification", model="kaysrubio/accent-id-distilhubert-finetuned-l2-arctic2")

import torch
import torchaudio

audio, sr = torchaudio.load('path_to_file/audio.wav')  # returns a tensor of shape (channels, samples)
audio = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(audio)  # the model expects 16 kHz audio
audio = audio.mean(dim=0).numpy()  # downmix to mono in case the file is stereo

result = pipe(audio, top_k=6)

print(result)
print(f"First language of this speaker is predicted to be {result[0]['label']} with {result[0]['score']*100:.1f}% confidence")

Intended uses & limitations

The model is very accurate on novel recordings from the original dataset that were not used for training or testing. However, it is not accurate on voices from outside the dataset. Unfortunately, with only 24 speakers represented, the model appears to have memorized characteristics of these particular voices beyond accent, so it does not generalize well to the real world.
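
One way to probe this limitation is to hold out entire speakers, rather than individual files, when splitting the data, so the test set contains only unseen voices. The sketch below is a minimal illustration, not the procedure used for this model; it assumes a hypothetical list of (filepath, accent_label) pairs laid out so that each file's parent directory is the speaker ID.

import os
import random

def split_by_speaker(examples, test_speakers_per_accent=1, seed=42):
    # Group speaker IDs by accent label (assumes parent directory is the speaker ID)
    rng = random.Random(seed)
    speakers_by_accent = {}
    for path, label in examples:
        speaker = os.path.basename(os.path.dirname(path))
        speakers_by_accent.setdefault(label, set()).add(speaker)
    # Hold out whole speakers per accent so test voices never appear in training
    held_out = set()
    for label, speakers in speakers_by_accent.items():
        held_out.update(rng.sample(sorted(speakers), test_speakers_per_accent))
    train = [(p, l) for p, l in examples if os.path.basename(os.path.dirname(p)) not in held_out]
    test = [(p, l) for p, l in examples if os.path.basename(os.path.dirname(p)) in held_out]
    return train, test

A model that scores well under this kind of split is learning accent rather than speaker identity.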

Training and evaluation data

The L2-ARCTIC data is ~8 GB and is distributed via email. It includes approximately 24-30 hours of recordings in which 24 speakers read passages in English. The first languages of the speakers are Arabic, Hindi, Korean, Mandarin, Spanish, and Vietnamese, with two women and two men in each language group. For this model, 50% of the L2-ARCTIC data was used (half the files from each speaker), which was then split 90/10 for train/test.
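
As a rough sketch of that 90/10 split with the datasets library (the file list below is a hypothetical placeholder, not the actual corpus paths):

from datasets import Dataset, Audio

# Hypothetical file list: half of each speaker's recordings, paired with accent labels
files = [
    ('l2arctic/ABA/wav/arctic_a0001.wav', 'Arabic'),
    ('l2arctic/HKK/wav/arctic_a0001.wav', 'Korean'),
    # ...and so on for the rest of the kept files
]

ds = Dataset.from_dict({
    'audio': [path for path, _ in files],
    'label': [label for _, label in files],
}).cast_column('audio', Audio(sampling_rate=16000))  # decode and resample to 16 kHz on access

splits = ds.train_test_split(test_size=0.1, seed=42)  # the 90/10 train/test split
train_ds, test_ds = splits['train'], splits['test']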

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto TrainingArguments follows the list):

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 10
  • mixed_precision_training: Native AMP
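
For reference, here is a minimal sketch of how these values map onto transformers.TrainingArguments; the output directory is a hypothetical placeholder, and the actual training script may differ in details not shown on this card.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='accent-id-distilhubert-finetuned-l2-arctic2',  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim='adamw_torch',        # AdamW with betas=(0.9, 0.999), epsilon=1e-08
    lr_scheduler_type='linear',
    warmup_ratio=0.1,
    num_train_epochs=10,
    fp16=True,                  # mixed precision training (native AMP)
)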

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| 0.5216        | 1.0   | 196  | 0.4383          | 1.0      |
| 0.0106        | 2.0   | 392  | 0.0067          | 1.0      |
| 0.0038        | 3.0   | 588  | 0.0024          | 1.0      |
| 0.0021        | 4.0   | 784  | 0.0013          | 1.0      |
| 0.0014        | 5.0   | 980  | 0.0009          | 1.0      |
| 0.0011        | 6.0   | 1176 | 0.0007          | 1.0      |
| 0.0009        | 7.0   | 1372 | 0.0006          | 1.0      |
| 0.0008        | 8.0   | 1568 | 0.0005          | 1.0      |
| 0.0007        | 9.0   | 1764 | 0.0004          | 1.0      |
| 0.0007        | 10.0  | 1960 | 0.0004          | 1.0      |
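
The accuracy column appears to be standard classification accuracy over the evaluation split. As a hedged sketch (the wiring below is an assumption, not taken from the training script), such a metric is commonly computed with the evaluate library:

import numpy as np
import evaluate

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring accent per example
    return accuracy.compute(predictions=predictions, references=labels)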

Framework versions

  • Transformers 4.48.3
  • PyTorch 2.5.1+cu124
  • Datasets 3.3.2
  • Tokenizers 0.21.0