Speech Emotion Recognition Model
Wav2Vec2-Large-Robust model fine-tuned on the MSP-Podcast (v1.11) dataset for classifying emotions into four categories: Anger (A), Happiness (H), Neutral (N), and Sadness (S).
Installation
To use the model, install autrainer, e.g., via pip:
pip install autrainer
Usage
The model can be applied to all audio files in a folder (<data-root>) and stores the predictions in another folder (<output-root>):
autrainer inference hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo <data-root> <output-root>
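If the command needs to be launched from a script, one option is to wrap it with Python's standard subprocess module. The following is only a minimal sketch: the helper names build_inference_command and run_inference and the example folder names are hypothetical, and the wrapper simply shells out to the command shown above rather than using any autrainer Python API.

```python
import subprocess

# Model identifier exactly as given in the usage command above.
MODEL = "hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo"


def build_inference_command(data_root, output_root):
    """Assemble the CLI call as an argument list (hypothetical helper)."""
    return ["autrainer", "inference", MODEL, str(data_root), str(output_root)]


def run_inference(data_root, output_root):
    """Run the CLI; raises CalledProcessError if the command fails."""
    subprocess.run(build_inference_command(data_root, output_root), check=True)


# Example folder names are placeholders, not paths from the model card.
cmd = build_inference_command("audio/", "predictions/")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces.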
Training
Pretraining
The model was originally trained by audEERING on the MSP-Podcast (v1.7) dataset to predict three emotional dimensions: arousal, dominance, and valence.
Dataset
The model was further fine-tuned on the MSP-Podcast (v1.11) dataset, a large corpus of spontaneous emotional speech collected from various podcast recordings. The dataset includes natural emotional expressions which cover a broad range of speakers, recording conditions, and conversation topics.
Note: The MSP-Podcast dataset is not yet included in the autrainer 0.5.0 release but can be found in this Pull Request.
Training Process
The model was fine-tuned for 5 epochs.
At the end of each epoch, the model was evaluated on the validation set.
We release the state that achieved the best performance on this validation set.
All training hyperparameters can be found in the main configuration file (conf/config.yaml).
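The checkpoint selection described in this section (evaluate after every epoch, keep the best state) can be illustrated with a short sketch. It is not the autrainer implementation: the function name select_best_epoch, the per-epoch scores, and the tie-breaking rule (earliest epoch wins) are all assumptions made for illustration, and the selection metric is likewise assumed to be one where higher is better.

```python
def select_best_epoch(val_scores):
    """Return (epoch, score) for the highest validation score.

    val_scores maps epoch number to a validation metric where higher is
    better. Ties go to the earliest epoch (an assumption of this sketch).
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    # Iterating epochs in ascending order makes max() return the earliest
    # epoch among those sharing the top score.
    best_epoch = max(sorted(val_scores), key=lambda epoch: val_scores[epoch])
    return best_epoch, val_scores[best_epoch]


# Hypothetical validation scores for 5 epochs (not the real training log).
val_uar = {1: 0.58, 2: 0.61, 3: 0.64, 4: 0.63, 5: 0.64}
best_epoch, best_score = select_best_epoch(val_uar)
print(best_epoch, best_score)
```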
Evaluation
We evaluate the model on the Test1 split of the MSP-Podcast dataset.
The model achieves an unweighted average recall (UAR) of 0.650 on this split.
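Unweighted average recall is the mean of the per-class recalls, so each of the four emotion classes contributes equally regardless of how many samples it has. The sketch below shows the computation on made-up labels; the function name and the toy data are illustrative assumptions and are not taken from the MSP-Podcast evaluation.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels using the four class codes (A, H, N, S); values are made up.
y_true = ["A", "A", "H", "N", "N", "N", "N", "S"]
y_pred = ["A", "N", "H", "N", "N", "N", "H", "S"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.4f} accuracy={accuracy:.4f}")
```

On imbalanced data the two numbers can differ noticeably, which is why the UAR and the plain accuracy are reported separately.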
Acknowledgements
Please acknowledge the work which produced the original model and the MSP-Podcast dataset. We would also appreciate an acknowledgment to autrainer.
Evaluation results
- Accuracy (self-reported): 0.617
- F1 (self-reported): 0.572
- Unweighted Average Recall (self-reported): 0.650