Speech Emotion Recognition Model

Wav2Vec2-Large-Robust model fine-tuned on the MSP-Podcast (v1.11) dataset for classifying emotions into four categories: Anger (A), Happiness (H), Neutral (N), and Sadness (S).

Installation

To use the model, install autrainer, e.g., via pip:

pip install autrainer

Usage

The model can be applied to all audio files in a folder (<data-root>); the predictions are stored in another folder (<output-root>):

autrainer inference hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo <data-root> <output-root>
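When inference is part of a larger Python pipeline, the same command can be launched from a script. The following is a minimal sketch that only wraps the CLI call shown above with the standard library; the helper names `build_inference_command` and `run_inference` are hypothetical and not part of autrainer.

```python
import subprocess

# Model identifier taken from the command shown in this section.
MODEL_ID = "hf:autrainer/msp-podcast-emo-class-big4-w2v2-l-emo"


def build_inference_command(data_root, output_root, model_id=MODEL_ID):
    """Assemble the documented CLI call as an argument list (hypothetical helper)."""
    return ["autrainer", "inference", model_id, str(data_root), str(output_root)]


def run_inference(data_root, output_root):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_inference_command(data_root, output_root), check=True)
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces.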

Training

Pretraining

The model was originally trained by audEERING on the MSP-Podcast (v1.7) dataset to predict three emotional dimensions: arousal, dominance, and valence.

Dataset

The model was further fine-tuned on the MSP-Podcast (v1.11) dataset, a large corpus of spontaneous emotional speech collected from podcast recordings. The dataset contains natural emotional expressions that cover a broad range of speakers, recording conditions, and conversation topics.

Note: The MSP-Podcast dataset is not yet included in the autrainer 0.5.0 release but can be found in this Pull Request.

Training Process

The model was fine-tuned for 5 epochs and evaluated on the validation set at the end of each epoch. We release the state that achieved the best performance on the validation set. All training hyperparameters can be found in the main configuration file (conf/config.yaml).
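The selection of the released state can be illustrated with a small sketch. It assumes a single validation score per epoch where higher is better; the function name `select_best_epoch` and the scores below are made up for illustration and are not taken from the actual training run or from autrainer.

```python
def select_best_epoch(val_scores):
    """Return the epoch with the highest validation score.

    val_scores maps epoch number -> validation metric (higher is better).
    On ties, the earliest epoch in insertion order is returned.
    """
    if not val_scores:
        raise ValueError("no validation scores recorded")
    return max(val_scores, key=val_scores.get)


# Illustrative values only, NOT results from the real training run.
example_scores = {1: 0.50, 2: 0.60, 3: 0.70, 4: 0.65, 5: 0.62}
best_epoch = select_best_epoch(example_scores)
```

With these example values, the state saved after epoch 3 would be the one released, even though training continued for two more epochs.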

Evaluation

We evaluate the model on the Test1 split of the MSP-Podcast dataset. The model achieves an unweighted average recall (UAR) of 0.650 on this split.
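Unweighted average recall is the mean of the per-class recalls, so every emotion class counts equally regardless of how many samples it has. The sketch below computes it in plain Python; the function name and the toy labels are illustrative assumptions and do not reproduce the evaluation code or data used for the reported score.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over all classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels using the four class abbreviations (A, H, N, S); not real data.
y_true = ["A", "A", "H", "N", "N", "N", "N", "S"]
y_pred = ["A", "N", "H", "N", "N", "N", "A", "S"]
uar = unweighted_average_recall(y_true, y_pred)
```

In this toy example the per-class recalls are 0.5 (A), 1.0 (H), 0.75 (N), and 1.0 (S), giving a UAR of 0.8125, while plain accuracy would be 0.75. The two differ whenever the classes are imbalanced.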

Acknowledgements

Please acknowledge the work that produced the original model and the MSP-Podcast dataset. We would also appreciate an acknowledgement of autrainer.
