arxiv:2207.10333

Jointly Predicting Emotion, Age, and Country Using Pre-Trained Acoustic Embedding

Published on Jul 21, 2022

Authors:

Abstract

In this paper, we demonstrate the benefit of using a pre-trained model to extract acoustic embeddings for jointly predicting three tasks through multitask learning: emotion, age, and native country. The pre-trained model was based on the wav2vec 2.0 large robust model and trained on a speech emotion corpus. The emotion and age tasks were regression problems, while country prediction was a classification task. A single harmonic mean of the three task metrics was used to evaluate multitask performance. The classifier was a linear network with two independent layers and shared layers, including the output layers. This study explores multitask learning across different acoustic features (including the acoustic embedding extracted from a model trained on an affective speech dataset), random seeds, batch sizes, and normalization methods for predicting paralinguistic information from speech.
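As an illustration of the evaluation idea, combining three per-task metrics into a single harmonic mean can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: the abstract does not name the individual metrics, so the function `harmonic_mean`, the variable names, and the example values are all hypothetical, and each score is assumed to be scaled so that higher is better.

```python
def harmonic_mean(scores):
    """Return the harmonic mean of a list of scores.

    Assumption: every score is scaled so that higher is better.
    If any score is zero or negative, the combined score is 0.0,
    so one failed task cannot be hidden by the other two.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical per-task scores (illustrative values, not results from the paper).
emotion_score = 0.60   # regression task
age_score = 0.30       # regression task
country_score = 0.45   # classification task

combined = harmonic_mean([emotion_score, age_score, country_score])
print(f"combined multitask score: {combined:.3f}")
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the weakest of the three scores, so a model cannot reach a high combined score by doing well on only one or two tasks.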
