|
--- |
|
license: openrail |
|
language: |
|
- en |
|
metrics: |
|
- f1 |
|
library_name: fairseq |
|
pipeline_tag: audio-classification |
|
--- |
|
# Model Card for wav2vec 2.0 (W2V2) Pretrained on LittleBeats/LENA Home Recordings
|
|
|
|
|
|
We explore the benefits of unsupervised pretraining of wav2vec 2.0 (W2V2) on large-scale unlabeled home recordings collected with LittleBeats and LENA (Language Environment Analysis) devices.

LittleBeats (LB) is a new infant wearable multi-modal device that we developed; it simultaneously records audio, infant movement, and heart-rate variability.

We use W2V2 to advance the LB audio pipeline so that it automatically provides reliable speaker diarization and vocalization classification labels for family members, including infants, parents, and siblings, at home.

We show that W2V2 pretrained on thousands of hours of large-scale unlabeled home audio outperforms the oracle W2V2, pretrained on the 52k hours of audio released by Facebook/Meta, on automatic family audio analysis tasks.
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
|
Two versions of the pretrained W2V2 model are available; a sketch for downloading a checkpoint follows the list:

- **LB1100/checkpoint_best.pt** pretrained on 1,100 hours of LB home recordings collected from 110 families of children under 5 years old

- **LL4300/checkpoint_best.pt** pretrained on 1,100 hours of LB home recordings from those 110 families plus 3,200 hours of LENA home recordings from 275 families of children under 5 years old
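
Both checkpoints can be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the `repo_id` is a placeholder for this model's repository.

```python
# Minimal sketch: download a pretrained checkpoint from the Hugging Face Hub.
# The repo_id is a placeholder; substitute this model's actual repository id.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<user>/<this-model-repo>",    # hypothetical repo id
    filename="LL4300/checkpoint_best.pt",  # or "LB1100/checkpoint_best.pt"
)
print(ckpt_path)  # local filesystem path to the cached checkpoint
```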
|
|
|
## Model Sources
|
|
|
|
For more information regarding this model, please check out our paper:

- **Paper:** [More Information Needed]
|
|
|
# Uses |
|
We developed a fine-tuning recipe using the SpeechBrain toolkit, available at:
|
|
|
- **Repository:** https://github.com/jialuli3/speechbrain/tree/infant-voc-classification/recipes/wav2vec_kic |
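
As a rough illustration of how a pretrained checkpoint plugs into a SpeechBrain recipe, the sketch below uses SpeechBrain's fairseq wav2vec 2.0 wrapper; the paths are placeholders, and the actual training entry point and hyperparameters live in the repository above.

```python
# Minimal sketch, assuming SpeechBrain's fairseq wav2vec 2.0 wrapper
# (speechbrain.lobes.models.fairseq_wav2vec) and a local checkpoint path.
import torch
from speechbrain.lobes.models.fairseq_wav2vec import FairseqWav2Vec2

encoder = FairseqWav2Vec2(
    pretrained_path="LL4300/checkpoint_best.pt",  # hypothetical local path
    save_path="./save/wav2vec2_ckpt.pt",
    freeze=False,  # unfreeze the encoder so it is fine-tuned with the task head
)

wav = torch.randn(1, 16000)  # 1 second of dummy audio at 16 kHz
feats = encoder(wav)         # contextual features: (batch, frames, hidden_dim)
```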
|
|
|
|
|
## Quick Start
|
|
|
|
If you wish to use the fairseq framework, a snippet along the lines below can be used to load the pretrained model. This is a minimal sketch, assuming fairseq's `checkpoint_utils` API and a locally downloaded checkpoint (the path is a placeholder):
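
```python
# Minimal sketch: load a pretrained W2V2 checkpoint with fairseq and
# extract contextual features from a dummy waveform. Paths are placeholders.
import torch
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["LL4300/checkpoint_best.pt"]
)
model = models[0]
model.eval()

wav = torch.randn(1, 16000)  # 1 second of 16 kHz mono audio
with torch.no_grad():
    out = model.extract_features(source=wav, padding_mask=None)
print(out["x"].shape)  # (batch, frames, hidden_dim)
```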
|
|
|
# Evaluation |
|
|
|
|
We compare four W2V2-base models:

- **base (oracle version):** the originally released version, pretrained on ~52k hours of unlabeled audio

- **Libri960h:** the oracle version fine-tuned on 960 hours of LibriSpeech

- **LB1100h:** W2V2 pretrained on 1,100 hours of LB home recordings

- **LL4300h:** W2V2 pretrained on 4,300 hours of LB+LENA home recordings
|
We then fine-tune the pretrained models on 11.7 hours of labeled LB home recordings. The F1 scores across the three tasks are:

[More Information Needed]
|
|
|
|
|
# Citation |
|
|
|
|
If you find this model helpful, please cite us as:
|
**BibTeX:** |
|
|
|
# Model Card Contact |
|
Jialu Li (she/her/hers)

Ph.D. candidate, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
|
E-mail: [email protected] |
|
Homepage: https://sites.google.com/view/jialuli/ |