Whispered TIA
Whispered TIA is a fine-tuned ASR model based on Whisper. It is adapted to
TIA (Totally Integrated Automation) from Siemens AG and is able to recognize domain-specific words and transcribe them correctly.
This model card serves as an overview of the twelve fine-tuned models; detailed information can be found in the cards of the individual models. This card ("Whispered_TIA_all") also includes the results for the pre-trained Whisper model. The best model achieves a WER of 1.59%.
The model names are structured as follows:
`whispered_TIA_MODELSIZE_FINETUNING_DATASET`
where MODELSIZE is the Whisper model size, FINETUNING is the fine-tuning method, and DATASET is the training dataset (normal or nosil).
Base Model Whisper
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found at https://github.com/openai/whisper.
Training Results
The False HallucER metric indicates how many hallucinations and deletions a model produced; the sub-rows below each fine-tuning report how often the predictions were longer than, shorter than, or equal in length to the references.
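As an illustration of how WER values like the ones below can be computed (this is not the evaluation code used for these models), the jiwer library calculates WER from reference and predicted transcripts; the strings here are made up:

```python
import jiwer

# Made-up example transcripts; the real evaluation uses the test split
references = ["open the tia portal and compile the project"]
predictions = ["open the tia portal and compile the project"]

# jiwer computes WER as (substitutions + deletions + insertions) / reference words
wer = jiwer.wer(references, predictions)
print(f"WER: {wer * 100:.2f}%")
```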
Normal Dataset
| Fine-Tuning | WER (%) | False HallucER | Runtime | Batch Size | Memory Usage |
|---|---|---|---|---|---|
| pre-trained | 3.91 | | | | |
| standard fine-tuning | 1.59 | 545.30 | 1.83 | 32 | 20407 |
| | | Predictions > References: 31% | | | |
| | | Predictions < References: 32% | | | |
| | | Predictions = References: 36% | | | |
| encoder freezing | 1.60 | 280.30 | 1.60 | 64 | 20053 |
| | | Predictions > References: 31% | | | |
| | | Predictions < References: 34% | | | |
| | | Predictions = References: 35% | | | |
| adaptive tokenization | 1.63 | 394.65 | 1.78 | 32 | 20403 |
| | | Predictions > References: 36% | | | |
| | | Predictions < References: 30% | | | |
| | | Predictions = References: 34% | | | |
| adaptive tokenization + encoder freezing | 1.60 | 499.76 | 1.72 | 64 | 20049 |
| | | Predictions > References: 34% | | | |
| | | Predictions < References: 30% | | | |
| | | Predictions = References: 35% | | | |
Nosil Dataset
| Fine-Tuning | WER (%) | False HallucER | Runtime | Batch Size | Memory Usage |
|---|---|---|---|---|---|
| pre-trained | 3.62 | | | | |
| standard fine-tuning | 1.68 | 248.59 | 1.75 | 32 | 20407 |
| | | Predictions > References: 32% | | | |
| | | Predictions < References: 34% | | | |
| | | Predictions = References: 34% | | | |
| encoder freezing | 1.68 | 901.25 | 1.66 | 64 | 20053 |
| | | Predictions > References: 33% | | | |
| | | Predictions < References: 36% | | | |
| | | Predictions = References: 31% | | | |
| adaptive tokenization | 1.66 | 287.03 | 1.75 | 32 | 20403 |
| | | Predictions > References: 32% | | | |
| | | Predictions < References: 33% | | | |
| | | Predictions = References: 35% | | | |
| adaptive tokenization + encoder freezing | 1.76 | 1034.76 | 1.78 | 64 | 20049 |
| | | Predictions > References: 34% | | | |
| | | Predictions < References: 34% | | | |
| | | Predictions = References: 32% | | | |
Dataset
The presented models are trained on two datasets containing .MP3 files:
- First dataset: normal
- Second dataset: nosil
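A minimal sketch of how such .MP3 files could be prepared for Whisper with the Hugging Face datasets library (the directory path is hypothetical and not part of this card):

```python
from datasets import Audio, load_dataset

# Hypothetical local folder holding the .MP3 files of one dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/normal")

# Decode each .MP3 and resample it to the 16 kHz rate Whisper expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
print(dataset["train"][0]["audio"]["array"].shape)
```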
Inference
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Path to the audio file to transcribe
file = "/path/to/audio"

# Load the audio and resample it to the 16 kHz expected by Whisper
arr, sampling_rate = librosa.load(file, sr=16000)

# Load the Whisper processor and the fine-tuned model
processor = WhisperProcessor.from_pretrained("openai/whisper-MODELSIZE")
model = WhisperForConditionalGeneration.from_pretrained("masters-thesis-vm/MODELNAME")

# Preprocessing: convert the waveform to log-Mel spectrogram input features
input_features = processor(arr, return_tensors="pt", sampling_rate=sampling_rate).input_features

# Prediction: force English transcription and decode the generated token ids
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
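`batch_decode` returns a list with one string per input, so for a single audio file the transcribed text is in `transcription[0]`.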