---
language: en
license: mit
tags:
- audio
- automatic-speech-recognition
- whisper
- atc
- aviation
datasets:
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
model-index:
- name: whisper-large-v3-turbo-atcosim-finetune
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: jlvdoorn/atco2-asr-atcosim
      name: ATCOSIM
    metrics:
    - type: wer
      value: 3.73
      name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
- example_title: ATC Sample 1
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
- example_title: ATC Sample 2
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
- example_title: ATC Sample 3
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---
# Whisper Large V3 Turbo: Fine-tuned for ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized specifically for transcribing Air Traffic Control (ATC) communications.
The model was fine-tuned on the ATCOSIM corpus, which contains controller speech recorded during real-time ATC simulations.
## Intended Use

This model is designed for:
- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management
## Training Methodology

The model was fine-tuned using a partial freezing approach to balance efficiency and adaptability (see the sketch after this list):
- First 24 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- Later encoder layers and decoder were fine-tuned
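
The training script itself is not published with this card, but a minimal sketch of how this freezing scheme can be expressed with `transformers`, assuming the standard `WhisperForConditionalGeneration` module layout (`encoder.conv1`, `encoder.conv2`, `encoder.embed_positions`, `encoder.layers`), looks like this:

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
encoder = model.model.encoder

# Freeze the convolution front-end and the positional embeddings
for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 of the 32 encoder layers; the remaining encoder
# layers and the entire decoder stay trainable
for layer in encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False
```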
Training hyperparameters (a configuration sketch follows the list):
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)
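
Interpreted as a `Seq2SeqTrainingArguments` configuration, these settings would look roughly as follows. This is a reconstruction, not the published training script; the `output_dir`, the evaluation cadence, and the `predict_with_generate` flag are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # assumed path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",       # "evaluation_strategy" on older transformers releases
    eval_steps=1000,             # matches the 1000-step cadence in the metrics table below
    predict_with_generate=True,  # needed to compute WER from generated text
)
```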

## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in:
- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology
### Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|-----:|--------------:|----------------:|--------:|
| 1000 | 0.090100 | 0.081074 | 5.81697 |
| 2000 | 0.021100 | 0.080030 | 4.00939 |
| 3000 | 0.010000 | 0.080892 | 5.67438 |
| 4000 | 0.002500 | 0.080460 | 3.88357 |
| 5000 | 0.001400 | 0.080753 | 3.73678 |
The final model achieves a Word Error Rate (WER) of 3.74% (3.73678). The validation WER fluctuates between checkpoints rather than falling monotonically, but the overall trend shows clear improvement, and the final checkpoint performs strongly on ATC communications.
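
For reference, WER as reported here can be computed with the `evaluate` library. A minimal sketch; the transcript strings below are hypothetical examples in ATCOSIM-style phraseology, not drawn from the dataset:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical prediction/reference pair
predictions = ["lufthansa three two one descend to flight level one zero zero"]
references = ["lufthansa three two one descend flight level one zero zero"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")  # one insertion over ten reference words -> 10.00%
```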

## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to numpy array
waveform_np = waveform.squeeze().cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)  # Explicit device and dtype setting
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Method 1: Using the processor directly (recommended for precise control)
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)
generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using a pipeline with the preprocessed audio
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)
result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

### Important Notes

- Always ensure audio is resampled to 16 kHz before processing
- Explicitly set both device and dtype when using a GPU: `model.to(device=device, dtype=torch_dtype)`
- For processing longer audio files, use the `chunk_length_s` parameter
- The model performs best on clean ATC communications with standard phraseology

## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes (see the sketch after this list):
- Audio-to-text transcription (this model)
- Domain-specific text reformatting using contextual knowledge
- Congestion analysis based on transcribed communications
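
As an illustration only, the stages could be wired together as below; `reformat_atc_text` and `analyze_congestion` are hypothetical placeholders, not published components of this project:

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
)

def reformat_atc_text(text: str) -> str:
    # Hypothetical stage 2: domain-specific normalization (e.g., callsign formatting)
    return text.strip().upper()

def analyze_congestion(transcripts: list[str]) -> int:
    # Hypothetical stage 3: a trivial congestion proxy, transmissions per time window
    return len(transcripts)

transcript = transcriber("path_to_atc_audio.wav")["text"]  # Stage 1: this model
formatted = reformat_atc_text(transcript)                  # Stage 2
congestion = analyze_congestion([formatted])               # Stage 3
```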

## Citation

If you use this model in your research, please cite:
```bibtex
@misc{ta-chun_lin_2025,
  author    = {Ta-Chun Lin},
  title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
  year      = 2025,
  url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
  doi       = {10.57967/hf/5272},
  publisher = {Hugging Face}
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The creators of the ATCOSIM corpus for providing high-quality ATC speech data
- The open-source community for the tools and frameworks that made this fine-tuning possible