---
language: en
license: mit
tags:
  - audio
  - automatic-speech-recognition
  - whisper
  - atc
  - aviation
datasets:
  - jlvdoorn/atco2-asr-atcosim
metrics:
  - wer
model-index:
  - name: whisper-large-v3-turbo-atcosim-finetune
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: jlvdoorn/atco2-asr-atcosim
          name: ATCOSIM
        metrics:
          - type: wer
            value: 3.73
            name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
  - example_title: ATC Sample 1
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
  - example_title: ATC Sample 2
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
  - example_title: ATC Sample 3
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---

[DOI: 10.57967/hf/5272](https://doi.org/10.57967/hf/5272)

# Whisper Large V3 Turbo: Fine-tuned for ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized specifically for transcribing Air Traffic Control (ATC) communications.

It was fine-tuned on the ATCOSIM corpus (via the jlvdoorn/atco2-asr-atcosim dataset), which contains controller speech recorded during real-time ATC simulations.

## Intended Use

This model is designed for:

- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management

## Training Methodology

The model was fine-tuned using a partial freezing approach to balance efficiency and adaptability (see the sketch after this list):

- The first 24 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- The later encoder layers and the decoder were fine-tuned
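
The exact training script is not published with this card; a minimal sketch of the freezing scheme above, using the attribute names of the current `transformers` Whisper implementation, might look like:

```python
from transformers import WhisperForConditionalGeneration

# Assumed starting point for fine-tuning: the base turbo checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
encoder = model.model.encoder

# Freeze the convolutional front-end and the positional embeddings
for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 of the 32 encoder layers; the remaining encoder
# layers and the entire decoder stay trainable
for layer in encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: report the trainable-parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```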

Training hyperparameters (a configuration sketch follows the list):

- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)
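
These settings map onto `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the author's published script: the output directory and evaluation cadence are assumptions (the 1000-step cadence is inferred from the metrics table below), and older `transformers` versions spell `eval_strategy` as `evaluation_strategy`.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # hypothetical path
    learning_rate=1e-5,
    max_steps=5000,
    warmup_steps=500,
    per_device_train_batch_size=16,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",       # assumed: evaluate every eval_steps
    eval_steps=1000,             # inferred from the 1000-step metric cadence
    predict_with_generate=True,  # generate text at eval time so WER can be computed
)
```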

## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in:

- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology

## Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|-----:|--------------:|----------------:|--------:|
| 1000 | 0.090100 | 0.081074 | 5.81697 |
| 2000 | 0.021100 | 0.080030 | 4.00939 |
| 3000 | 0.010000 | 0.080892 | 5.67438 |
| 4000 | 0.002500 | 0.080460 | 3.88357 |
| 5000 | 0.001400 | 0.080753 | 3.73678 |

The final checkpoint achieves a Word Error Rate (WER) of 3.73678%, the best of the evaluated checkpoints, demonstrating strong performance on ATC communications.
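
For reference, WER can be computed with the Hugging Face `evaluate` library. This is a sketch: the evaluation code used for this model is not published, and the sample strings below are invented ATC-style phrases.

```python
import evaluate

# Load the standard word-error-rate metric
wer_metric = evaluate.load("wer")

# Invented example pair: the hypothesis drops the trailing "five"
predictions = ["lufthansa four five six contact rhein radar one two seven decimal three"]
references = ["lufthansa four five six contact rhein radar one two seven decimal three five"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")  # one deletion over 13 reference words ≈ 7.69%
```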

## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe an audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to a numpy array
waveform_np = waveform.squeeze().cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)  # explicit device and dtype setting
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Method 1: Using the processor directly (recommended for precise control)
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using a pipeline with the preprocessed audio
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

## Important Notes

- Always ensure audio is resampled to 16 kHz before processing
- Explicitly set both device and dtype when using a GPU: `model.to(device=device, dtype=torch_dtype)`
- For processing longer audio files, use the `chunk_length_s` parameter
- The model performs best on clean ATC communications with standard phraseology

## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:

1. Audio-to-text transcription (this model)
2. Domain-specific text reformatting using contextual knowledge
3. Congestion analysis based on transcribed communications

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{ta-chun_lin_2025,
    author    = {Ta-Chun Lin},
    title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
    year      = 2025,
    url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
    doi       = {10.57967/hf/5272},
    publisher = {Hugging Face}
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The creators of the ATCOSIM corpus for high-quality ATC communications data
- The open-source community for the tools and frameworks that made this fine-tuning possible