---
language: en
license: mit
tags:
  - audio
  - automatic-speech-recognition
  - whisper
  - atc
  - aviation
datasets:
  - jlvdoorn/atco2-asr-atcosim
metrics:
  - wer
model-index:
  - name: whisper-large-v3-turbo-atcosim-finetune
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: jlvdoorn/atco2-asr-atcosim
          name: ATCOSIM
        metrics:
          - type: wer
            value: 3.73
            name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
  - example_title: ATC Sample 1
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
  - example_title: ATC Sample 2
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
  - example_title: ATC Sample 3
    src: >-
      https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---

[DOI: 10.57967/hf/5272](https://doi.org/10.57967/hf/5272)

# Whisper Large V3 Turbo: Fine-tuned for ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized specifically for transcribing Air Traffic Control (ATC) communications.

It was fine-tuned on the ATCOSIM corpus (via the jlvdoorn/atco2-asr-atcosim dataset), which contains controller speech recorded during real-time ATC simulations.

## Intended Use

This model is designed for:

- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management

## Training Methodology

The model was fine-tuned using a partial freezing approach to balance efficiency and adaptability (see the sketch after this list):

- The first 24 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- The later encoder layers and the decoder were fine-tuned
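
The exact training script is not published with this card; a minimal sketch of the freezing scheme above, using the attribute names of the current `transformers` Whisper implementation, might look like:

```python
from transformers import WhisperForConditionalGeneration

# Assumed starting point for fine-tuning: the base turbo checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
encoder = model.model.encoder

# Freeze the convolutional front-end and the positional embeddings
for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 of the 32 encoder layers; the remaining encoder
# layers and the entire decoder stay trainable
for layer in encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: report the trainable-parameter count
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```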

Training hyperparameters (a configuration sketch follows the list):

- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)
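
These settings map onto `Seq2SeqTrainingArguments` roughly as follows. This is a hedged sketch, not the author's published script: the output directory and evaluation cadence are assumptions (the 1000-step cadence is inferred from the metrics table below), and older `transformers` versions spell `eval_strategy` as `evaluation_strategy`.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # hypothetical path
    learning_rate=1e-5,
    max_steps=5000,
    warmup_steps=500,
    per_device_train_batch_size=16,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",       # assumed: evaluate every eval_steps
    eval_steps=1000,             # inferred from the 1000-step metric cadence
    predict_with_generate=True,  # generate text at eval time so WER can be computed
)
```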

## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in:

- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology

## Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|-----:|--------------:|----------------:|--------:|
| 1000 | 0.090100 | 0.081074 | 5.81697 |
| 2000 | 0.021100 | 0.080030 | 4.00939 |
| 3000 | 0.010000 | 0.080892 | 5.67438 |
| 4000 | 0.002500 | 0.080460 | 3.88357 |
| 5000 | 0.001400 | 0.080753 | 3.73678 |

The final checkpoint achieves a Word Error Rate (WER) of 3.73678%, the best of the evaluated checkpoints, demonstrating strong performance on ATC communications.
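
For reference, WER can be computed with the Hugging Face `evaluate` library. This is a sketch: the evaluation code used for this model is not published, and the sample strings below are invented ATC-style phrases.

```python
import evaluate

# Load the standard word-error-rate metric
wer_metric = evaluate.load("wer")

# Invented example pair: the hypothesis drops the trailing "five"
predictions = ["lufthansa four five six contact rhein radar one two seven decimal three"]
references = ["lufthansa four five six contact rhein radar one two seven decimal three five"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")  # one deletion over 13 reference words ≈ 7.69%
```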

## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe an audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, pipeline

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to a numpy array
waveform_np = waveform.squeeze().cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)  # explicit device and dtype setting
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Method 1: Using the processor directly (recommended for precise control)
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using a pipeline with the preprocessed audio
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

## Important Notes

- Always ensure audio is resampled to 16 kHz before processing
- Explicitly set both device and dtype when using a GPU: `model.to(device=device, dtype=torch_dtype)`
- For processing longer audio files, use the `chunk_length_s` parameter
- The model performs best on clean ATC communications with standard phraseology

## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:

1. Audio-to-text transcription (this model)
2. Domain-specific text reformatting using contextual knowledge
3. Congestion analysis based on transcribed communications

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{ta-chun_lin_2025,
    author    = {Ta-Chun Lin},
    title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
    year      = 2025,
    url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
    doi       = {10.57967/hf/5272},
    publisher = {Hugging Face}
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The creators of the ATCOSIM corpus for high-quality ATC communications data
- The open-source community for the tools and frameworks that made this fine-tuning possible