---
language: en
license: mit
tags:
- audio
- automatic-speech-recognition
- whisper
- atc
- aviation
datasets:
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
model-index:
- name: whisper-large-v3-turbo-atcosim-finetune
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: jlvdoorn/atco2-asr-atcosim
      name: ATCOSIM
    metrics:
    - type: wer
      value: 3.73
      name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
- example_title: ATC Sample 1
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
- example_title: ATC Sample 2
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
- example_title: ATC Sample 3
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---

[![DOI](https://img.shields.io/badge/DOI-10.57967%2Fhf%2F5272-blue)](https://doi.org/10.57967/hf/5272)

# Whisper Large V3 Turbo: Fine-tuned for the ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's [Whisper Large V3 Turbo](https://huggingface.co/openai/whisper-large-v3-turbo), optimized for transcribing Air Traffic Control (ATC) communications. It was fine-tuned on the [ATCO2-ASR-ATCOSIM dataset](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim), which combines the ATCO2 and ATCOSIM corpora of ATC communications.

## Intended Use

This model is designed for:
- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management

## Training Methodology

The model was fine-tuned using a partial-freezing approach to balance efficiency and adaptability (a code sketch follows the performance results below):
- The first 24 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- The remaining encoder layers and the decoder were fine-tuned

Training hyperparameters:
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)

## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular gains in:
- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology

### Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.090100      | 0.081074        | 5.81697 |
| 2000 | 0.021100      | 0.080030        | 4.00939 |
| 3000 | 0.010000      | 0.080892        | 5.67438 |
| 4000 | 0.002500      | 0.080460        | 3.88357 |
| 5000 | 0.001400      | 0.080753        | 3.73678 |

The final model achieves a Word Error Rate (WER) of 3.73678%, a substantial improvement over the course of training and strong performance on ATC communications.
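For reference, here is a minimal sketch of the partial-freezing setup and trainer configuration described under Training Methodology. It assumes the attribute names of the `transformers` Whisper implementation; the dataset preparation, data collator, and WER computation are omitted, and `output_dir` is a placeholder.

```python
import torch
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze the encoder's convolutional front-end and positional embeddings
for module in (model.model.encoder.conv1,
               model.model.encoder.conv2,
               model.model.encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 encoder layers; later encoder layers and the
# decoder remain trainable
for layer in model.model.encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False

# Trainer configuration restating the hyperparameters listed above;
# dataset, collator, and metric wiring are omitted from this sketch
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # placeholder
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    metric_for_best_model="wer",
)
```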
## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to numpy array
waveform_np = waveform.squeeze().cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)  # Explicit device and dtype setting
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Method 1: Using processor directly (recommended for precise control)
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using pipeline with preprocessed audio
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

### Important Notes

- Always ensure audio is resampled to 16kHz before processing
- Explicitly set both device and dtype when using a GPU with `model.to(device=device, dtype=torch_dtype)`
- For processing longer audio files, use the `chunk_length_s` parameter
- The model performs best on clean ATC communications with standard phraseology
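To sanity-check transcription quality on your own recordings, WER can be computed with the `evaluate` library. The snippet below is a minimal sketch that reuses the `transcriber` pipeline from the basic usage example; the audio path and reference transcript are placeholders, not data from this model card.

```python
# Requires: pip install evaluate jiwer
import evaluate

# Placeholder audio paths and their ground-truth transcripts
audio_paths = ["path_to_atc_audio.wav"]
references = ["lufthansa four five six contact rhein radar one two seven decimal three seven"]

# Transcribe with the pipeline from the basic usage example above
predictions = [transcriber(path)["text"] for path in audio_paths]

# Lower-case both sides so casing differences are not counted as errors
wer = evaluate.load("wer")
score = wer.compute(
    predictions=[p.lower() for p in predictions],
    references=[r.lower() for r in references],
)
print(f"WER: {score:.2%}")
```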
## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:

1. Audio-to-text transcription (this model)
2. Domain-specific text reformatting using contextual knowledge
3. Congestion analysis based on transcribed communications

## Citation

If you use this model in your research, please cite:

```
@misc{ta-chun_lin_2025,
  author    = {Ta-Chun Lin},
  title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
  year      = 2025,
  url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
  doi       = {10.57967/hf/5272},
  publisher = {Hugging Face}
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The ATCOSIM dataset for providing high-quality ATC communications data
- The open-source community for tools and frameworks that made this fine-tuning possible