---
language: en
license: mit
tags:
- audio
- automatic-speech-recognition
- whisper
- atc
- aviation
datasets:
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
model-index:
- name: whisper-large-v3-turbo-atcosim-finetune
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: jlvdoorn/atco2-asr-atcosim
      name: ATCOSIM
    metrics:
    - type: wer
      value: 3.73
      name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
- example_title: ATC Sample 1
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
- example_title: ATC Sample 2
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
- example_title: ATC Sample 3
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---

[![DOI](https://img.shields.io/badge/DOI-10.57967%2Fhf%2F5272-blue)](https://doi.org/10.57967/hf/5272)

# Whisper Large V3 Turbo: Fine-tuned for the ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's [Whisper Large V3 Turbo](https://huggingface.co/openai/whisper-large-v3-turbo), optimized for transcribing Air Traffic Control (ATC) communications. It was fine-tuned on the [ATCO2-ASR-ATCOSIM dataset](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim), which combines the ATCO2 and ATCOSIM corpora of ATC communications.

## Intended Use

This model is designed for:
- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management

## Training Methodology

The model was fine-tuned using a partial-freezing approach to balance efficiency and adaptability (a code sketch follows the performance results below):
- The first 24 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- The remaining encoder layers and the decoder were fine-tuned

Training hyperparameters:
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)

## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular gains in:
- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology

### Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.090100      | 0.081074        | 5.81697 |
| 2000 | 0.021100      | 0.080030        | 4.00939 |
| 3000 | 0.010000      | 0.080892        | 5.67438 |
| 4000 | 0.002500      | 0.080460        | 3.88357 |
| 5000 | 0.001400      | 0.080753        | 3.73678 |

The final model achieves a Word Error Rate (WER) of 3.73678%, a substantial improvement over the course of training and strong performance on ATC communications.
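For reference, here is a minimal sketch of the partial-freezing setup and trainer configuration described under Training Methodology. It assumes the attribute names of the `transformers` Whisper implementation; the dataset preparation, data collator, and WER computation are omitted, and `output_dir` is a placeholder.

```python
import torch
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze the encoder's convolutional front-end and positional embeddings
for module in (model.model.encoder.conv1,
               model.model.encoder.conv2,
               model.model.encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 encoder layers; later encoder layers and the
# decoder remain trainable
for layer in model.model.encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False

# Trainer configuration restating the hyperparameters listed above;
# dataset, collator, and metric wiring are omitted from this sketch
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # placeholder
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=5000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    metric_for_best_model="wer",
)
```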
## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to numpy array
waveform_np = waveform.squeeze().cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)  # Explicit device and dtype setting
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Method 1: Using processor directly (recommended for precise control)
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using pipeline with preprocessed audio
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

### Important Notes

- Always ensure audio is resampled to 16kHz before processing
- Explicitly set both device and dtype when using a GPU with `model.to(device=device, dtype=torch_dtype)`
- For processing longer audio files, use the `chunk_length_s` parameter
- The model performs best on clean ATC communications with standard phraseology
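To sanity-check transcription quality on your own recordings, WER can be computed with the `evaluate` library. The snippet below is a minimal sketch that reuses the `transcriber` pipeline from the basic usage example; the audio path and reference transcript are placeholders, not data from this model card.

```python
# Requires: pip install evaluate jiwer
import evaluate

# Placeholder audio paths and their ground-truth transcripts
audio_paths = ["path_to_atc_audio.wav"]
references = ["lufthansa four five six contact rhein radar one two seven decimal three seven"]

# Transcribe with the pipeline from the basic usage example above
predictions = [transcriber(path)["text"] for path in audio_paths]

# Lower-case both sides so casing differences are not counted as errors
wer = evaluate.load("wer")
score = wer.compute(
    predictions=[p.lower() for p in predictions],
    references=[r.lower() for r in references],
)
print(f"WER: {score:.2%}")
```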
## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:

1. Audio-to-text transcription (this model)
2. Domain-specific text reformatting using contextual knowledge
3. Congestion analysis based on transcribed communications

## Citation

If you use this model in your research, please cite:

```
@misc{ta-chun_lin_2025,
  author    = {Ta-Chun Lin},
  title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
  year      = 2025,
  url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
  doi       = {10.57967/hf/5272},
  publisher = {Hugging Face}
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The ATCOSIM dataset for providing high-quality ATC communications data
- The open-source community for tools and frameworks that made this fine-tuning possible