---
language:
- uz
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- audio-transcription
- uzbek
- fine-tuned
- speech-recognition
---

# NavaiSTT-1v Medium - Uzbek Speech-to-Text Model

A classic Whisper Medium model fine-tuned for the Uzbek language. The training dataset comprised roughly 700 hours of diverse audio: publicly available podcasts, Tashkent dialect podcasts, audiobooks, and Common Voice 17. Transcription quality was mixed, with 60% human-transcribed and 40% pseudo-transcribed using Gemini 2.5 Pro. Special attention was given to Tashkent dialect audio, so the model performs particularly well on that dialect. Future versions will add other regional dialects to improve overall coverage.

## Whitepaper

For more details on the methodology and research behind this model, see: https://uz-speech.web.app/navaistt01m

## Model Details

- **Base Model:** Whisper Medium
- **Parameters:** 769M
- **Performance:**
  - WER: ~13%
  - CER: ~3.5%

## Training Data

This model was fine-tuned on approximately 700 hours of diverse Uzbek audio, including:

- Publicly available podcasts
- Tashkent dialect podcasts
- Audiobooks
- The Common Voice 17 dataset

The dataset consisted of 60% human-transcribed and 40% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio materials to ensure strong performance on this dialect.

## Usage Example

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor, moving the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("islomov/navaistt_v1_medium")
model = WhisperForConditionalGeneration.from_pretrained("islomov/navaistt_v1_medium").to(device)


def transcribe_audio(audio_path):
    # Load the audio and resample to the 16 kHz rate Whisper expects
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Extract log-mel input features
    input_features = processor(
        waveform.squeeze(0).numpy(),
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Generate the transcription, forcing Uzbek output
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features, language="uz", task="transcribe"
        )

    # Decode token ids to text
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


# Example usage (a single clip should be at most 30 seconds)
if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"
    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
```

## Future Improvements

Future versions will include more regional Uzbek dialects to improve overall coverage.
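## Transcribing Longer Audio

Whisper operates on 30-second windows, which is why the usage example caps clips at 30 seconds. For longer recordings, one simple option is to split the waveform into 30-second chunks and transcribe them in sequence. The sketch below is a minimal, unofficial approach, not part of the released model: it reuses the `processor`, `model`, and `device` from the example above, and its naive chunk boundaries may split words mid-utterance.

```python
import torch
import torchaudio


def transcribe_long_audio(audio_path, chunk_seconds=30):
    # Load, resample to 16 kHz, and downmix to mono as in the example above
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Transcribe fixed-size chunks one at a time (naive boundaries)
    samples_per_chunk = chunk_seconds * 16000
    texts = []
    for start in range(0, waveform.shape[1], samples_per_chunk):
        chunk = waveform[:, start:start + samples_per_chunk]
        input_features = processor(
            chunk.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features.to(device)
        with torch.no_grad():
            predicted_ids = model.generate(
                input_features, language="uz", task="transcribe"
            )
        texts.append(
            processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        )
    return " ".join(texts)
```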
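## Evaluating Accuracy

To check the reported WER/CER figures on your own test data, both metrics can be computed from reference/hypothesis pairs. The sketch below uses the `jiwer` package, which is an assumption on our part; the card does not state which scorer produced the numbers above.

```python
import jiwer

# Ground-truth transcript for the audio file (placeholder text)
reference = "haqiqiy matn shu yerda"
# Model output from the transcribe_audio helper defined above
hypothesis = transcribe_audio("some_audio_max_30_sec.wav")

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
```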