NavaiSTT-1v Medium - Uzbek Speech-to-Text Model

A classic Whisper medium model fine-tuned for the Uzbek language. The training dataset comprised ~700 hours of diverse audio: publicly available podcasts, Tashkent dialect podcasts, audiobooks, and Common Voice 17. Transcription quality was mixed: 60% of the data was human-transcribed and 40% pseudo-transcribed using Gemini 2.5 Pro.

Special attention was given to Tashkent dialect audio materials, resulting in strong performance on this dialect. Future versions will include other regional dialects to improve overall coverage.

Whitepaper

For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/navaistt01m

Model Details

  • Base Model: Whisper Medium
  • Parameters: 769M
  • Performance:
    • WER: ~13%
    • CER: ~3.5%
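
For reference, word error rate (WER) and character error rate (CER) can be computed with the jiwer library. The snippet below is a minimal sketch with made-up example strings, not this model's actual evaluation pipeline:

import jiwer

reference = "salom dunyo qalaysiz"    # ground-truth transcript (toy example)
hypothesis = "salom dunye qalaysiz"   # model output (toy example)

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # fraction of words wrong
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")  # fraction of characters wrong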

Training Data

This model was fine-tuned on approximately 700 hours of diverse Uzbek audio data including:

  • Publicly available podcasts
  • Tashkent dialect podcasts
  • Audiobooks
  • Common Voice 17 dataset

The dataset consisted of 60% human-transcribed and 40% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio materials to ensure strong performance on this dialect.
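Of these sources, only Common Voice is publicly downloadable. A minimal sketch of loading its Uzbek split with the Hugging Face datasets library follows; it assumes you have accepted the dataset's terms on the Hub, are logged in, and use a datasets version that still supports the dataset's loading script:

from datasets import load_dataset

# Uzbek split of Common Voice 17 (gated: requires accepting the terms on the Hub)
cv_uz = load_dataset("mozilla-foundation/common_voice_17_0", "uz", split="train")
print(cv_uz[0]["sentence"])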

Usage Example

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
processor = WhisperProcessor.from_pretrained("islomov/navaistt_v1_medium")
model = WhisperForConditionalGeneration.from_pretrained("islomov/navaistt_v1_medium")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def transcribe_audio(audio_path):
    # Load audio and resample to the 16 kHz rate Whisper expects
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Compute log-mel input features
    input_features = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Generate transcription; language and task are decoding options,
    # so they belong to generate(), not to the feature extractor
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="uz", task="transcribe")

    # Decode token ids to text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Example usage (Whisper processes at most 30 seconds per pass)
if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"

    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
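
For files longer than 30 seconds, one option is the transformers ASR pipeline, which chunks long audio automatically. A minimal sketch (the file name is hypothetical):

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="islomov/navaistt_v1_medium",
    chunk_length_s=30,  # split long audio into 30-second chunks
)
result = asr("long_audio.wav", generate_kwargs={"language": "uz", "task": "transcribe"})
print(result["text"])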

Future Improvements

Future versions will include more regional Uzbek dialects to improve overall coverage.
