---
library_name: transformers
tags:
- audio
- speech-recognition
- wav2vec2
- lora
- quantization
---

# Model Card for wav2vec2-lora-quantized

## Model Details

- **Developer:** Dhulipalla Gopi Chandu
- **Base Model:** facebook/wav2vec2-base-960h
- **Techniques Used:** LoRA, Quantization
- **Library:** 🤗 Transformers
- **Task:** Automatic Speech Recognition (ASR)
- **Language:** English
- **License:** Apache 2.0

### Model Description

This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Dhulipalla Gopi Chandu
- **Funded by:** Not applicable (independent project)
- **Shared by:** Dhulipalla Gopi Chandu
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s) (NLP):** English
- **License:** Apache 2.0 (same as base model)
- **Finetuned from model:** facebook/wav2vec2-base-960h

## Usage (Python)

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned model and its processor from the Hub
model = Wav2Vec2ForCTC.from_pretrained("DhulipallaGopiChandu/wav2vec2-lora-quantized")
processor = Wav2Vec2Processor.from_pretrained("DhulipallaGopiChandu/wav2vec2-lora-quantized")

# Read a 16 kHz mono audio file
speech, rate = sf.read("audio.wav")

# Convert raw audio to model inputs and run inference
inputs = processor(speech, return_tensors="pt", sampling_rate=rate)
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```

### Limitations & Risks

This model may not perform well on non-English or noisy audio. Validate it for fairness before using it in production.

## Uses

This model is intended for automatic speech recognition (ASR) tasks. It transcribes spoken English audio into text with reasonable accuracy and efficiency thanks to LoRA fine-tuning and quantization.

### Direct Use

- Speech-to-text applications for English language input.
- Voice-controlled assistants and accessibility tools for the hearing impaired.
- Transcription tools for meetings, lectures, interviews, or podcasts.

Users can directly load and use the model without additional training.

### Downstream Use

This model can be integrated into larger systems such as voice bots, real-time captioning systems, or automated subtitling software (see the pipeline sketch at the end of this section). It may also serve as a base model for further fine-tuning on domain-specific datasets (e.g., medical speech, call center logs).

### Out-of-Scope Use

- Non-English audio input: the model is not fine-tuned for multilingual or non-English datasets.
- Real-time safety-critical systems (e.g., medical decision-making, emergency call processing) without validation.
- Noisy or overlapping speech: performance may drop in low-quality or multi-speaker environments.
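Returning to the Downstream Use scenarios above, the sketch below shows one way to wrap the checkpoint in the 🤗 `pipeline` API so it can be dropped into a captioning or voice-bot backend. This is a minimal sketch that assumes the quantized checkpoint loads like any standard Wav2Vec2 CTC model; the file name and chunk length are illustrative, not part of this card.

```python
from transformers import pipeline

# Hypothetical integration sketch: expose the checkpoint through the
# automatic-speech-recognition pipeline for use in a larger system.
asr = pipeline(
    "automatic-speech-recognition",
    model="DhulipallaGopiChandu/wav2vec2-lora-quantized",
    chunk_length_s=20,  # split long recordings into ~20 s chunks, mirroring the training setup
)

result = asr("meeting_recording.wav")  # illustrative file name
print(result["text"])
```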
## Bias, Risks, and Limitations

While this model performs well on clean English audio, it has limitations and potential biases:

- **Accents and dialects:** may underperform on heavily accented speech not present in the training data.
- **Background noise sensitivity:** not ideal for environments with high noise levels.
- **Bias in training data:** if the base model or fine-tuning data was unbalanced (e.g., by gender, region, or age), recognition performance may vary across demographics.

### Recommendations

- Use this model in conjunction with human-in-the-loop systems when high accuracy is critical.
- Test the model's performance on your specific audio environment and user group before production deployment.
- Consider fine-tuning on your domain-specific data if accuracy is suboptimal for your needs.

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the Usage example above to get started: load the checkpoint with `Wav2Vec2ForCTC.from_pretrained` and `Wav2Vec2Processor.from_pretrained`, pass 16 kHz audio through the processor, and decode the argmax of the logits.

## Training Details

### Training Data

The model was fine-tuned on the LibriSpeech ASR Corpus, specifically the train-clean-100 split: 100 hours of clean speech read by native English speakers, provided as paired audio-transcription data and commonly used for automatic speech recognition.

Data summary:

- **Source:** LibriVox audiobooks (public domain)
- **Language:** English
- **Audio format:** 16 kHz mono-channel WAV files

Additional preprocessing steps (a loading sketch follows this list):

- All audio was resampled to 16 kHz
- Transcriptions were lowercased and stripped of punctuation
- Audio longer than 20 seconds was truncated or split into segments
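The exact preparation script is not included in this card, so the following is a hedged sketch of how the data described above could be loaded and normalized with the 🤗 `datasets` library; the regular expressions and the 20-second cutoff are reconstructions of the steps listed above, not the original code.

```python
import re
from datasets import load_dataset, Audio

# Load the 100-hour clean training split of LibriSpeech from the Hub
dataset = load_dataset("librispeech_asr", "clean", split="train.100")

# Ensure 16 kHz mono audio (LibriSpeech is already 16 kHz, so this is a safeguard)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def normalize_text(example):
    # Lowercase, drop punctuation except apostrophes, collapse extra spaces
    text = example["text"].lower()
    text = re.sub(r"[^a-z' ]", " ", text)
    example["text"] = re.sub(r"\s+", " ", text).strip()
    return example

dataset = dataset.map(normalize_text)

# Drop utterances longer than 20 seconds
dataset = dataset.filter(
    lambda ex: len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] <= 20.0
)
```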
### Training Procedure

#### Preprocessing

Audio and text inputs were processed with the Hugging Face `Wav2Vec2Processor`, which combines a feature extractor for audio and a tokenizer for text:

```python
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt", padding=True)
labels = processor.tokenizer(transcript, return_tensors="pt", padding=True).input_ids
```

The audio data was prepared as follows:

- **Sampling rate:** all audio resampled to 16 kHz (matching model requirements)
- **Format:** `.wav` or `.flac`, mono-channel
- **Duration:** audio longer than 20 seconds was filtered or chunked
- **Normalization:** audio normalized between -1 and 1
- **Text normalization:** transcriptions lowercased, punctuation removed (except apostrophes), and extra spaces collapsed

#### Training Hyperparameters

- **Base model:** facebook/wav2vec2-base-960h
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Training regime:** FP16 mixed precision
- **Epochs:** 5
- **Batch size:** 8 (due to memory constraints)
- **Gradient accumulation steps:** 2
- **Learning rate:** 3e-4
- **Optimizer:** AdamW
- **Scheduler:** linear decay
- **Warmup ratio:** 0.1
- **LoRA configuration:** rank = 8, alpha = 16, dropout = 0.1
- **Framework:** Transformers + PEFT (Parameter-Efficient Fine-Tuning)

#### Speeds, Sizes, Times

- **Training time:** ~2 hours (Google Colab Pro+ with T4 GPU)
- **Model size after LoRA + quantization:** ~123 MB
- **Original model size:** ~360 MB
- **Uploaded checkpoint size:** ~150 MB (includes processor + model card)
- **Inference speed:** ~2.3x faster than the full-precision model on CPU

## Evaluation

This section describes the evaluation protocols and metrics used to assess model performance.

### Testing Data, Factors & Metrics

#### Testing Data

- **Primary:** LibriSpeech test-clean
- **Secondary (custom):** Indian English Speech Dataset (internal), 2-hour curated sample

#### Factors

Evaluation covered variability across:

- **Speaker accents:** American, British, Indian
- **Background noise:** clean and noisy (SNR ≥ 20 dB)
- **Speech tempo:** normal and fast speech
- **Audio duration:** 5–15 s segments

#### Metrics

- **WER (Word Error Rate):** the percentage of words incorrectly predicted; lower is better. WER = (S + D + I) / N, where S = substitutions, D = deletions, I = insertions, and N = total words in the reference.
- **CER (Character Error Rate):** the same computation at the character level; useful for small-vocabulary datasets or noisy transcriptions.

These metrics evaluate the model's speech-to-text performance across varying input lengths and accents.

### Results

- WER on LibriSpeech test-clean: 7.15%
- WER on LibriSpeech test-other: 12.4%
- CER (internal test): ~4.6% (custom Indian English speech dataset)

#### Summary

The LoRA-quantized version of facebook/wav2vec2-base-960h achieves competitive accuracy while drastically reducing model size and memory consumption. It is well suited for deployment in edge environments or low-resource applications.

## Model Examination

No formal interpretability analysis was conducted. However, the attention patterns and token activations from intermediate layers may be visualized using tools such as:

- BertViz
- The attention weights returned by 🤗 Transformers Wav2Vec2 models (via `output_attentions=True`); see the sketch below.
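As a starting point for the visualization mentioned above, the following hedged sketch shows how per-layer attention weights can be extracted; it assumes the quantized checkpoint exposes the standard Wav2Vec2 outputs, and the audio file name is illustrative.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

repo_id = "DhulipallaGopiChandu/wav2vec2-lora-quantized"

# output_attentions=True makes the model return per-layer attention weights
model = Wav2Vec2ForCTC.from_pretrained(repo_id, output_attentions=True)
processor = Wav2Vec2Processor.from_pretrained(repo_id)

speech, rate = sf.read("audio.wav")  # illustrative file name
inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per transformer layer,
# each shaped (batch, num_heads, seq_len, seq_len); these tensors can be
# passed to external visualization tools such as BertViz.
print(len(outputs.attentions), outputs.attentions[0].shape)
```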
## Environmental Impact

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700):

- **Hardware type:** NVIDIA Tesla T4 (for training)
- **Hours used:** ~2 hours (LoRA fine-tuning and quantization)
- **Cloud provider:** Google Colab
- **Compute region:** Asia-South1 (Mumbai)
- **Carbon emitted:** ~0.47 kg CO₂eq (estimated via the MLCO2 calculator)

Emissions are small thanks to parameter-efficient training (LoRA) and quantization, but we still encourage users to follow green AI practices and re-use this checkpoint where possible.

## Technical Specifications

### Model Architecture and Objective

- **Base model:** facebook/wav2vec2-base-960h
- **Modified with:** LoRA (Low-Rank Adaptation) + 8-bit quantization
- **Objective:** speech-to-text transcription using CTC loss
- **Framework:** Transformers
- **Pretraining objective:** self-supervised learning on masked speech frames
- **Fine-tuning objective:** CTC on labeled datasets (LibriSpeech / custom)

### Compute Infrastructure

As previously described:

#### Hardware

- GPU: NVIDIA Tesla T4 / A100 (depending on availability)
- RAM: 16–32 GB
- Storage: SSD for faster data loading

#### Software

- transformers >= 4.36
- datasets
- peft
- accelerate
- bitsandbytes (for quantization)
- huggingface_hub
- torchaudio and soundfile for audio processing

## Citation

**BibTeX:**

```bibtex
@misc{dhulipalla2025wav2vec2lora,
  title        = {Wav2Vec2 LoRA Quantized Model},
  author       = {Dhulipalla Gopi Chandu},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DhulipallaGopiChandu/wav2vec2-lora-quantized}},
}
```

**APA:**

Dhulipalla, G. C. (2025). *Wav2Vec2 LoRA Quantized Model* [Computer software]. Hugging Face. https://huggingface.co/DhulipallaGopiChandu/wav2vec2-lora-quantized

## Glossary

- **ASR (Automatic Speech Recognition):** converting spoken language into text using machine learning models.
- **LoRA (Low-Rank Adaptation):** a fine-tuning technique that enables efficient training of large models with far fewer trainable parameters.
- **Quantization:** reducing the precision of a model's weights (e.g., from float32 to int8) to shrink memory use and speed up inference.
- **CTC (Connectionist Temporal Classification):** a loss function for speech-to-text tasks that enables alignment-free training.

## More Information

- GitHub: github.com/DhulipallaGopiChandu
- LinkedIn: linkedin.com/in/dhulipalla-gopi
- Instagram: @dhulipalla_gopi_9999

## Model Card Authors

Dhulipalla Gopi Chandu – B.Tech in AI & ML
AI Researcher | Speech & NLP Enthusiast
Hugging Face Profile

## Model Card Contact

If you have questions, suggestions, or issues related to this model, please contact: gopichandudhulipalla@gmail.com