---
library_name: transformers
datasets:
- ai4bharat/samanantar
language:
- en
- te
metrics:
- bleu
base_model:
- aryaumesh/english-to-telugu
---

# English-to-Telugu Translation Model

## Overview
This project is a deep learning-based English-to-Telugu translation model trained on a custom dataset. It uses Hugging Face Transformers for NLP and was developed in Google Colab. The model can be used to translate sentences with improved contextual accuracy.

## Features
✅ Translates English text to Telugu
✅ Trained on a custom bilingual dataset
✅ Uses a Transformer-based model
✅ Implemented and trained in Google Colab
✅ Can be fine-tuned for better accuracy

## Tech Stack
- **Programming Language**: Python
- **Framework**: Hugging Face Transformers
- **Model**: mBART (fine-tuned)
- **Libraries**:
  - `transformers` (Hugging Face)
  - `torch` (PyTorch)
  - `sentencepiece` (tokenization)
- **Platform**: Google Colab

## Dataset
- Custom English-Telugu parallel corpus
- Preprocessing steps:
  - **Tokenization** (SentencePiece / WordPiece)
  - **Lowercasing & cleaning**
  - **Removing noisy data**

## Model Training
Training was done in Google Colab using a GPU.
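The cleaning steps listed under **Dataset** can be sketched as follows. This is an illustrative, hypothetical pipeline (the exact cleaning rules used for this model are not published): it lowercases the English side, normalizes whitespace, and drops empty or over-long pairs.

```python
import re

def clean_pair(src, tgt, max_len=128):
    """Basic cleaning for an English-Telugu sentence pair
    (illustrative only, not the exact pipeline used for this model)."""
    src = re.sub(r"\s+", " ", src.strip().lower())  # lowercase + collapse whitespace
    tgt = re.sub(r"\s+", " ", tgt.strip())          # Telugu has no case; just normalize
    # Drop noisy pairs: empty sides or sentences past the length limit
    if not src or not tgt:
        return None
    if len(src.split()) > max_len or len(tgt.split()) > max_len:
        return None
    return src, tgt

pairs = [
    ("  Good   Morning! ", "శుభోదయం!"),
    ("", "ఖాళీ"),  # noisy: empty English side, gets dropped
]
cleaned = [p for p in (clean_pair(s, t) for s, t in pairs) if p]
print(cleaned)  # [('good morning!', 'శుభోదయం!')]
```

The surviving pairs would then be tokenized (e.g. with the model's SentencePiece tokenizer) before training.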
Here’s a snippet of the fine-tuning process (the original draft mixed `MarianMTModel` with an mBART checkpoint; the `Auto*` classes below load the correct architecture for the checkpoint):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Load pre-trained model & tokenizer
model_name = "aryaumesh/english-to-telugu"  # Base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Preprocess dataset (example)
def encode_data(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=custom_dataset,  # your tokenized parallel corpus
)

trainer.train()
```

## Run the Model

```python
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

english_text = "Good morning, how are you?"
telugu_translation = translate(english_text)
print("Translated Text:", telugu_translation)
```

## Future Improvements
🔹 Train on a larger dataset for better accuracy
🔹 Optimize inference speed for real-time use
🔹 Deploy as a cloud-based API (AWS/GCP)
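## Evaluation

The metadata above lists BLEU as the evaluation metric. As a self-contained illustration of what BLEU measures (clipped n-gram precision times a brevity penalty), here is a minimal single-reference implementation; real evaluations should use `sacrebleu` or the `evaluate` library instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference BLEU on whitespace tokens (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # no smoothing: any zero n-gram precision zeroes BLEU
        precisions.append(overlap / total)
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("hello world how are you", "hello world how are you"))  # 1.0
```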