---
language:
- en
- ta
license: cc-by-4.0
tags:
- translation
- tamil
- colloquial-tamil
- fine-tuned
- text-to-text
datasets:
- janisrebekahv/colloquial_tamil
- jarvisvasu/english-to-colloquial-tamil
- chatgpt-generated
- youtube-comments
model-index:
- name: janisrebekahv/finetuned-colloquial-tamil
  results:
  - task:
      type: translation
      name: English to Colloquial Tamil
    dataset:
      name: janisrebekahv/colloquial_tamil
      type: text
    metrics:
    - name: BLEU Score
      type: bleu
      value: 38.5
    - name: ROUGE Score
      type: rouge
      value: 0.72
---

# janisrebekahv/finetuned-colloquial-tamil

## 📌 Model Overview

This is a **fine-tuned version of [suriya7/English-to-Tamil](https://huggingface.co/suriya7/English-to-Tamil)**, trained to produce **colloquial Tamil translations** instead of formal Tamil.

✅ Translates **English → Colloquial Tamil**

✅ Incorporates **slang, informal speech, and real-world phrasing**

✅ Useful for **chatbots, conversational AI, and social media applications**

---

## 📜 Dataset

🔹 **Custom Dataset Used for Fine-Tuning:**
📂 **[janisrebekahv/colloquial_tamil](https://huggingface.co/datasets/janisrebekahv/colloquial_tamil)**

This dataset was specifically curated to train this model, improving its ability to translate **English to Colloquial Tamil** accurately.

This model was fine-tuned on a **custom dataset**, which includes:

1️⃣ **[jarvisvasu/english-to-colloquial-tamil](https://huggingface.co/datasets/jarvisvasu/english-to-colloquial-tamil)** – A publicly available dataset for informal Tamil translations.

2️⃣ **YouTube Comments Dataset (Custom-Created)** – Extracted using the **YouTube API** and manually converted to colloquial Tamil for authenticity.

3️⃣ **ChatGPT-Generated Data** – Additional colloquial Tamil phrases aligned with natural speech patterns.
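
One way to picture how several sources end up as a single set of sentence pairs is a merge-and-deduplicate pass. The sketch below is only an illustration under stated assumptions: each source is assumed to yield `(English, Tamil)` pairs, and the `merge_sources` helper, the source labels, and the sample pairs are all hypothetical rather than taken from the actual data pipeline.

```python
from typing import Iterable

Pair = tuple[str, str]  # (English sentence, colloquial Tamil sentence)


def merge_sources(sources: dict[str, Iterable[Pair]]) -> list[dict[str, str]]:
    """Combine pairs from several sources, skipping empty and duplicate pairs."""
    seen: set[Pair] = set()
    merged: list[dict[str, str]] = []
    for source_name, pairs in sources.items():
        for english, tamil in pairs:
            english, tamil = english.strip(), tamil.strip()
            if not english or not tamil:
                continue  # skip incomplete pairs
            key = (english.lower(), tamil)  # case-insensitive on the English side
            if key in seen:
                continue  # skip pairs already contributed by an earlier source
            seen.add(key)
            merged.append({"en": english, "ta": tamil, "source": source_name})
    return merged


# Hypothetical sample pairs, for illustration only
sources = {
    "public_dataset": [("Where are you going?", "எங்க போற?")],
    "youtube_comments": [("This is so beautiful", "இது ரொம்ப அழகா இருக்கு")],
    "generated": [("where are you going?", "எங்க போற?"), ("Come here", "")],
}

merged = merge_sources(sources)
print(len(merged))  # 2: the duplicate and the incomplete pair are dropped
```

Keeping a `source` field on every pair makes it possible to inspect or rebalance the mix of sources later; whether the real dataset records this is not stated here.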

📝 **Total dataset size**: **16,269 sentence pairs**

---

## 🔥 Example Usage

Load and test the model using **Hugging Face Transformers**:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model_name = "janisrebekahv/finetuned-colloquial-tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Function to translate text
def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example translations
test_sentences = [
    "This is so beautiful",
    "Bro, are you coming or not?",
    "My mom is gonna kill me if I don't reach home now!"
]

for sentence in test_sentences:
    print(f"English: {sentence}")
    print(f"Colloquial Tamil: {translate(sentence)}\n")