Hindi Causal Language Model (convaiinnovations/hindi-foundational-model-base)

This repository contains a custom-trained Hindi Causal Language Model designed for Hindi text generation.

Model Description

  • Model Size: 113M parameters (yes, it's very small!)

  • Architecture: Custom Transformer (12 layers, hidden=768, 16 heads, ffn=3072, act=swiglu, norm=rmsnorm) based on the HindiCausalLM class with Hindi-specific optimizations (a minimal illustrative sketch of the SwiGLU/RMSNorm blocks follows this list):

    • Multi-resolution attention to capture both character-level and word-level patterns
    • Morphology-aware feed-forward layers
    • Script-mix processing for Hindi-English code-mixing
  • Language: Hindi (hi)

  • Training Data: 2.7 million high-quality Hindi text samples from:

    • IITB Parallel Corpus (1.2M sentences)
    • Samanantar (750K samples)
    • Oscar Hindi (450K sentences)
    • CC-100 Hindi (300K sentences)
    • Hindi Wikipedia (150K articles)
    • Hindi news articles (100K pieces)
    • XNLI Hindi (50K premise-hypothesis pairs)
    • IndicGLUE (30K samples)
    • Hindi literature (5K passages)
  • Tokenizer: SentencePiece trained on Hindi text with vocab size of 16,000

  • Training Details: Trained for 2 epochs (~8 hours) on 4x NVIDIA L4 GPUs (24 GB VRAM each) with hidden_size=768, num_layers=12, block_size=512, batch_size=64, learning_rate=5e-5, SwiGLU activation, RoPE positional encoding, and RMS normalization.

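The custom blocks listed above are implemented in hindi_language_model.py in this repository. As a rough, illustrative sketch only (the class and parameter names below are assumptions, not the repository's actual API), a SwiGLU feed-forward layer and RMS normalization with the quoted dimensions typically look like this in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU FFN (hidden=768, ffn=3072); the real layer lives in hindi_language_model.py."""
    def __init__(self, hidden_size: int = 768, ffn_size: int = 3072):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.down_proj = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) multiplied elementwise with (x W_up), projected back to hidden_size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class RMSNorm(nn.Module):
    """Illustrative RMS normalization (used here in place of LayerNorm)."""
    def __init__(self, hidden_size: int = 768, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale activations by the reciprocal root-mean-square over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

RoPE positional encoding, the multi-resolution attention, and the morphology-aware layers are likewise defined in the repository files and are not reproduced here.
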
How to Use

⚠️ Important: This model uses custom Python classes (HindiCausalLM, HindiCausalLMConfig, SentencePieceTokenizerWrapper) which are not part of the standard Hugging Face transformers library. The custom Python files are included in this repository.
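
Because these classes live in plain Python files rather than in a registered transformers architecture, the downloaded hindi_language_model.py and hindi_embeddings.py must be importable when you run your script. A minimal sketch, assuming the files have been downloaded into the current working directory (as in the snippets below):

import os
import sys

# Make the downloaded custom module files importable (assumes they sit in the current directory)
sys.path.insert(0, os.path.abspath("."))

from hindi_language_model import HindiCausalLM, HindiCausalLMConfig
from hindi_embeddings import SentencePieceTokenizerWrapper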

Download Required Files

import os
from huggingface_hub import hf_hub_download

# Configuration
repo_id = "convaiinnovations/hindi-foundational-model-base"
model_dir = "."  # Use current directory for downloaded files

# Download model files
print(f"Downloading files for {repo_id}...")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json", local_dir=model_dir)
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.model", local_dir=model_dir)

# Download custom module files (these are crucial!)
hindi_model_path = hf_hub_download(repo_id=repo_id, filename="hindi_language_model.py", local_dir=model_dir)
hindi_embeddings_path = hf_hub_download(repo_id=repo_id, filename="hindi_embeddings.py", local_dir=model_dir)

# Try safetensors first, then fall back to the pytorch_model.bin checkpoint
try:
    weights_path = hf_hub_download(repo_id=repo_id, filename="model.safetensors", local_dir=model_dir)
    using_safetensors = True
except Exception:
    weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin", local_dir=model_dir)
    using_safetensors = False

print("All necessary files downloaded.")

Debug and Inference Script

import os
import json
import torch
import argparse # Keep argparse for potential future use
import numpy as np
import time
import traceback  # For detailed exception info

# Try importing safetensors
try:
    import safetensors.torch
    SAFE_TENSORS_AVAILABLE = True
except ImportError:
    SAFE_TENSORS_AVAILABLE = False

print("[INFO] --- Debug Inference Script Started ---")
if SAFE_TENSORS_AVAILABLE: print("[INFO] safetensors library found.")
else: print("[WARNING] safetensors library not found.")

# --- Attempt to import custom modules ---
print("[DEBUG] Attempting to import custom modules...")
try:
    from hindi_language_model import HindiCausalLM, HindiCausalLMConfig
    from hindi_embeddings import SentencePieceTokenizerWrapper
    print("[INFO] Successfully imported custom modules.")
except ImportError as e:
    print(f"[ERROR] Failed to import custom modules: {e}"); traceback.print_exc()

# --- End Custom Module Import ---


# --- Main Generation Function Definition ---
def run_generation(
    model_path: str,
    prompt: str,
    max_len: int,
    temp: float,
    top_k: int,
    seed: int,
    device_str: str
):
    """Loads model and generates text, printing debug info."""
    print(f"\nINFO: --- Starting Generation ---")
    print(f"[DEBUG] Args: path='{model_path}', max_len={max_len}, temp={temp}, top_k={top_k}, seed={seed}, device='{device_str}'")

    # --- Setup ---
    t_start_setup = time.time()
    try:
        torch.manual_seed(seed); np.random.seed(seed); device = torch.device(device_str)
        if device.type == 'cuda': torch.cuda.manual_seed_all(seed)
        print(f"[INFO] Using device: {device}")
        print(f"[DEBUG] Setup took {time.time()-t_start_setup:.4f}s")
    except Exception as e: print(f"[ERROR] Device/Seed setup failed: {e}"); traceback.print_exc(); return None

    # --- Load Tokenizer ---
    print("\n[INFO] --- Loading Tokenizer ---")
    t_start_load = time.time(); tokenizer = None
    try:
        tokenizer_model_file = os.path.join(model_path, "tokenizer.model")
        print(f"[DEBUG] Looking for tokenizer at: {tokenizer_model_file}")
        assert os.path.exists(tokenizer_model_file), "tokenizer.model not found!"
        tokenizer = SentencePieceTokenizerWrapper(tokenizer_model_file) # Use imported class
        print(f"[INFO] Tokenizer loaded. Vocab: {getattr(tokenizer, 'vocab_size', 'N/A')}")
        # Get BOS/EOS (handle if missing)
        bos_id = getattr(tokenizer, 'bos_token_id', 1) # Default 1
        eos_id = getattr(tokenizer, 'eos_token_id', 2) # Default 2
        print(f"[INFO] BOS ID: {bos_id}, EOS ID: {eos_id}")
    except Exception as e: print(f"[ERROR] Tokenizer loading failed: {e}"); traceback.print_exc(); return None

    # --- Load Config ---
    print("\n[INFO] --- Loading Config ---")
    lm_config = None
    try:
        config_file = os.path.join(model_path, "config.json")
        print(f"[DEBUG] Looking for config at: {config_file}")
        assert os.path.exists(config_file), "config.json not found!"
        with open(config_file, 'r', encoding='utf-8') as f: config_dict = json.load(f)
        print(f"[DEBUG] Config JSON loaded.")
        # Check/fix vocab size
        tok_vocab = getattr(tokenizer, 'vocab_size', None)
        if tok_vocab and 'vocab_size' in config_dict and config_dict['vocab_size'] != tok_vocab: print(f"[WARN] Config/Tokenizer vocab mismatch. Using tokenizer size: {tok_vocab}"); config_dict['vocab_size'] = tok_vocab
        # Instantiate config
        if hasattr(HindiCausalLMConfig, 'from_dict'): lm_config = HindiCausalLMConfig.from_dict(config_dict)
        else: lm_config = HindiCausalLMConfig(**config_dict)
        print("[INFO] Model config loaded.")
    except Exception as e: print(f"[ERROR] Config loading failed: {e}"); traceback.print_exc(); return None

    # --- Load Model ---
    print("\n[INFO] --- Loading Model ---")
    model = None
    try:
        print(f"[DEBUG] Instantiating {HindiCausalLM.__name__}...")
        model = HindiCausalLM(lm_config); print(f"[INFO] Model structure created.")
        weights_file = None; s_path = os.path.join(model_path, "model.safetensors"); b_path = os.path.join(model_path, "pytorch_model.bin")
        print(f"[DEBUG] Checking weights: {s_path} (exists: {os.path.exists(s_path)}), {b_path} (exists: {os.path.exists(b_path)})")
        if SAFE_TENSORS_AVAILABLE and os.path.exists(s_path): weights_file = s_path
        elif os.path.exists(b_path): weights_file = b_path
        else: raise FileNotFoundError("Model weights (.safetensors or .bin) not found!")
        print(f"[INFO] Loading weights from: {weights_file}")
        if weights_file.endswith(".safetensors"): state_dict = safetensors.torch.load_file(weights_file, device="cpu")
        else: state_dict = torch.load(weights_file, map_location="cpu")
        print(f"[DEBUG] State dict loaded to CPU. Keys: {len(state_dict)}")
        try: load_res = model.load_state_dict(state_dict, strict=True)
        except RuntimeError as e_load: print(f"[WARN] Strict load failed: {e_load}. Trying non-strict."); load_res = model.load_state_dict(state_dict, strict=False)
        missing = getattr(load_res, "missing_keys", []); unexpected = getattr(load_res, "unexpected_keys", [])
        print(f"[INFO] State dict loaded. Missing: {len(missing)}. Unexpected: {len(unexpected)}")
        if missing: print(f"[WARN] Missing keys: {missing[:5]}...")
        if unexpected: print(f"[WARN] Unexpected keys: {unexpected[:5]}...")
        del state_dict; model.to(device); model.eval()
        print("[INFO] Model loaded to device and set to eval mode.")
        print(f"[DEBUG] Tokenizer+Config+Model loading took {time.time()-t_start_load:.2f}s")
    except Exception as e: print(f"[ERROR] Model loading failed: {e}"); traceback.print_exc(); return None

    # --- Generation ---
    print("\n[INFO] --- Starting Text Generation ---")
    t_start_gen = time.time()
    print(f"[INFO] Prompt: \"{prompt}\"")
    try:
        print("[DEBUG] Encoding prompt...")
        # Use __call__ or sp_model.EncodeAsIds
        if hasattr(tokenizer, '__call__'):
            print("[DEBUG] Trying tokenizer(prompt)...")
            encoded_result = tokenizer(prompt, return_tensors=None)
            if isinstance(encoded_result, dict) and 'input_ids' in encoded_result:
                input_ids = encoded_result['input_ids']
            elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
                print(f"[DEBUG] __call__ result type {type(encoded_result)} unexpected. Falling back to sp_model.EncodeAsIds...")
                input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
            else:
                raise AttributeError("Cannot find suitable encoding method (__call__ or sp_model.EncodeAsIds)")
        elif hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'EncodeAsIds'):
            print("[DEBUG] Trying tokenizer.sp_model.EncodeAsIds...")
            input_ids = tokenizer.sp_model.EncodeAsIds(prompt)
        else:
            raise AttributeError("Cannot find suitable encoding method")
        print(f"[DEBUG] Prompt token IDs: {input_ids}")

        if bos_id is not None: print(f"[DEBUG] Prepending BOS {bos_id}"); input_ids = [bos_id] + input_ids
        input_tensor = torch.tensor([input_ids], dtype=torch.long, device=device); print(f"[DEBUG] Initial input tensor shape: {input_tensor.shape}")
        generated_ids = input_tensor

        print("[DEBUG] Starting generation loop...")
        with torch.no_grad():
            for i in range(max_len - len(input_ids)):
                step = i + 1; print(f"\nDEBUG: --- Step {step}/{max_len - len(input_ids)} | Current len: {generated_ids.shape[1]} ---")
                t_fwd = time.time();

                # --- FORWARD CALL AND LOGIT EXTRACTION ---
                outputs = model(input_ids=generated_ids) # model call

                # *** CORRECTED LOGIT ACCESS ***
                if isinstance(outputs, dict) and 'logits' in outputs:
                    logits = outputs['logits'] # Access via key if output is dict
                    print(f"DEBUG: Fwd pass {time.time()-t_fwd:.4f}s. Accessed dict['logits'].")
                elif hasattr(outputs, 'logits'):
                    logits = outputs.logits # Access via attribute if output is object
                    print(f"DEBUG: Fwd pass {time.time()-t_fwd:.4f}s. Accessed outputs.logits.")
                else:
                    print(f"[ERROR] Model output type is {type(outputs)}, and does not contain 'logits'.")
                    raise TypeError("Model output format error.")
                # *** END CORRECTION ***

                next_token_logits = logits[:, -1, :]; print(f"DEBUG: Next logits shape: {next_token_logits.shape}")

                # --- Sampling ---
                if temp > 0: scaled_logits = next_token_logits / temp
                else: scaled_logits = next_token_logits # Greedy
                if top_k > 0: kth_vals, _ = torch.topk(scaled_logits, k=top_k, dim=-1); scaled_logits[scaled_logits < kth_vals[:, -1].unsqueeze(-1)] = -float("Inf")
                probs = torch.softmax(scaled_logits, dim=-1); next_token_id = torch.multinomial(probs, num_samples=1); print(f"DEBUG: Sampled ID: {next_token_id.item()}")
                generated_ids = torch.cat([generated_ids, next_token_id], dim=1)
                if next_token_id.item() == eos_id: print(f"INFO: EOS token {eos_id} generated."); break
            else: print(f"INFO: Reached max length {max_len}.")

        # --- Decode ---
        print("\nDEBUG: --- Post-processing ---")
        output_ids = generated_ids[0].cpu().tolist(); print(f"[DEBUG] Raw output IDs: {output_ids}")
        processed_ids = output_ids
        if bos_id and processed_ids and processed_ids[0] == bos_id: print("[DEBUG] Removing BOS"); processed_ids = processed_ids[1:]
        if eos_id and processed_ids and processed_ids[-1] == eos_id: print("[DEBUG] Removing EOS"); processed_ids = processed_ids[:-1]
        print(f"[DEBUG] Processed IDs: {processed_ids}")
        print("[INFO] Decoding...")
        # Use sp_model.DecodeIds or decode
        if hasattr(tokenizer, 'sp_model') and hasattr(tokenizer.sp_model, 'DecodeIds'): print("DEBUG: Decoding using tokenizer.sp_model.DecodeIds..."); generated_text = tokenizer.sp_model.DecodeIds(processed_ids)
        elif hasattr(tokenizer, 'decode'): print("DEBUG: Decoding using tokenizer.decode..."); generated_text = tokenizer.decode(processed_ids)
        else: raise AttributeError("Cannot find suitable decoding method")
        print(f"[DEBUG] Decoded text: '{generated_text}'")
        print(f"[INFO] Generation successful ({time.time() - t_start_gen:.2f}s).")
        return generated_text

    except Exception as e: print(f"ERROR: Generation loop error: {e}"); traceback.print_exc(); return None
# --- End Generation Function Definition ---


# --- Main Execution Block ---
if __name__ == "__main__":
    # --- Parameters ---
    model_dir = "."  # Use current directory if files are downloaded here
    prompt = "गंगा नदी"
    max_len = 80
    temp = 2
    top_k = 45
    seed = 42
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("\n[INFO] --- Simple Hindi Text Generation Script ---")
    print(f"[INFO] Model Dir: {model_dir}")
    print(f"[INFO] Prompt: \"{prompt}\"")
    print(f"[INFO] Max Length: {max_len}")
    print(f"[INFO] Temperature: {temp}")
    print(f"[INFO] Top-K: {top_k}")
    print(f"[INFO] Seed: {seed}")
    print(f"[INFO] Device: {device}")
    print("-" * 30)

    # --- Validate Path ---
    if not os.path.isdir(model_dir): print(f"[ERROR] Model directory not found: {model_dir}"); exit(1)

    # --- Run Generation ---
    generated_output = run_generation(
        model_path=model_dir, prompt=prompt, max_len=max_len,
        temp=temp, top_k=top_k, seed=seed, device_str=device
    )

    # --- Print Result ---
    print("\n" + "="*20 + " Final Generation Result " + "="*20)
    if generated_output is not None:
        print(f"Prompt: {prompt}")
        print("-" * (40 + len(" Final Generation Result ")))
        print("Generated Text:")
        print(generated_output)
    else:
        print("\n[FAILURE] Text generation failed. Check print statements above.")
    print("=" * (40 + len(" Final Generation Result ")))

Example Outputs

Basic Example

prompt = "हिंदी भाषा"
# Output: "हिंदी भाषा भारत की सबसे महत्वपूर्ण भाषाओं में से एक है। यह भारत के उत्तर भारत के राज्यों में मुख्य भाषा के रूप में बोली जाती है..."

Creative Writing Example

prompt = "एक बार की बात है"
# Output: "एक बार की बात है, जब मैं छोटा था, तब मेरे दादाजी मुझे एक कहानी सुनाया करते थे। वह कहानी एक ऐसे राजा की थी जो अपने राज्य में..."

Limitations and Biases

  • The model may reflect biases present in its training data, including potential cultural, gender, or regional biases found in source materials.
  • Performance is limited by its architecture size (12 layers, hidden=768) and training dataset size.
  • May generate repetitive, nonsensical, or factually incorrect text.
  • Uses weighted pooling with sensitivity to Hindi's SOV structure, but may struggle with complex semantic relationships in longer texts.
  • May have particular difficulties with:
    • Cultural concepts lacking direct English translations
    • Idiomatic expressions specific to Hindi
    • Formal/informal speech distinctions
    • Handling Hindi-specific morphological complexities

License

This model is licensed under the MIT License.

Please use this model responsibly.
