---
license: mit
language: en
tags:
- medical
- huggingface
- healthcare
pipeline_tag: text-generation
library_name: transformers
metrics:
- accuracy
base_model:
- distilbert/distilgpt2
---

# 🩺 Eira-0.1 Fine-Tuned Medical Chatbot

A fine-tuned version of the **Eira-0.1** causal language model, designed to answer questions and generate text grounded in a collection of medical PDFs. The model is optimized for question answering, summarization, and chatbot-style responses based on the specific PDF documents it was trained on.

---

## 📘 Model Summary

This Kaggle model is a fine-tuned version of the **Eira-0.1** causal language model (transformer-based, loaded with `AutoModelForCausalLM`). It has been adapted to understand and generate text based on content extracted from medical and domain-specific PDF documents.

The model is suited for interactive chat or question-answering tasks where the knowledge base is the document collection itself. It does **not** possess general world knowledge beyond these documents.

- **Training Data**: Text extracted page-by-page from PDFs in `/kaggle/input/dataset` using PyMuPDF
- **Fine-Tuning**: Performed over 3 epochs with no validation split
- **Architecture**: `AutoModelForCausalLM` with the tokenizer inherited from the base model

---

## 🚀 Usage

This model can be used for:

- **Question Answering**: Based strictly on the training PDFs
- **Text Generation**: Mimicking the tone, structure, and style of the documents
- **Summarization**: Experimental; may require carefully structured prompts

### Input / Output

- **Input**: String prompt (e.g., `"What is the recommended dosage for drug X?"`)
- **Output**: Generated string response based on the model's knowledge

---

## ⚠️ Known Limitations

- **Out-of-Domain Knowledge**: Hallucinations are likely when the model is asked about topics outside the PDFs
- **Specificity**: Heavily reliant on the clarity and structure of the source PDFs
- **Overfitting**: No validation set was used; generalization may be weak
- **Repetition**: May still repeat phrases in long responses
- **Prompt Sensitivity**: Works best when phrasing is close to the original document language

---

## 🖥️ System Requirements

### Hardware

- **Training**: GPU (Kaggle GPUs such as T4 or P100 were used); CPU works but is much slower
- **Inference**: GPU highly recommended; CPU supported with higher latency

### Software

- Python 3.x
- PyTorch
- Hugging Face Transformers
- PyMuPDF (`fitz`)
- `tqdm`

---

## 🧪 Implementation Details

- **Epochs**: 3
- **Batch Size**: 2
- **Tokenizer**: Inherited from the base model
- **Text Format**: `"filename.pdf - Page X:\n[page content]"`

---

## 🧾 Model Initialization

- **Base Model**: `Eira-0.1`
- **Base Path**: `/kaggle/input/eira0.1`
- **Fine-tuned Output**: `/kaggle/working/eira_2_finetuned`

> *(Link the base model card on Hugging Face, or the original source, if publishing externally.)*

---

## 📊 Model Stats

- **Size / Weights / Layers**: Inherited from base `Eira-0.1` [Details to be added if available]
- **Disk Size**: Same as the base model plus minor weight updates
- **Inference Latency**: Varies with hardware, prompt length, and decoding parameters

---

## 🗂️ Data Overview

### Training Data

- **Source**: PDF files from `/kaggle/input/dataset`
- **Type**: [Insert description, e.g., "clinical guidelines", "patient care manuals", etc.]
- **Extraction Tool**: PyMuPDF (see the sketch below)
- **Structure**: Page-wise extraction with basic formatting
- **Size**: Depends on the total number and length of the PDFs
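The original extraction script is not bundled with this card. The following is a minimal sketch of the page-wise extraction and formatting described above, assuming PyMuPDF (`fitz`) and the Kaggle dataset path listed under **Source**; the helper name `extract_pdf_pages` is illustrative, not part of the training code.

```python
# Sketch only: reproduces the "filename.pdf - Page X:\n[page content]" format.
from pathlib import Path

import fitz  # PyMuPDF
from tqdm import tqdm

DATA_DIR = Path("/kaggle/input/dataset")


def extract_pdf_pages(data_dir: Path) -> list[str]:
    """Return one formatted text segment per PDF page."""
    segments = []
    for pdf_path in tqdm(sorted(data_dir.glob("*.pdf"))):
        with fitz.open(pdf_path) as doc:
            for page_number, page in enumerate(doc, start=1):
                text = page.get_text().strip()  # whitespace trimming
                if text:
                    segments.append(f"{pdf_path.name} - Page {page_number}:\n{text}")
    return segments


segments = extract_pdf_pages(DATA_DIR)
print(f"Extracted {len(segments)} page segments")
```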
### Pre-processing

- Whitespace trimming
- Filename and page info appended to each text segment

### Evaluation Data

- **Split**: None; the entire dataset was used for training
- **Held-out Set**: Not used in the current pipeline

---

## 📉 Evaluation Results

- **Internal Evaluation**: None implemented in the current training script
- **Subgroup Performance**: Not assessed
- **Recommendation**: Use a test set of similar PDFs and metrics such as ROUGE, BLEU, or manual review (see the evaluation sketch at the end of this card)

---

## ⚖️ Fairness & Ethics

### Fairness

- No fairness metrics or evaluations were conducted
- The model reflects any biases present in the training PDFs

### Ethics

- **Misinformation Risk**: May generate plausible but incorrect responses
- **Privacy**: Ensure no private or confidential information was present in the training PDFs
- **Bias**: Output may replicate bias present in the source documents
- **Usage Guidance**:
  - Should not be used for real clinical advice without human validation
  - Use disclaimers and human oversight in production

---

## 🚧 Usage Limitations

- **Sensitive Use Cases**: Not suitable for deployment in high-stakes domains (medical/legal) without review
- **Prompt Engineering**: Needed for best results
- **Scope**: Limited to the PDF content; the model will not generalize beyond it

---

## ✅ Mitigation Strategies

- Curate and vet training data thoroughly
- Add post-processing or filters to generated output
- Inform users of the model's limitations and training scope
- Include human-in-the-loop review for critical use cases

---

## 📦 Example Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/kaggle/working/eira_2_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()


def ask_eira(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=200,
            do_sample=True,                        # required for temperature/top_p to take effect
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,   # GPT-2-style tokenizers define no pad token
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example usage
response = ask_eira("What is the treatment protocol for asthma?")
print(response)
```
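---

## 📏 Example Evaluation (Sketch)

No evaluation step is built into the current pipeline. As a hedged illustration of the recommendation under **Evaluation Results**, the sketch below scores generated answers against hand-written reference answers with ROUGE, assuming the Hugging Face `evaluate` library (with the `rouge_score` backend) is installed and the `ask_eira` helper above is in scope. The prompts and reference answers are hypothetical placeholders; a real run should draw them from held-out PDFs.

```python
# Sketch only: assumes `pip install evaluate rouge_score` and the
# `ask_eira` helper defined in the inference example above.
import evaluate

rouge = evaluate.load("rouge")

# Hypothetical held-out prompts with reference answers taken from
# PDFs that were NOT part of the training set.
eval_samples = [
    {
        "prompt": "What is the treatment protocol for asthma?",
        "reference": "Reference answer copied from a held-out guideline PDF.",
    },
    # ... add more held-out samples here
]

# Note: ask_eira returns the prompt plus the continuation; strip the prompt
# if you want to score the generated answer only.
predictions = [ask_eira(sample["prompt"]) for sample in eval_samples]
references = [sample["reference"] for sample in eval_samples]

# Aggregated ROUGE-1/2/L F-measures; manual clinical review is still advised.
scores = rouge.compute(predictions=predictions, references=references)
print(scores)
```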