---
datasets:
- UCSC-VLAA/MedReason
base_model:
- II-Vietnam/Medical-SFT-Qwen2.5-7B-Instruct-24-april
tags:
- RL
- Medical
---

# II Medical Model

## Dataset
- Training: the MedReason dataset, decontaminated against the validation sets to prevent data leakage.
- Validation: 10 distinct medical datasets, used to evaluate model performance.

## Evaluation Scores

| Dataset | DS 1 | DS 2 | DS 3 | DS 4 | DS 5 | DS 6 | DS 7 | DS 8 | DS 9 | DS 10 |
|---------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----------|
| QWQ   | -        | -        | -        | -        | -        | -        | -        | -        | -        | -         |
| ...   | -        | -        | -        | -        | -        | -        | -        | -        | -        | -         |
| II-SFT   | -        | -        | -        | -        | -        | -        | -        | -        | -        | -         |
| II-SFT-DAPO | -        | -        | -        | -        | -        | -        | -        | -        | -        | -         |

## Training Details

Model: Fine-tuned from II-Vietnam/Medical-SFT-Qwen2.5-7B-Instruct-24-april.

Algorithm: DAPO (GRPO-based advantage estimator).
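
As a rough illustration of what a GRPO-based estimator computes, the sketch below normalizes rewards within the group of responses sampled for one prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this example; the actual training code may differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Hypothetical helper: each response's advantage is its reward minus the
    group mean, divided by the group standard deviation (plus eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of 4 responses to one prompt (training used 16 per prompt).
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)
```

Responses that score above the group mean get a positive advantage and those below get a negative one. A group in which every response receives the same reward yields all-zero advantages and therefore no gradient signal.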

Key Hyperparameters:
- Max prompt length: 2048 tokens.
- Max response length: 12288 tokens.
- Overlong buffer: Enabled, 4096 tokens, penalty factor 1.0.
- Clip ratios: Low 0.2, High 0.28.
- Batch sizes: Train prompt 512, Generation prompt 1536, Mini-batch 32.
- Responses per prompt: 16.
- Temperature: 1.0, Top-p: 1.0, Top-k: -1 (vLLM rollout).
- Learning rate: 1e-6, Warmup steps: 10, Weight decay: 0.1.
- Epochs: 20, Nodes: 2, GPUs per node: 8.
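
To make the two clip ratios concrete, here is a minimal sketch of a PPO-style clipped objective with separate lower and upper bounds (0.2 and 0.28, as listed above). The function name and the per-token formulation are assumptions for illustration and are not taken from the training code.

```python
import math

CLIP_LOW = 0.2    # lower clip ratio from the hyperparameter list
CLIP_HIGH = 0.28  # upper clip ratio from the hyperparameter list

def clipped_token_loss(logp_new, logp_old, advantage):
    """Per-token policy loss with asymmetric clipping (hypothetical helper)."""
    ratio = math.exp(logp_new - logp_old)
    # Clamp the probability ratio to [1 - CLIP_LOW, 1 + CLIP_HIGH].
    clipped = min(max(ratio, 1.0 - CLIP_LOW), 1.0 + CLIP_HIGH)
    # Pessimistic (minimum) surrogate, negated so that lower is better.
    return -min(ratio * advantage, clipped * advantage)

# A token whose probability grew 1.5x with a positive advantage is capped at 1.28.
loss_up = clipped_token_loss(math.log(1.5), 0.0, 1.0)
# A token whose probability halved with a negative advantage is capped at 0.8.
loss_down = clipped_token_loss(math.log(0.5), 0.0, -1.0)
```

With a higher upper bound than lower bound, tokens with positive advantage are allowed a slightly larger probability increase before the objective stops rewarding it.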


Optimization:
- Loss aggregation: Token-mean.
- Gradient clipping: 1.0.
- Entropy coefficient: 0.
- FSDP: Parameter and optimizer offloading enabled.
- Sequence parallel size: 4.
- Dynamic batch size: Enabled.
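
The effect of token-mean loss aggregation can be shown with a small sketch that compares it against averaging per sequence first. Both function names and the toy loss values are assumptions for illustration only.

```python
def token_mean(per_sequence_losses):
    """Average over all tokens in the batch, so long responses weigh more."""
    tokens = [loss for seq in per_sequence_losses for loss in seq]
    return sum(tokens) / len(tokens)

def seq_mean_token_mean(per_sequence_losses):
    """Average within each sequence first, then across sequences (for contrast)."""
    per_seq = [sum(seq) / len(seq) for seq in per_sequence_losses]
    return sum(per_seq) / len(per_seq)

# Toy batch: one 4-token response with loss 1.0 per token, one 1-token response with loss 0.0.
batch = [[1.0, 1.0, 1.0, 1.0], [0.0]]
tm = token_mean(batch)             # 4.0 / 5 tokens
sm = seq_mean_token_mean(batch)    # (1.0 + 0.0) / 2 sequences
```

Under token-mean aggregation every token contributes equally to the gradient, regardless of which response it belongs to, which matters when responses can run up to 12288 tokens.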


Reward Model:
- Overlong buffer enabled with penalty factor 1.0.
- KL divergence in reward/loss: Disabled.
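
A sketch of how an overlong buffer penalty of this kind is commonly computed, using the values listed in this card (max response length 12288, buffer 4096, penalty factor 1.0). The function name and the linear form are assumptions based on the soft overlong punishment described for DAPO, not a copy of the training code.

```python
MAX_RESPONSE_LEN = 12288     # max response length from this card
OVERLONG_BUFFER_LEN = 4096   # overlong buffer from this card
PENALTY_FACTOR = 1.0         # penalty factor from this card

def overlong_penalty(response_len):
    """Length penalty added to the reward (hypothetical helper).

    Zero up to MAX_RESPONSE_LEN - OVERLONG_BUFFER_LEN tokens, then
    decreasing linearly to -PENALTY_FACTOR at MAX_RESPONSE_LEN.
    """
    expected_len = MAX_RESPONSE_LEN - OVERLONG_BUFFER_LEN
    exceed_len = response_len - expected_len
    return min(-exceed_len / OVERLONG_BUFFER_LEN * PENALTY_FACTOR, 0.0)

penalties = {n: overlong_penalty(n) for n in (4000, 8192, 10240, 12288)}
```

Under these assumptions, responses of up to 8192 tokens are not penalized, a 10240-token response loses 0.5 reward, and a response that reaches the 12288-token limit loses the full 1.0.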


Training reward score
![image.png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/JN6ClKWHfmZuV-uikFZLs.png)

Validation score during training
![image.png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/8ZULP6JjXkZiL5oZDazl5.png)

Response length
![image.png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/VnYGY1iIrLQweykmMYtk9.png)