---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
- biology
- chemistry
- finance
- legal
- climate
- medical
---

# Overview

This document presents the evaluation results of `DeepSeek-R1-Distill-Qwen-32B`, a **4-bit quantized model using GPTQ**, evaluated with the **Language Model Evaluation Harness** on the **ARC-Challenge** and **MMLU** benchmarks.

---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** |
|----------------------|-----------|-----------------|
| **ARC-Challenge** | `41.04%` | Raw accuracy |
| **MMLU** | `29.74%` | Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other |
| **MMLU-Humanities** | `32.05%` | Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence |
| **MMLU-Social-Sciences** | `30.32%` | Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology |
| **MMLU-Stem** | `27.50%` | Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics |
| **MMLU-Other** | `27.94%` | Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management |

---

## ⚙️ Model Configuration

- **Model:** `DeepSeek-R1-Distill-Qwen-32B`
- **Parameters:** `32 billion`
- **Quantization:** `4-bit GPTQ`
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `1780.502 seconds (~30 minutes)`

📌 **Interpretation:**

- The evaluation was performed on a **high-performance GPU (A100 80GB)**.
- The model is significantly larger than the previously evaluated 8B model, with **GPTQ 4-bit quantization reducing its memory footprint**.
- A **single-sample batch size** was used, which slows down evaluation.

---

## 📈 Performance Insights

- The `"higher_is_better"` flag confirms that **higher accuracy is preferred** for all reported metrics.
- **Quantization Impact:** **4-bit GPTQ quantization** reduces memory usage but may also slightly reduce accuracy.
- **Zero-shot Limitation:** The scores above are zero-shot; performance could improve with **few-shot prompting** (providing examples before testing). A reproduction sketch is included at the end of this card.

---

📌 Let us know if you need further analysis or model tuning! 🚀
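
## 🧪 Reproducing the Evaluation (Sketch)

The following is a minimal sketch of how an evaluation with this configuration could be reproduced through the harness's Python API. It is an illustration, not the exact invocation used for this card: the checkpoint path is a placeholder for the 4-bit GPTQ checkpoint, `lm-eval >= 0.4` is assumed, and loading a GPTQ checkpoint additionally requires a GPTQ backend (e.g. `optimum`/`auto-gptq`) to be installed.

```python
# Sketch (not the exact command used for this card) of a zero-shot
# ARC-Challenge / MMLU run with the Language Model Evaluation Harness.
# Assumes lm-eval >= 0.4; "path/to/DeepSeek-R1-Distill-Qwen-32B-GPTQ"
# is a placeholder for the quantized checkpoint evaluated here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                          # Hugging Face backend, matching "Source: hf" above
    model_args=(
        "pretrained=path/to/DeepSeek-R1-Distill-Qwen-32B-GPTQ,"
        "dtype=float16"                  # matches the torch.float16 precision above
    ),
    tasks=["arc_challenge", "mmlu"],
    num_fewshot=0,                       # zero-shot; increase for the few-shot runs suggested above
    batch_size=1,                        # matches the single-sample batch size reported here
)

# In lm-eval 0.4.x, per-task metrics are keyed like "acc,none".
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```

Raising `batch_size` (or passing `batch_size="auto"`) should shorten the roughly 30-minute wall-clock time reported above, at the cost of more GPU memory.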