Repository for:
ThinkEdit-deepseek-qwen-14b
(We also release ThinkEdit versions for ThinkEdit-deepseek-qwen-1.5b and ThinkEdit-deepseek-llama3-8b.)
Authors: Chung-En Sun, Ge Yan, Tsui-Wei Weng
Paper: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Code: https://github.com/Trustworthy-ML-Lab/ThinkEdit
Introduction
Reasoning-augmented models sometimes fail by generating overly short, abstract chain-of-thought (CoT) reasoning, hurting their accuracy.
ThinkEdit is a lightweight weight-editing method that:
- Identifies ~2% of "short reasoning" attention heads
- Edits only ~0.1% of total parameters
- Removes the "short reasoning" direction from their output
- Boosts performance, especially on cases with short reasoning traces
Full Performance Results
1. Overall Accuracy
Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
---|---|---|---|---|---|
deepseek-qwen-14b | 90.80 ± 0.36 | 95.08 ± 0.65 | 96.32 ± 0.35 | 90.25 ± 0.72 | 91.48 ± 0.55 |
ThinkEdit-deepseek-qwen-14b | 93.50 ± 0.31 | 96.53 ± 0.54 | 96.50 ± 0.46 | 91.15 ± 0.59 | 91.78 ± 0.58 |
deepseek-llama3-8b | 82.26 ± 0.91 | 96.01 ± 0.62 | 93.46 ± 0.84 | 85.49 ± 0.83 | 87.26 ± 1.16 |
ThinkEdit-deepseek-llama3-8b | 88.97 ± 0.78 | 96.08 ± 0.86 | 94.12 ± 0.47 | 85.91 ± 0.48 | 87.60 ± 0.81 |
deepseek-qwen-1.5b | 79.15 ± 1.08 | 68.52 ± 1.56 | 93.00 ± 0.33 | 75.48 ± 0.90 | 82.22 ± 1.29 |
ThinkEdit-deepseek-qwen-1.5b | 83.34 ± 0.79 | 86.24 ± 1.12 | 93.89 ± 0.76 | 74.94 ± 0.85 | 82.74 ± 0.77 |
2. Accuracy on Short Reasoning Cases (Top 5% / 10% / 20%)
Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
---|---|---|---|---|---|
deepseek-qwen-14b | 96.31 / 95.65 / 92.93 | 93.89 / 96.22 / 95.60 | 99.52 / 99.30 / 97.70 | 89.39 / 94.32 / 96.25 | 86.40 / 91.40 / 93.50 |
ThinkEdit-deepseek-qwen-14b | 96.62 / 96.03 / 96.12 | 96.11 / 96.22 / 96.27 | 100.00 / 99.77 / 98.85 | 95.76 / 97.65 / 98.07 | 89.60 / 92.60 / 94.70 |
deepseek-llama3-8b | 88.92 / 87.18 / 85.82 | 97.22 / 96.49 / 96.80 | 97.14 / 94.88 / 94.83 | 78.64 / 88.79 / 93.41 | 82.00 / 81.40 / 88.30 |
ThinkEdit-deepseek-llama3-8b | 97.08 / 95.27 / 93.95 | 97.78 / 98.65 / 97.87 | 100.00 / 99.30 / 98.62 | 95.61 / 96.89 / 97.12 | 92.80 / 93.60 / 94.40 |
deepseek-qwen-1.5b | 88.46 / 87.48 / 85.02 | 62.78 / 62.16 / 60.53 | 97.62 / 95.12 / 93.91 | 91.52 / 95.00 / 95.72 | 82.40 / 89.80 / 93.40 |
ThinkEdit-deepseek-qwen-1.5b | 92.46 / 92.37 / 92.05 | 77.22 / 80.54 / 79.73 | 96.19 / 95.81 / 97.36 | 93.79 / 95.83 / 95.80 | 92.80 / 94.40 / 94.90 |
3. Reasoning Lengths (Top 5% / 10% / 20% Shortest Responses)
Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
---|---|---|---|---|---|
deepseek-qwen-14b | 76.6 / 86.5 / 99.1 | 65.8 / 72.2 / 80.6 | 93.7 / 114.3 / 188.6 | 628.8 / 858.4 / 1125.9 | 198.7 / 434.3 / 697.0 |
ThinkEdit-deepseek-qwen-14b | 95.4 / 106.3 / 120.2 | 79.1 / 87.1 / 98.7 | 125.1 / 150.2 / 243.4 | 698.5 / 906.6 / 1157.2 | 270.2 / 492.6 / 733.3 |
deepseek-llama3-8b | 73.0 / 83.1 / 96.6 | 371.0 / 438.1 / 518.2 | 80.3 / 97.2 / 130.3 | 617.9 / 854.9 / 1126.5 | 159.5 / 357.5 / 644.5 |
ThinkEdit-deepseek-llama3-8b | 93.2 / 106.9 / 127.4 | 396.5 / 464.2 / 543.2 | 137.4 / 173.3 / 277.1 | 791.2 / 954.8 / 1185.1 | 305.2 / 506.3 / 737.6 |
deepseek-qwen-1.5b | 78.8 / 89.4 / 103.0 | 61.6 / 68.5 / 77.6 | 88.8 / 110.3 / 219.7 | 804.6 / 1017.9 / 1314.0 | 249.7 / 506.5 / 760.7 |
ThinkEdit-deepseek-qwen-1.5b | 97.2 / 109.4 / 126.3 | 75.9 / 85.0 / 99.5 | 127.9 / 174.1 / 416.4 | 818.0 / 984.5 / 1214.3 | 435.0 / 612.9 / 800.6 |
Usage
The usage of ThinkEdit models is exactly the same as the original deepseek-distilled models.
Citation
@misc{sun2025thinkedit,
title={ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models},
author={Chung-En Sun and Ge Yan and Tsui-Wei Weng},
year={2025},
eprint={2503.22048},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.22048},
}
- Downloads last month
- 36
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
🙋
Ask for provider support