Repository for:

ThinkEdit-deepseek-llama3-8b

(We also release ThinkEdit-deepseek-qwen-1.5b and ThinkEdit-deepseek-qwen-14b.)

Authors: Chung-En Sun, Ge Yan, Tsui-Wei Weng

Paper: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models

Code: https://github.com/Trustworthy-ML-Lab/ThinkEdit


Introduction

Reasoning-augmented models sometimes fail by generating overly short, abstract chain-of-thought (CoT) reasoning, hurting their accuracy.

ThinkEdit is a lightweight weight-editing method that:

  • Identifies the ~2% of attention heads most responsible for "short reasoning"
  • Removes the "short reasoning" direction from their output projections
  • Edits only ~0.1% of total parameters in doing so
  • Boosts accuracy, especially on cases with short reasoning traces
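The edit on each identified head can be pictured as projecting its output weights onto the orthogonal complement of a "short reasoning" direction in the residual stream. Below is a minimal numpy sketch of that idea, with random stand-in values for the direction and weights (not the released weights or implementation):

```python
import numpy as np

# Hypothetical illustration of the direction-removal idea:
# given a unit "short reasoning" direction d in the residual stream,
# edit a head's output projection W_o so that the head can no longer
# write anything along d.

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

d = rng.normal(size=d_model)
d /= np.linalg.norm(d)                    # unit "short reasoning" direction

W_o = rng.normal(size=(d_head, d_model))  # head output projection (stand-in)

# Project each row of W_o onto the orthogonal complement of d:
# W_o' = W_o (I - d d^T)
W_o_edited = W_o - np.outer(W_o @ d, d)

# The edited head's output now has zero component along d.
print(np.allclose(W_o_edited @ d, 0.0))   # True
```

Since only the output projections of the selected heads are modified, the total fraction of edited parameters stays tiny, consistent with the ~0.1% figure above.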

Full Performance Results

1. Overall Accuracy

| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---|---|---|---|---|---|
| deepseek-qwen-14b | 90.80 ± 0.36 | 95.08 ± 0.65 | 96.32 ± 0.35 | 90.25 ± 0.72 | 91.48 ± 0.55 |
| ThinkEdit-deepseek-qwen-14b | 93.50 ± 0.31 | 96.53 ± 0.54 | 96.50 ± 0.46 | 91.15 ± 0.59 | 91.78 ± 0.58 |
| deepseek-llama3-8b | 82.26 ± 0.91 | 96.01 ± 0.62 | 93.46 ± 0.84 | 85.49 ± 0.83 | 87.26 ± 1.16 |
| ThinkEdit-deepseek-llama3-8b | 88.97 ± 0.78 | 96.08 ± 0.86 | 94.12 ± 0.47 | 85.91 ± 0.48 | 87.60 ± 0.81 |
| deepseek-qwen-1.5b | 79.15 ± 1.08 | 68.52 ± 1.56 | 93.00 ± 0.33 | 75.48 ± 0.90 | 82.22 ± 1.29 |
| ThinkEdit-deepseek-qwen-1.5b | 83.34 ± 0.79 | 86.24 ± 1.12 | 93.89 ± 0.76 | 74.94 ± 0.85 | 82.74 ± 0.77 |

2. Accuracy on Short Reasoning Cases (Top 5% / 10% / 20%)

| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---|---|---|---|---|---|
| deepseek-qwen-14b | 96.31 / 95.65 / 92.93 | 93.89 / 96.22 / 95.60 | 99.52 / 99.30 / 97.70 | 89.39 / 94.32 / 96.25 | 86.40 / 91.40 / 93.50 |
| ThinkEdit-deepseek-qwen-14b | 96.62 / 96.03 / 96.12 | 96.11 / 96.22 / 96.27 | 100.00 / 99.77 / 98.85 | 95.76 / 97.65 / 98.07 | 89.60 / 92.60 / 94.70 |
| deepseek-llama3-8b | 88.92 / 87.18 / 85.82 | 97.22 / 96.49 / 96.80 | 97.14 / 94.88 / 94.83 | 78.64 / 88.79 / 93.41 | 82.00 / 81.40 / 88.30 |
| ThinkEdit-deepseek-llama3-8b | 97.08 / 95.27 / 93.95 | 97.78 / 98.65 / 97.87 | 100.00 / 99.30 / 98.62 | 95.61 / 96.89 / 97.12 | 92.80 / 93.60 / 94.40 |
| deepseek-qwen-1.5b | 88.46 / 87.48 / 85.02 | 62.78 / 62.16 / 60.53 | 97.62 / 95.12 / 93.91 | 91.52 / 95.00 / 95.72 | 82.40 / 89.80 / 93.40 |
| ThinkEdit-deepseek-qwen-1.5b | 92.46 / 92.37 / 92.05 | 77.22 / 80.54 / 79.73 | 96.19 / 95.81 / 97.36 | 93.79 / 95.83 / 95.80 | 92.80 / 94.40 / 94.90 |

3. Reasoning Lengths (Top 5% / 10% / 20% Shortest Responses)

| Model | GSM8K | MMLU Elementary Math | MATH-Level1 | MATH-Level5 | MATH-500 |
|---|---|---|---|---|---|
| deepseek-qwen-14b | 76.6 / 86.5 / 99.1 | 65.8 / 72.2 / 80.6 | 93.7 / 114.3 / 188.6 | 628.8 / 858.4 / 1125.9 | 198.7 / 434.3 / 697.0 |
| ThinkEdit-deepseek-qwen-14b | 95.4 / 106.3 / 120.2 | 79.1 / 87.1 / 98.7 | 125.1 / 150.2 / 243.4 | 698.5 / 906.6 / 1157.2 | 270.2 / 492.6 / 733.3 |
| deepseek-llama3-8b | 73.0 / 83.1 / 96.6 | 371.0 / 438.1 / 518.2 | 80.3 / 97.2 / 130.3 | 617.9 / 854.9 / 1126.5 | 159.5 / 357.5 / 644.5 |
| ThinkEdit-deepseek-llama3-8b | 93.2 / 106.9 / 127.4 | 396.5 / 464.2 / 543.2 | 137.4 / 173.3 / 277.1 | 791.2 / 954.8 / 1185.1 | 305.2 / 506.3 / 737.6 |
| deepseek-qwen-1.5b | 78.8 / 89.4 / 103.0 | 61.6 / 68.5 / 77.6 | 88.8 / 110.3 / 219.7 | 804.6 / 1017.9 / 1314.0 | 249.7 / 506.5 / 760.7 |
| ThinkEdit-deepseek-qwen-1.5b | 97.2 / 109.4 / 126.3 | 75.9 / 85.0 / 99.5 | 127.9 / 174.1 / 416.4 | 818.0 / 984.5 / 1214.3 | 435.0 / 612.9 / 800.6 |

Usage

ThinkEdit models are used in exactly the same way as the original DeepSeek-distilled models they are derived from; no special loading or inference code is required.
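For example, the model can be loaded with Hugging Face `transformers` like any other causal LM checkpoint. This is a standard loading sketch; the prompt and generation settings are illustrative, not recommendations from the authors:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cesun/ThinkEdit-deepseek-llama3-8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",   # checkpoint is stored in BF16
    device_map="auto",
)

# Build a chat-formatted prompt and generate, as with the original
# deepseek-distilled models.
messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```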

Citation

@misc{sun2025thinkedit,
      title={ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models},
      author={Chung-En Sun and Ge Yan and Tsui-Wei Weng},
      year={2025},
      eprint={2503.22048},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.22048},
}