Geralt-Targaryen committed 9d0ad9a (verified) · Parent: 8615ce3

Update README.md

Files changed (1): README.md (+78 −3)

README.md:

---
license: apache-2.0
---

# QwQ-Math-7B-Persona

## Introduction

QwQ-Math-7B-Persona is fine-tuned from Qwen2.5-Math-7B-Instruct on 1 million persona-based synthetic math samples (see [this paper](https://arxiv.org/abs/2406.20094) for details on how the data is constructed).

Currently, QwQ-Math-7B-Persona is primarily meant to serve as a draft model for losslessly accelerating the inference of QwQ-32B via speculative decoding, but you may also use it as a standalone model.

## Quickstart

Here is a code snippet for using QwQ-Math-7B-Persona to accelerate the inference of QwQ-32B:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model: QwQ-32B-Preview, placed on GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B-Preview",
    torch_dtype="auto",
    device_map={'': 0}
)

# Draft model: QwQ-Math-7B-Persona, placed on the same GPU so that
# speculative decoding runs on a single device.
draft_model = AutoModelForCausalLM.from_pretrained(
    "Geralt-Targaryen/QwQ-Math-7B-Persona",
    torch_dtype="auto",
    device_map={'': 0}
)

# Both models share the QwQ tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing assistant_model enables assisted (speculative) decoding:
# the 7B draft model proposes tokens that the 32B target model verifies.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    assistant_model=draft_model
)
# Strip the prompt tokens from the output.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
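
If you would rather use QwQ-Math-7B-Persona as a standalone model (as noted in the introduction), the sketch below shows plain `transformers` generation. It assumes the repository ships the Qwen2.5-Math tokenizer and chat template, and the prompt and generation settings are only illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Standalone usage sketch: load only the 7B model (no target model, no speculation).
# Assumes the repository provides the Qwen2.5-Math tokenizer and chat template.
model = AutoModelForCausalLM.from_pretrained(
    "Geralt-Targaryen/QwQ-Math-7B-Persona",
    torch_dtype="auto",
    device_map={'': 0}
)
tokenizer = AutoTokenizer.from_pretrained("Geralt-Targaryen/QwQ-Math-7B-Persona")

messages = [
    {"role": "user", "content": "Solve 3x + 5 = 20 and explain each step."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```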

For the more advanced SVIP draft length policy, please refer to [this GitHub repo](https://github.com/Geralt-Targaryen/SVIP).
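
If you only want to adjust the fixed draft length used by `transformers`' built-in assisted generation (rather than the adaptive SVIP policy itself), a minimal sketch continuing the Quickstart snippet above is shown below. `num_assistant_tokens` and `num_assistant_tokens_schedule` are standard `GenerationConfig` fields; the values here are illustrative, not tuned recommendations.

```python
# Illustrative only: make the draft model propose a fixed number of tokens per step
# instead of transformers' default heuristic schedule.
draft_model.generation_config.num_assistant_tokens = 8
draft_model.generation_config.num_assistant_tokens_schedule = "constant"

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    assistant_model=draft_model
)
```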

## Citation

If you find QwQ-Math-7B-Persona to be helpful, please cite the following paper:

```
@misc{zhang2024svip,
      title={Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding},
      author={Ziyin Zhang and Jiahao Xu and Tian Liang and Xingyu Chen and Zhiwei He and Rui Wang and Zhaopeng Tu},
      year={2024},
      eprint={2411.18462},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.18462},
}
```