---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---

This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant of [Llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B).
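
For reference, the linked FP8 Dynamic example applies the scheme as a one-shot recipe with no calibration data. A minimal sketch of that flow, assuming the llmcompressor v0.4.x API and an already-loaded `model` and `tokenizer` (the save path is a placeholder):

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8_DYNAMIC: static per-channel FP8 weights plus dynamic per-token FP8
# activations; no calibration dataset is required. lm_head stays in
# higher precision, as in the upstream example.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # placeholder output path
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```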

You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Instantiate the model on the meta device (no memory allocated) so a
# device map can be inferred without materializing 405B parameters.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60GiB (leaving headroom on 80GiB H100s)
# and spill the rest of the weights to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],  # never split a decoder layer
)
```
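
The snippet stops after computing `device_map`; the natural next step (not shown in the original) is to reload the real weights with that map before applying the FP8 Dynamic recipe. A hedged sketch of that continuation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload with real weights, sharded across the eight GPUs with CPU
# offload according to the inferred device_map; quantization (e.g. the
# oneshot() recipe sketched above) then runs on this dispatched model.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```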

Original model: [allenai/Llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B)
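
FP8 checkpoints produced by llmcompressor are meant to be served with [vLLM](https://github.com/vllm-project/vllm). A hedged usage sketch; the repo id is a placeholder for wherever this quant is hosted, and `tensor_parallel_size=8` assumes an 8-GPU node:

```python
from vllm import LLM, SamplingParams

# vLLM reads the compressed-tensors config from the checkpoint and
# selects the FP8 kernels automatically; no quantization flag needed.
llm = LLM(model="path/to/this-quant", tensor_parallel_size=8)  # placeholder id
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```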