Create README.md
README.md
ADDED
---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---

This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.
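
For context, the quantization recipe follows the linked FP8 Dynamic example. A minimal sketch, assuming `model` and `tokenizer` are already loaded (see the loading notes below) and that the `oneshot` import path matches this llmcompressor release (it has moved in newer ones):

```
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot

# FP8 weights are quantized per channel ahead of time; activations are quantized
# per token at runtime, so this scheme needs no calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Hypothetical output directory for the compressed checkpoint.
SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```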

You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:

```
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model skeleton on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60 GiB and spill the remainder to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Never split a single decoder layer across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
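
The resulting `device_map` is then passed to `from_pretrained` so the checkpoint is spread across the eight GPUs and offloaded to CPU RAM where needed. A minimal sketch of that load step (not spelled out above), reusing `model_name` and `device_map`:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the real weights according to the placement computed above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```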

Original model here: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B