Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic

SmoothQuant/GPTQ W8A8 quantization of https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

Creation

Created with llmcompressor using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
import random

# Config
MODEL_ID = "/models/Llama-3_1-Nemotron-Ultra-253B-v1"
SAVE_DIR = "/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic"
NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096

# Load model
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=8, reserve_for_hessians=False, torch_dtype="auto", trust_remote_code=True,
)
print(device_map)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=1337).select(range(NUM_CALIBRATION_SAMPLES))

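# Randomly prepend Nemotron's "detailed thinking on/off" system prompt so that
# the calibration samples cover both reasoning modes.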
def add_system_prompt(messages):
    options = ["on", "off"]
    thinking = random.choice(options)
    return [{"content": f"detailed thinking {thinking}", "role": "system"}] + messages

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(add_system_prompt(example["messages"]), tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
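    # lm_head and the oversized layers 125, 134, 143 and 149 are excluded from GPTQ (see note below)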
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*125.*", "re:.*134.*", "re:.*143.*", "re:.*149.*"], dampening_frac=0.01, offload_hessians=False),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True
)

# Save the compressed model
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

Note that layers 125, 134, 143 and 149 had to be excluded from GPTQ quantization: due to their extreme size, GPTQ would have required allocating Hessian matrices of 600+ GB for them (which could not be offloaded for some reason). Furthermore, the GPU memory allocation code in calculate_offload_device_map() had to be adjusted.
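To spot such problematic modules up front, one can estimate the per-layer GPTQ Hessian footprint, which scales with in_features^2 in fp32. A minimal sketch, assuming the model is already loaded as above and using a hypothetical 100 GB cutoff:

import torch.nn as nn

# Rough per-layer GPTQ Hessian size: in_features^2 elements in fp32 (4 bytes each).
THRESHOLD_GB = 100  # hypothetical cutoff, not part of the original recipe
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        hessian_gb = module.in_features ** 2 * 4 / 1e9
        if hessian_gb > THRESHOLD_GB:
            print(f"{name}: ~{hessian_gb:.0f} GB Hessian")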

Evaluation

GSM8K (3 Runs)
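The tables below are in lm-evaluation-harness format. As a rough illustration (the exact invocation is not recorded here), a 5-shot GSM8K run could look like the following sketch, assuming lm-eval's Python API with a vLLM backend and 8-way tensor parallelism:

import lm_eval

# Sketch of a 5-shot GSM8K evaluation; model path and parallelism are assumptions.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic,tensor_parallel_size=8,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])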

Original

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9469 | ± | 0.0062 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9462 | ± | 0.0062 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9424 | ± | 0.0064 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9401 | ± | 0.0065 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9454 | ± | 0.0063 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9454 | ± | 0.0063 |
| Avg.  |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9449 | ± | 0.0036 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9439 | ± | 0.0037 |

Quantized

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9431 | ± | 0.0064 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9393 | ± | 0.0066 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9538 | ± | 0.0058 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9500 | ± | 0.0060 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9477 | ± | 0.0061 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9462 | ± | 0.0062 |
| Avg.  |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9482 | ± | 0.0035 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9452 | ± | 0.0036 |

simple-evals (10x50 Samples each)

Using a custom fork of OpenAI's simple-evals benchmark suite: https://github.com/Ithanil/simple-evals/tree/custom

These were run using the chat template as well as Nvidia's suggested settings (a request sketch follows the list):

  • Reasoning Off: Greedy (temperature=0), system prompt: detailed thinking off
  • Reasoning On: temperature=0.6, top_p=0.95, system prompt: detailed thinking on
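For illustration, a single request with these settings might look as follows, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a local vLLM server; URL and model name are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local server

# Reasoning On: temperature=0.6, top_p=0.95; for Reasoning Off use temperature=0
# and the system prompt "detailed thinking off".
response = client.chat.completions.create(
    model="Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "How many prime numbers are there below 100?"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)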

Original (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       92.6556 |       0.711437 |
| GPQA      |          43.2 |        2.04831 |
| HumanEval |          85.6 |        0.37238 |
| MGSM      |       90.9091 |        1.40836 |
| MMLU      |          84.6 |            0.6 |

Quantized (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       91.2381 |       0.843284 |
| GPQA      |          43.2 |       0.997775 |
| HumanEval |         85.08 |       0.430194 |
| MGSM      |       92.9091 |       0.994013 |
| MMLU      |          82.8 |        1.04137 |

All quantized evals are thus within the statistical error of the original model's evals.

Quantized (Reasoning On)

For completeness, here are also the results with Reasoning On:

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       89.8326 |        1.14615 |
| GPQA      |          61.2 |        1.81842 |
| HumanEval |            93 |       0.181353 |
| MGSM      |       94.9091 |       0.931048 |
| MMLU      |          85.2 |            0.8 |