Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic

SmoothQuant/GPTQ W8A8 quantization of https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

Creation

Created with llmcompressor using the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
import random

# Config
MODEL_ID = "/models/Llama-3_1-Nemotron-Ultra-253B-v1"
SAVE_DIR = "/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic"
NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096

# Load model
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=8, reserve_for_hessians=False, torch_dtype="auto", trust_remote_code=True,
)
print(device_map)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=1337).select(range(NUM_CALIBRATION_SAMPLES))

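# Randomly prepend Nemotron's "detailed thinking on/off" system prompt so that
# the calibration samples cover both reasoning modes.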
def add_system_prompt(messages):
    options = ["on", "off"]
    thinking = random.choice(options)
    return [{"content": f"detailed thinking {thinking}", "role": "system"}] + messages

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(add_system_prompt(example["messages"]), tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
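    # lm_head and the oversized layers 125, 134, 143 and 149 are excluded from GPTQ (see note below)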
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*125.*", "re:.*134.*", "re:.*143.*", "re:.*149.*"], dampening_frac=0.01, offload_hessians=False),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True
)

# Save the compressed model
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

Note that layers 125, 134, 143 and 149 had to be excluded from GPTQ quantization: due to their extreme size, GPTQ would have required allocating Hessian matrices of 600+ GB for them (which could not be offloaded for some reason). Furthermore, the GPU memory allocation code in calculate_offload_device_map() had to be adjusted.
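To spot such problematic modules up front, one can estimate the per-layer GPTQ Hessian footprint, which scales with in_features^2 in fp32. A minimal sketch, assuming the model is already loaded as above and using a hypothetical 100 GB cutoff:

import torch.nn as nn

# Rough per-layer GPTQ Hessian size: in_features^2 elements in fp32 (4 bytes each).
THRESHOLD_GB = 100  # hypothetical cutoff, not part of the original recipe
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        hessian_gb = module.in_features ** 2 * 4 / 1e9
        if hessian_gb > THRESHOLD_GB:
            print(f"{name}: ~{hessian_gb:.0f} GB Hessian")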

Evaluation

GSM8K (3 Runs)
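The tables below are in lm-evaluation-harness format. As a rough illustration (the exact invocation is not recorded here), a 5-shot GSM8K run could look like the following sketch, assuming lm-eval's Python API with a vLLM backend and 8-way tensor parallelism:

import lm_eval

# Sketch of a 5-shot GSM8K evaluation; model path and parallelism are assumptions.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic,tensor_parallel_size=8,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])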

Original

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9469 | ± | 0.0062 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9462 | ± | 0.0062 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9424 | ± | 0.0064 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9401 | ± | 0.0065 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9454 | ± | 0.0063 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9454 | ± | 0.0063 |
| Avg.  |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9449 | ± | 0.0036 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9439 | ± | 0.0037 |

Quantized

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9431 | ± | 0.0064 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9393 | ± | 0.0066 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9538 | ± | 0.0058 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9500 | ± | 0.0060 |
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9477 | ± | 0.0061 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9462 | ± | 0.0062 |
| Avg.  |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9482 | ± | 0.0035 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9452 | ± | 0.0036 |

simple-evals (10x50 Samples each)

Using a custom fork of OpenAI's simple-evals benchmark suite: https://github.com/Ithanil/simple-evals/tree/custom

These were run using the chat template as well as Nvidia's suggested settings (a request sketch follows the list):

  • Reasoning Off: Greedy (temperature=0), system prompt: detailed thinking off
  • Reasoning On: temperature=0.6, top_p=0.95, system prompt: detailed thinking on
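For illustration, a single request with these settings might look as follows, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a local vLLM server; URL and model name are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local server

# Reasoning On: temperature=0.6, top_p=0.95; for Reasoning Off use temperature=0
# and the system prompt "detailed thinking off".
response = client.chat.completions.create(
    model="Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "How many prime numbers are there below 100?"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)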

Original (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       92.6556 |       0.711437 |
| GPQA      |          43.2 |        2.04831 |
| HumanEval |          85.6 |        0.37238 |
| MGSM      |       90.9091 |        1.40836 |
| MMLU      |          84.6 |            0.6 |

Quantized (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       91.2381 |       0.843284 |
| GPQA      |          43.2 |       0.997775 |
| HumanEval |         85.08 |       0.430194 |
| MGSM      |       92.9091 |       0.994013 |
| MMLU      |          82.8 |        1.04137 |

All quantized evals are thus within the statistical error of the original model's evals.

Quantized (Reasoning On)

For completeness, here are also the results with Reasoning On:

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) |       89.8326 |        1.14615 |
| GPQA      |          61.2 |        1.81842 |
| HumanEval |            93 |       0.181353 |
| MGSM      |       94.9091 |       0.931048 |
| MMLU      |          85.2 |            0.8 |