Phi-4-mini-instruct model quantized with torchao int4 weight-only quantization using gemlite kernels, by the PyTorch team.

Installation

pip install transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/mobiusml/gemlite/
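
A quick way to verify the environment is to import the key packages and print their versions. This is a minimal sanity-check sketch; gemlite is imported only to confirm it is installed.

import torch, torchao, vllm
print("torch", torch.__version__)
print("torchao", torchao.__version__)
print("vllm", vllm.__version__)
import gemlite  # only checking that the gemlite kernels are importable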

Quantization Recipe

We used the following code to get the quantized model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import GemliteUIntXWeightOnlyConfig
quant_config = GemliteUIntXWeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-int4wo-gemlite"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# Local Benchmark
import torch.utils.benchmark as benchmark
import torchao

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(2):
        f(*args, **kwargs)

    # Time the call with torch.utils.benchmark and report the mean latency in seconds
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

Install lm-evaluation-harness from source to get the most recent updates:

pip install git+https://github.com/EleutherAI/lm-evaluation-harness

baseline

lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

int4wo-gemlite

lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-gemlite --tasks hellaswag --device cuda:0 --batch_size 8
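
The same evaluation can also be driven from Python through lm-evaluation-harness's simple_evaluate API. This is a minimal sketch mirroring the CLI command above; the task list, device, and batch size are the same illustrative settings.

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=jerryzh168/phi4-mini-int4wo-gemlite",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])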

TODO: more complete eval results

| Benchmark                    | Phi-4 mini-Ins | phi4-mini-int4wo-gemlite |
|------------------------------|----------------|--------------------------|
| Popular aggregated benchmark |                |                          |
| Reasoning                    |                |                          |
| HellaSwag                    | 54.57          | 53.51                    |
| Multilingual                 |                |                          |
| Math                         |                |                          |
| Overall                      | TODO           | TODO                     |

Model Performance

Our int4 weight-only quantization is optimized for batch size 1, so we only benchmark batch size 1 performance with vLLM here.

Download the vLLM source code and install vLLM from source:

git clone git@github.com:vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .
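
Before running the benchmarks, you can sanity-check that the quantized checkpoint loads in vLLM with a quick offline generation. This is a minimal sketch using vLLM's offline LLM API; the prompt and sampling settings are illustrative.

from vllm import LLM, SamplingParams

# Load the quantized checkpoint; the quantization settings are read from the model repo
llm = LLM(model="jerryzh168/phi4-mini-int4wo-gemlite")

sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hey, are you conscious? Can you talk to me?"], sampling_params)
print(outputs[0].outputs[0].text)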

Download dataset

Download the ShareGPT dataset:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks

benchmark_latency

Run the following from the vLLM source root folder:

baseline

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

int4wo-gemlite

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/phi4-mini-int4wo-gemlite --batch-size 1

benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following from the vLLM source root folder:

baseline

Server:

vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

int4wo-gemlite

Server:

vllm serve jerryzh168/phi4-mini-int4wo-gemlite --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-int4wo-gemlite --num-prompts 1

Serving with vllm

We can use the same command as in the serving benchmarks to serve the model with vLLM:

vllm serve jerryzh168/phi4-mini-int4wo-gemlite --tokenizer microsoft/Phi-4-mini-instruct -O3
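
Once the server is up, it exposes an OpenAI-compatible API on port 8000 by default. Below is a minimal client sketch assuming the openai Python package is installed and the default host/port; the prompt is illustrative.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; vLLM does not check the API key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="jerryzh168/phi4-mini-int4wo-gemlite",
    messages=[{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)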