This is a Qwen2.5-VL-3B-Instruct model quantized with HQQ to 4-bit (group-size=64) across all linear layers.
First, install the dependencies:
pip install hqq gemlite; #gemlite is only needed for the gemlite backend
Then you can use the sample code below:
import torch
device = 'cuda:0'
backend = 'torchao_int4' #'torchao_int4' or 'gemlite'
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
model_id = 'mobiuslabsgmbh/Qwen2.5-VL-3B-Instruct_4bitgs64_hqq_hf'
#Load model
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=compute_dtype,
device_map=device,
)
processor = AutoProcessor.from_pretrained(model_id)
#Patch the quantized layers to use the selected inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend=backend, verbose=True)
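After patching, the model is used like any other Qwen2.5-VL checkpoint in transformers. Below is a minimal inference sketch; the image URL and the prompt text are placeholders, not part of the original card:

#Example: describe an image (placeholder URL and prompt)
import requests
from PIL import Image

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])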
Use in vLLM:
pip install git+https://github.com/mobiusml/hqq/;
pip install git+https://github.com/mobiusml/gemlite/;
import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE) #use the GemLite kernels for the quantized layers
model_id = "mobiuslabsgmbh/Qwen2.5-VL-3B-Instruct_4bitgs64_hqq_hf"
llm = LLM(model=model_id, max_model_len=4096, max_num_seqs=2, limit_mm_per_prompt={"image": 1}, dtype=torch.float16)
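The SamplingParams import above can then drive generation. A minimal single-image sketch, assuming a local example.jpg and a hand-written Qwen2.5-VL chat prompt (both are placeholders):

from PIL import Image

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

#Single-image request; the image path and the question are placeholders
image = Image.open("example.jpg")
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image.<|im_end|>\n<|im_start|>assistant\n")

outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling_params)
print(outputs[0].outputs[0].text)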