This is a Qwen2.5-VL-3B-Instruct model quantized with HQQ to 4-bit (group-size=64) across all linear layers.
First, install the dependencies:
pip install hqq gemlite; #gemlite is only needed for the gemlite backend
Then you can use the sample code below:
import torch
device = 'cuda:0'
backend = 'torchao_int4' #'torchao_int4' or 'gemlite'
compute_dtype = torch.bfloat16 if backend=="torchao_int4" else torch.float16
model_id = 'mobiuslabsgmbh/Qwen2.5-VL-3B-Instruct_4bitgs64_hqq_hf'
#Load model
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=compute_dtype,
device_map=device,
)
processor = AutoProcessor.from_pretrained(model_id)
#Patch the quantized layers to use the selected inference backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend=backend, verbose=True)
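After patching, the model is used like any other Qwen2.5-VL checkpoint in transformers. Below is a minimal inference sketch; the image URL and the prompt text are placeholders, not part of the original card:

#Example: describe an image (placeholder URL and prompt)
import requests
from PIL import Image

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])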
Use in vLLM:
pip install git+https://github.com/mobiusml/hqq/;
pip install git+https://github.com/mobiusml/gemlite/;
import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams
from hqq.utils.vllm import set_vllm_hqq_backend, VLLM_HQQ_BACKEND
set_vllm_hqq_backend(backend=VLLM_HQQ_BACKEND.GEMLITE) #use the GemLite kernels for the quantized layers
model_id = "mobiuslabsgmbh/Qwen2.5-VL-3B-Instruct_4bitgs64_hqq_hf"
llm = LLM(model=model_id, max_model_len=4096, max_num_seqs=2, limit_mm_per_prompt={"image": 1}, dtype=torch.float16)
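The SamplingParams import above can then drive generation. A minimal single-image sketch, assuming a local example.jpg and a hand-written Qwen2.5-VL chat prompt (both are placeholders):

from PIL import Image

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

#Single-image request; the image path and the question are placeholders
image = Image.open("example.jpg")
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image.<|im_end|>\n<|im_start|>assistant\n")

outputs = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}}, sampling_params)
print(outputs[0].outputs[0].text)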