I can't run any of the dynamic bnb-4bit quants with TextGenerationInference
#6 by v3ss0n · opened
Here are the options I used:
"--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"
Docker Compose file:
```yaml
text-generation-inference:
  image: ghcr.io/huggingface/text-generation-inference:3.1.0
  environment:
    - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
  ports:
    - "0.0.0.0:8099:80"
  restart: "unless-stopped"
  command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0', '1']
            capabilities: [gpu]
  shm_size: '90g'
  volumes:
    - ~/.hf-docker-data:/data
  networks:
    - llmhost
```
Error:

```
text-generation-inference-1 | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
```
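For context, the assertion above comes from tensor-parallel sharding: a weight dimension must divide evenly across the shards before it can be split, and here a size-1 tensor (likely a bnb-4bit quantization-state tensor from the dynamic quant) cannot be split across 2 GPUs. Here is a simplified sketch of that kind of check (the function name `shard_dim` is mine, not TGI's actual code):

```python
def shard_dim(size: int, num_shards: int) -> int:
    """Simplified sketch of a tensor-parallel sharding check:
    a dimension can only be split if it divides evenly across shards."""
    assert size % num_shards == 0, (
        f"The choosen size {size} is not compatible with sharding on {num_shards} shards"
    )
    return size // num_shards

# A normal hidden dimension splits fine across 2 GPUs:
shard_dim(4096, 2)   # -> 2048 per shard
# A size-1 quant-state tensor cannot be split, reproducing the error:
# shard_dim(1, 2)    # raises AssertionError
```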
I also opened an issue at TGI. Not sure which side has the problem:
https://github.com/huggingface/text-generation-inference/issues/3005
Thanks, honestly I have never seen this error before, but please note you are using our dynamic quant, which might not be supported. Instead, use the basic BNB version.
The basic BNB JIT quant works fine, but I wanted to use the dynamic quants.
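Since the assertion is raised specifically by the sharding path, one untested workaround might be to run the dynamic quant unsharded on a single GPU (this sacrifices tensor parallelism across the two cards; whether the 8B 4-bit model fits on one GPU depends on your hardware):

```yaml
# Hypothetical single-shard variant of the service above:
# drop --sharded/--num-shard and reserve only one GPU
text-generation-inference:
  image: ghcr.io/huggingface/text-generation-inference:3.1.0
  environment:
    - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
  command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
```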
v3ss0n changed discussion title from "I can't run any of the bnb-4bit quants with TextGenerationInference" to "I can't run any of the dynamic bnb-4bit quants with TextGenerationInference"