I can't run any of the dynamic bnb-4bit quants with TextGenerationInference

#6
by v3ss0n - opened

Here are the options I used:

"--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"

Docker compose file:

  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:3.1.0
    environment:
      - MODEL_ID=unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit
    ports:
      - "0.0.0.0:8099:80"
    restart: "unless-stopped"
    command: "--quantize bitsandbytes-fp4 --max-input-tokens 30000 --sharded true --num-shard 2"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    shm_size: '90g'
    volumes:
      - ~/.hf-docker-data:/data
    networks:
      - llmhost

Error :

text-generation-inference-1  | [rank1]: AssertionError: The choosen size 1 is not compatible with sharding on 2 shards rank=1
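A guess at what is going on (my reading of the failure, not confirmed on either side): TGI's tensor-parallel loader slices each weight along one dimension across the shards and asserts that the dimension is divisible by the shard count. A pre-quantized dynamic checkpoint can contain tensors with packed or scalar shapes, which would explain the "size 1" in the message. A hypothetical sketch of such a check (names are illustrative, not TGI's actual code):

```python
# Hypothetical sketch of the divisibility check behind the error above
# (function name and message wiring are illustrative, not TGI's code).
def shard_size(size: int, world_size: int) -> int:
    """Return the per-shard slice length for a dimension of `size`."""
    assert size % world_size == 0, (
        f"The choosen size {size} is not compatible "
        f"with sharding on {world_size} shards"
    )
    return size // world_size

# A normal weight dimension splits cleanly across 2 shards.
print(shard_size(4096, 2))  # -> 2048
```

A tensor whose split dimension is 1 (e.g. a packed quantization scale) trips the assertion with `world_size=2`, producing a message of exactly the shape above.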

I also opened an issue on the TGI side; I'm not sure which side has the problem:

https://github.com/huggingface/text-generation-inference/issues/3005

Unsloth AI org

Thanks, honestly I have never seen this error before - but please note you are using our dynamic quant, which might not be supported. Instead, use the basic BnB version.

The basic BnB JIT quant works fine; I wanted to use the dynamic quants.
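For readers hitting the same wall: the "basic BnB JIT quant" path quantizes the full-precision weights on the fly at load time, and the same thing can be requested outside TGI through transformers' `BitsAndBytesConfig`. A minimal sketch, assuming the unquantized base checkpoint is used (model id and dtype choices are illustrative, not taken from the thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# JIT 4-bit quantization: weights are quantized while loading,
# rather than read from a pre-quantized dynamic checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # "fp4" would mirror --quantize bitsandbytes-fp4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Illustrative base model id (an assumption, not confirmed in the thread).
model_id = "unsloth/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

This sidesteps the sharded-loading path entirely, at the cost of quantizing on every load instead of downloading an already-quantized checkpoint.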

v3ss0n changed discussion title from I can't run any of the bnb-4bit quants with TextGenerationInference to I can't run any of the dynamic bnb-4bit quants with TextGenerationInference