How can I run this model with the vLLM backend?
When will vLLM support your work? Or would you mind telling me how to modify vLLM to support this model?
Hi @kentyuan123 , thank you for your interest in our work!
While I'd love to integrate DFloat11 into vLLM, I currently don't have the bandwidth to tackle this project. If you'd like to implement this integration yourself, I recommend taking a look at this specific function in our codebase: https://github.com/LeanModels/DFloat11/blob/75f7181dc1c7341920c50bd349bbe6949074675b/dfloat11/dfloat11.py#L173
This function handles replacing the original BFloat16 weights with DFloat11 weights and adds pre-forward hooks that perform on-the-fly decompression from DFloat11 to BFloat16.
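To illustrate the general pattern (not the actual DFloat11 implementation), here is a minimal PyTorch sketch of that idea: the dense weight is swapped out for a compressed blob, and a `register_forward_pre_hook` restores it just before the forward pass. The `compress`/`decompress` helpers below are hypothetical identity-style placeholders standing in for DFloat11's lossless entropy coding.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical placeholders for DFloat11's lossless codec:
# here they just round-trip the tensor through raw bytes.
def compress(w: torch.Tensor) -> bytes:
    return w.detach().contiguous().numpy().tobytes()

def decompress(blob: bytes, shape) -> torch.Tensor:
    arr = np.frombuffer(blob, dtype=np.float32).copy()
    return torch.from_numpy(arr).view(shape)

def attach_decompression_hook(linear: nn.Linear) -> None:
    # Mirror the idea (not the implementation) of the linked function:
    # keep only the compressed representation, then decompress on the fly.
    shape = linear.weight.shape
    linear._compressed = compress(linear.weight)
    linear.weight.data = torch.zeros_like(linear.weight)  # drop dense copy

    def pre_hook(module, inputs):
        # Runs right before forward: restore the BFloat16/FP32 weight.
        module.weight.data = decompress(module._compressed, shape)

    linear.register_forward_pre_hook(pre_hook)

layer = nn.Linear(4, 2, bias=False)
x = torch.ones(1, 4)
expected = layer(x)          # output with the original dense weight
attach_decompression_hook(layer)
out = layer(x)               # hook decompresses before this call
assert torch.allclose(out, expected)
```

A vLLM integration would need to apply the same replace-and-hook step to the model's linear layers after vLLM loads the weights, which is what makes the linked function the right starting point.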
Our models currently support inference with the `transformers` library. You can install the latest version via `pip install -U dfloat11[cuda12]` and follow the guide at https://github.com/LeanModels/DFloat11.