About
Bitsandbytes 4-bit quantized version of https://huggingface.co/sometimesanotion/Lamarck-14B-v0.7
Ideal for faster and cheaper GPU inference, e.g. in vLLM.
Stats from running on an RTX 4090:
- Model weights: 9.35 GiB
- Non-torch memory: 0.08 GiB
- PyTorch activation peak memory: 4.50 GiB
- Memory reserved for KV cache (remainder): 9.24 GiB
- Maximum concurrency for 32768 tokens per request: 1.54x
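A minimal sketch of serving this checkpoint offline with vLLM, assuming a recent vLLM release with bitsandbytes support; flag names such as `quantization`/`load_format` and the memory settings may need adjustment for your vLLM version and GPU.

```python
from vllm import LLM, SamplingParams

# Load the pre-quantized bnb-4bit checkpoint.
# Older vLLM versions require load_format="bitsandbytes" alongside
# quantization="bitsandbytes"; newer ones can detect it automatically.
llm = LLM(
    model="3WaD/Lamarck-14B-v0.7-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    max_model_len=32768,          # context length used for the stats above
    gpu_memory_utilization=0.95,  # tune for your GPU
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain 4-bit quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```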