About

Bitsandbytes 4-bit quantized version of https://huggingface.co/sometimesanotion/Lamarck-14B-v0.7
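
A minimal loading sketch with Transformers, assuming `bitsandbytes` and `accelerate` are installed; the 4-bit quantization config is stored in the checkpoint, so no extra `BitsAndBytesConfig` should be needed:

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "3WaD/Lamarck-14B-v0.7-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The bnb 4-bit quantization_config ships inside the checkpoint,
# so from_pretrained restores the quantized weights directly.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Summarize 4-bit quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```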

Ideal for faster and cheaper GPU inference, e.g. with vLLM.
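
A serving sketch using vLLM's offline API; the exact flags vary with the vLLM version (recent releases can auto-detect the pre-quantized bitsandbytes checkpoint), so treat this as an assumption-laden example rather than the canonical invocation:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="3WaD/Lamarck-14B-v0.7-bnb-4bit",
    quantization="bitsandbytes",  # may be auto-detected on recent vLLM versions
    max_model_len=32768,          # matches the context length used for the stats below
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain 4-bit quantization in one paragraph."], params)
print(out[0].outputs[0].text)
```

The server-mode equivalent is roughly `vllm serve 3WaD/Lamarck-14B-v0.7-bnb-4bit --quantization bitsandbytes`, again depending on the installed vLLM version.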

Stats from running on an RTX 4090:

- model weights: 9.35 GiB
- non_torch_memory: 0.08 GiB
- PyTorch activation peak memory: 4.50 GiB
- remaining memory reserved for KV cache: 9.24 GiB
- maximum concurrency for 32768 tokens per request: 1.54x
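
As a rough sanity check on the 1.54x figure, a back-of-the-envelope sketch, assuming Lamarck-14B keeps the Qwen2.5-14B geometry (48 layers, 8 KV heads, head dim 128) and a bf16 KV cache; these architecture numbers are an assumption, not read from this repository:

```python
# Approximate KV-cache budget per 32768-token request
# (assumed Qwen2.5-14B geometry, bf16 cache).
layers, kv_heads, head_dim, bytes_per_elem = 48, 8, 128, 2
tokens = 32768

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2x for K and V
kv_gib_per_request = kv_bytes_per_token * tokens / 2**30                # ~6.0 GiB

print(round(9.24 / kv_gib_per_request, 2))  # ~1.54 concurrent 32k-token requests
```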

Safetensors model size: 8.37B params (tensor types F32, BF16, U8)
