ArcticSpeculator

Build the fastest OSS vLLM-based speculative decoding system for your own model, using ArcticTraining and ArcticInference!

We compare the throughput (tokens/s) of existing vLLM-based speculative decoding systems for Llama3.1-70B-Instruct on 8xH100 below:

| Method                        | ShareGPT | HumanEval |
|-------------------------------|----------|-----------|
| vLLM V1 baseline              | 84.1     | 84.1      |
| vLLM V1 Eagle                 | 102.2    | 112.0     |
| vLLM V1 Eagle3                | 77.7     | 85.3      |
| vLLM V0 MLP-Speculator (IBM)  | 77.9     | 66.7      |
| ArcticSpeculator              | 172.4    | 203.7     |
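To put the table in perspective, here is a small sketch that computes ArcticSpeculator's speedup over the vLLM V1 baseline from the throughput numbers above (the numbers are taken directly from the table; the variable names are illustrative only):

```python
# Throughput in tokens/s from the benchmark table above.
baseline = {"ShareGPT": 84.1, "HumanEval": 84.1}
arctic = {"ShareGPT": 172.4, "HumanEval": 203.7}

# Speedup = ArcticSpeculator throughput / baseline throughput.
speedups = {bench: arctic[bench] / baseline[bench] for bench in baseline}
for bench, s in speedups.items():
    print(f"{bench}: {s:.2f}x")
# ShareGPT: 2.05x
# HumanEval: 2.42x
```

That is roughly a 2x end-to-end speedup on ShareGPT and about 2.4x on HumanEval relative to running the same model without speculative decoding.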

For more details about ArcticSpeculator and how to use it, see:

We also release ArcticSpeculator checkpoints, trained with ArcticTraining, that you can run with ArcticInference:


Collection including Snowflake/Arctic-LSTM-Speculator-Qwen2.5-32B-Instruct