ArcticSpeculator

Build the fastest OSS vLLM-based speculative decoding system for your own model, using ArcticTraining and ArcticInference!

We compare the throughput (tokens/s) of existing vLLM-based speculative decoding systems for Llama3.1-70B-Instruct on 8xH100 below:

| Method                        | ShareGPT | HumanEval |
|-------------------------------|----------|-----------|
| vLLM V1 baseline              | 84.1     | 84.1      |
| vLLM V1 Eagle                 | 102.2    | 112.0     |
| vLLM V1 Eagle3                | 77.7     | 85.3      |
| vLLM V0 MLP-Speculator (IBM)  | 77.9     | 66.7      |
| ArcticSpeculator              | 172.4    | 203.7     |
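To put the table in perspective, here is a small sketch that computes ArcticSpeculator's speedup over the vLLM V1 baseline from the throughput numbers above (the numbers are taken directly from the table; the variable names are illustrative only):

```python
# Throughput in tokens/s from the benchmark table above.
baseline = {"ShareGPT": 84.1, "HumanEval": 84.1}
arctic = {"ShareGPT": 172.4, "HumanEval": 203.7}

# Speedup = ArcticSpeculator throughput / baseline throughput.
speedups = {bench: arctic[bench] / baseline[bench] for bench in baseline}
for bench, s in speedups.items():
    print(f"{bench}: {s:.2f}x")
# ShareGPT: 2.05x
# HumanEval: 2.42x
```

That is roughly a 2x end-to-end speedup on ShareGPT and about 2.4x on HumanEval relative to running the same model without speculative decoding.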

For more details about ArcticSpeculator and how to use it, see:

We also release ArcticSpeculator checkpoints, trained with ArcticTraining, that you can run with ArcticInference:


Collection including Snowflake/Arctic-LSTM-Speculator-Qwen2.5-32B-Instruct