Wlong692 committed on
Commit
64a693a
·
1 Parent(s): 2510df9

Update throughput chart and improve README

Files changed (2)
  1. README.md +8 -1
  2. throughput.png +2 -2
README.md CHANGED
@@ -49,7 +49,14 @@ In terms of the evaluation of reasoning ability, Ring-lite-linear-preview achi
 
  ## Inference Speed
 
- To evaluate the generation throughput, we deploy Ring-lite-linear and the softmax-attention-based Ring-lite based on vLLM on a single NVIDIA A100 GPU. Specifically, the input sequence length is fixed to 1. The end-to-end (E2E) generation time required for generating output sequences of varying lengths is illustrated below. It is shown in the figure that at 32k output length, Ring-lite-linear-preview achieves 2.2× throughput of Ring-lite.
+ To evaluate the generation throughput, we deploy Ring-lite-linear and the softmax-attention-based Ring-lite based on vLLM on a single NVIDIA A100 GPU. We conduct two sets of experiments:
+
+ 1. **Long Input Evaluation**: We measure the time-to-first-token (TTFT) with varying input sequence lengths (from 512 to 384k tokens) using batch size 1 and TP=1. As shown in the top figure, at 384k input length, Ring-lite-linear achieves 3.5× faster TTFT compared to the softmax-attention-based model.
+
+ 2. **Long Output Evaluation**: We fix the input sequence length to 1 and measure the end-to-end (E2E) generation time required for generating output sequences of varying lengths (from 512 to 32k tokens) with batch size 64 and TP=1. As illustrated in the bottom figure, at 32k output length, Ring-lite-linear achieves 2.2× throughput of the softmax-attention-based Ring-lite.
+
+ These results demonstrate that our hybrid linear attention mechanism significantly improves both input processing efficiency and generation throughput, especially for long context scenarios.
+
 
  <p align="center">
  <img src="https://huggingface.co/inclusionAI/Ring-lite-linear-preview/resolve/main/throughput.png" width="600"/>
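
The long-output comparison in the added README text reduces to simple arithmetic: throughput is the number of generated tokens across the batch divided by the E2E time, and the reported figure is the ratio of the two models' throughputs. A minimal Python sketch of that calculation follows. It is not part of the commit; the function name and the timings are hypothetical placeholders rather than measured results, and treating 32k as 32 × 1024 tokens is an assumption.

```python
def throughput_tokens_per_s(batch_size: int, output_len: int, e2e_seconds: float) -> float:
    """Generated tokens per second, summed over every sequence in the batch."""
    if e2e_seconds <= 0:
        raise ValueError("e2e_seconds must be positive")
    return batch_size * output_len / e2e_seconds


# Settings taken from the README text: batch size 64, 32k output tokens.
# Assumption: "32k" means 32 * 1024 tokens.
batch_size = 64
output_len = 32 * 1024

# Placeholder E2E timings in seconds (hypothetical, NOT measured values).
baseline_e2e = 200.0   # softmax-attention model
candidate_e2e = 100.0  # linear-attention model

baseline_tp = throughput_tokens_per_s(batch_size, output_len, baseline_e2e)
candidate_tp = throughput_tokens_per_s(batch_size, output_len, candidate_e2e)

# With equal batch size and output length, the throughput ratio equals
# the inverse ratio of the E2E times.
ratio = candidate_tp / baseline_tp
print(f"throughput ratio: {ratio:.2f}x")
```

Because both runs generate the same number of tokens, the ratio depends only on the two E2E times, so a "2.2× throughput" claim is equivalent to the baseline taking 2.2 times as long end to end.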
throughput.png CHANGED

Git LFS Details

  • SHA256: a3d9280f3021a7ab00a19777b3bc7da7d10fe3b33ccd2fb0452ec05c148d2107
  • Pointer size: 131 Bytes
  • Size of remote file: 261 kB

Git LFS Details

  • SHA256: d4ff610059a82a76830676a14b8e66be3824fe188955a94ec5083bbf57fbab5c
  • Pointer size: 131 Bytes
  • Size of remote file: 467 kB