This is an FP8-Dynamic quantization of Qwen2.5-7B-Instruct-1M.
Qwen2.5-7B-Instruct-1M, developed by Alibaba Cloud, stands out for its remarkable capability to handle extremely long-context tasks with a context length of up to 1 million tokens, positioning it as a top-tier choice for tasks requiring extensive contextual understanding. Compared to previous models like the Qwen2.5 128K version, it demonstrates significantly improved performance in processing long sequences while maintaining efficiency in shorter tasks. Its architecture incorporates advanced features such as RoPE, SwiGLU, and RMSNorm, enhancing its effectiveness and robustness in various scenarios. The model's design as a causal language model ensures that it excels in both pretraining and post-training stages, making it a versatile tool for generating coherent and contextually aware language outputs.
## Evaluations
This model provides an accuracy recovery of 100.69% relative to the unquantized Qwen2.5-7B-Instruct-1M baseline.
| English | Qwen2.5-7B-Instruct-1M | Qwen2.5-7B-Instruct-1M-FP8-Dynamic (this model) |
|---|---|---|
| Avg. | 69.31 | 69.78 |
| ARC | 62.8 | 63 |
| Hellaswag | 70.4 | 70.4 |
| MMLU | 74.72 | 75.95 |
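The recovery figure is presumably the ratio of the two average scores above; a minimal sanity check of that arithmetic:

```python
# Hypothetical reconstruction of the accuracy-recovery figure from the table averages.
baseline_avg = 69.31   # Qwen2.5-7B-Instruct-1M (unquantized) average
quantized_avg = 69.78  # FP8-Dynamic (this model) average
recovery_pct = quantized_avg / baseline_avg * 100
print(f"{recovery_pct:.2f}%")  # ~100.68%; the stated 100.69% likely uses unrounded per-task scores
```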
We did not check for data contamination.
Evaluation was done using the LM Evaluation Harness with `limit=1000`.
## Usage
Install vLLM and run the server:

```shell
python -m vllm.entrypoints.openai.api_server --model cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic --max-model-len 262144 --gpu-memory-utilization 0.9
```
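Alternatively, for offline batch inference without the API server, the sketch below uses vLLM's Python `LLM` API. The context length and memory settings mirror the server command above; the sampling parameters are illustrative assumptions and may need tuning for your hardware:

```python
from vllm import LLM, SamplingParams

# Offline inference sketch; parameters mirror the server command above and may need adjusting.
llm = LLM(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    max_model_len=262144,
    gpu_memory_utilization=0.9,
)
params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```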
Once the server is running, access the model:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
```
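The server exposes an OpenAI-compatible API, so it can also be queried from Python. A minimal sketch using the `openai` client package, assuming the server above is running on `localhost:8000` (the `api_key` value is a placeholder, since vLLM does not require one by default):

```python
from openai import OpenAI

# Points the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(completion.choices[0].text)
```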