This is an FP8-Dynamic quantization of Qwen2.5-7B-Instruct-1M.
Qwen2.5-7B-Instruct-1M, developed by Alibaba Cloud, stands out for its remarkable capability to handle extremely long-context tasks with a context length of up to 1 million tokens, positioning it as a top-tier choice for tasks requiring extensive contextual understanding. Compared to previous models like the Qwen2.5 128K version, it demonstrates significantly improved performance in processing long sequences while maintaining efficiency in shorter tasks. Its architecture incorporates advanced features such as RoPE, SwiGLU, and RMSNorm, enhancing its effectiveness and robustness in various scenarios. The model's design as a causal language model ensures that it excels in both pretraining and post-training stages, making it a versatile tool for generating coherent and contextually aware language outputs.
## Evaluations
This model provides an accuracy recovery of 100.69% relative to the unquantized Qwen2.5-7B-Instruct-1M baseline.
| English | Qwen2.5-7B-Instruct-1M | Qwen2.5-7B-Instruct-1M-FP8-Dynamic (this model) |
|---|---|---|
| Avg. | 69.31 | 69.78 |
| ARC | 62.8 | 63 |
| Hellaswag | 70.4 | 70.4 |
| MMLU | 74.72 | 75.95 |
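The recovery figure is presumably the ratio of the two average scores above; a minimal sanity check of that arithmetic:

```python
# Hypothetical reconstruction of the accuracy-recovery figure from the table averages.
baseline_avg = 69.31   # Qwen2.5-7B-Instruct-1M (unquantized) average
quantized_avg = 69.78  # FP8-Dynamic (this model) average
recovery_pct = quantized_avg / baseline_avg * 100
print(f"{recovery_pct:.2f}%")  # ~100.68%; the stated 100.69% likely uses unrounded per-task scores
```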
We did not check for data contamination.
Evaluation was done using the LM Evaluation Harness with `limit=1000`.
## Usage
Install vLLM and run the server:

```shell
python -m vllm.entrypoints.openai.api_server --model cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic --max-model-len 262144 --gpu-memory-utilization 0.9
```
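Alternatively, for offline batch inference without the API server, the sketch below uses vLLM's Python `LLM` API. The context length and memory settings mirror the server command above; the sampling parameters are illustrative assumptions and may need tuning for your hardware:

```python
from vllm import LLM, SamplingParams

# Offline inference sketch; parameters mirror the server command above and may need adjusting.
llm = LLM(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    max_model_len=262144,
    gpu_memory_utilization=0.9,
)
params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```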
Once the server is running, access the model:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
```
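The server exposes an OpenAI-compatible API, so it can also be queried from Python. A minimal sketch using the `openai` client package, assuming the server above is running on `localhost:8000` (the `api_key` value is a placeholder, since vLLM does not require one by default):

```python
from openai import OpenAI

# Points the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(completion.choices[0].text)
```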