This is an FP8-Dynamic quantization of Qwen2.5-7B-Instruct-1M.

Qwen2.5-7B-Instruct-1M, developed by Alibaba Cloud, handles extremely long-context tasks with a context length of up to 1 million tokens, making it a strong choice for workloads that require extensive contextual understanding. Compared to the earlier Qwen2.5 128K version, it shows significantly improved performance on long sequences while remaining efficient on shorter tasks. Its architecture incorporates RoPE, SwiGLU, and RMSNorm, and it is a causal language model refined through both pretraining and post-training, producing coherent and contextually aware outputs across a wide range of scenarios.

Evaluations

This model achieves an accuracy recovery of 100.69% relative to the unquantized baseline.

English      Qwen2.5-7B-Instruct-1M   Qwen2.5-7B-Instruct-1M-FP8-Dynamic (this)
Avg.         69.31                    69.78
ARC          62.8                     63
Hellaswag    70.4                     70.4
MMLU         74.72                    75.95
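
As a quick sanity check, the accuracy-recovery figure follows directly from the table: it is the quantized model's average score divided by the baseline average. A minimal sketch of that arithmetic in Python:

# Per-benchmark scores copied from the table above (baseline vs. FP8-Dynamic).
baseline = {"ARC": 62.8, "Hellaswag": 70.4, "MMLU": 74.72}
quantized = {"ARC": 63.0, "Hellaswag": 70.4, "MMLU": 75.95}

baseline_avg = sum(baseline.values()) / len(baseline)     # ~69.31
quantized_avg = sum(quantized.values()) / len(quantized)  # ~69.78
recovery = 100 * quantized_avg / baseline_avg
print(f"accuracy recovery: {recovery:.2f}%")              # ~100.69%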

Evaluation was performed with the LM Evaluation Harness using limit=1000. We did not check for data contamination.

Usage

Install vLLM (pip install vllm) and start an OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic --max-model-len 262144 --gpu-memory-utilization 0.9
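
If you prefer offline inference over running a server, the same checkpoint can be loaded through vLLM's Python API. The sketch below is illustrative, assuming a recent vLLM release; the sampling parameters and the reduced context length are arbitrary choices, not recommendations from the model authors:

from vllm import LLM, SamplingParams

# Load the FP8-Dynamic checkpoint; max_model_len is capped here so it fits typical GPUs.
llm = LLM(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    max_model_len=262144,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)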

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
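
Since the server exposes an OpenAI-compatible API, you can also query it from Python with the openai client. A minimal sketch, assuming the server above is running locally (the api_key value is a placeholder; vLLM does not require one by default):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="cortecs/Qwen2.5-7B-Instruct-1M-FP8-Dynamic",
    prompt="San Francisco is a",
    max_tokens=64,
)
print(completion.choices[0].text)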

Model size: 7.62B params
Tensor types: BF16 · F8_E4M3

Base model: Qwen/Qwen2.5-7B