Qwen-7B-Chat
🤗 Hugging Face | 🤖 ModelScope | 📑 Paper | 🖥️ Demo
WeChat (微信) | Discord | API
Introduction

Qwen-7B is the 7B-parameter model of the Qwen (abbr. of Tongyi Qianwen) large language model series proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large and diverse corpus, including web texts, professional books, code, and more. On top of the pretrained Qwen-7B, we use alignment techniques to build Qwen-7B-Chat, an AI assistant based on the large language model. Both the pretrained and chat models have since been updated to versions with better performance than the ones originally released. This repository is for Qwen-7B-Chat.

For more details about Qwen, please refer to the GitHub code repository.
Requirements

- Python 3.8 or above
- PyTorch 1.12 or above (2.0 or above is recommended)
- CUDA 11.4 or above is recommended (relevant for GPU users, flash-attention users, etc.)
Dependencies

To run Qwen-7B-Chat, please make sure the above requirements are met, and then install the dependent libraries with the following pip command:
pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed
In addition, we recommend installing the flash-attention library (flash attention 2 is now supported) for higher efficiency and lower memory usage:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# The installs below are optional and may be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
Quickstart

Below is an example of a multi-turn conversation with Qwen-7B-Chat:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)  # You can specify different generation lengths, top_p, and other related hyperparameters
# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。
# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
# 故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
# 为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
# 毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
# 最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
# 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。
# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
# 《奋斗创业：一个年轻人的成功之路》
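As the commented-out line above suggests, you can also tweak decoding behavior by editing model.generation_config before calling model.chat. Below is a minimal sketch; it uses standard Hugging Face GenerationConfig fields, and which fields the chat method honors is ultimately determined by the model's remote-code implementation:

```python
# Minimal sketch: adjust generation hyperparameters before chatting.
# The values below are illustrative, not the model's recommended defaults.
model.generation_config.top_p = 0.8           # nucleus sampling threshold
model.generation_config.max_new_tokens = 512  # cap the length of each reply

response, history = model.chat(tokenizer, "你好", history=None)
print(response)
```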
For more usage instructions, please refer to our GitHub repo.
Tokenizer
Our tokenizer, based on tiktoken, is different from other tokenizers such as the sentencepiece tokenizer. You need to pay attention to special tokens, especially during finetuning. For more detailed information on the tokenizer and its use in finetuning, please refer to the documentation.
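As a quick illustration, the sketch below loads the tokenizer and round-trips a short string using the standard PreTrainedTokenizer API; it is only a basic usage sketch, not a guide to the special tokens that matter for finetuning (see the linked documentation for those):

```python
# Minimal sketch of basic tokenizer usage via the standard PreTrainedTokenizer API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

text = "Qwen uses a tiktoken-based tokenizer."
ids = tokenizer.encode(text)
print(ids)                     # token ids
print(tokenizer.decode(ids))   # round-trips back to the original text
print(tokenizer.vocab_size)    # vocabulary size (roughly 150K tokens)
```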
Quantization

Usage
Note: We now provide a quantization solution based on AutoGPTQ and release an Int4 quantized model for Qwen-7B-Chat (click here). Compared with the previous solution, it achieves nearly lossless benchmark results while requiring less memory and running faster at inference.

Below we demonstrate how to use the Int4 quantized model for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 or above, transformers 4.32.0 or above, etc.) and install the required packages:
pip install auto-gptq optimum
If you encounter problems installing auto-gptq, we advise you to check out the official repo to find a suitable pre-built wheel.

Then you can load the quantized model easily and run inference just as usual:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
Performance

We evaluated the BF16, Int8, and Int4 models on benchmarks in a zero-shot setting and found that the quantized models suffer only minor performance degradation. Results are shown below:
Quantization | MMLU | C-Eval (val) | GSM8K | HumanEval |
---|---|---|---|---|
BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
Int8 | 55.4 | 59.4 | 48.3 | 34.8 |
Int4 | 55.1 | 59.2 | 49.7 | 29.9 |
Inference Speed

We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under different quantization levels and different versions of flash-attention, respectively:
Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
---|---|---|---|
BF16 | v2 | 40.93 | 36.14 |
Int8 | v2 | 37.47 | 32.54 |
Int4 | v2 | 50.09 | 38.61 |
BF16 | v1 | 40.75 | 35.34 |
Int8 | v1 | 37.51 | 32.39 |
Int4 | v1 | 45.98 | 36.47 |
BF16 | Disabled | 37.55 | 33.56 |
Int8 | Disabled | 37.84 | 32.65 |
Int4 | Disabled | 48.12 | 36.70 |
In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 8192 generated tokens.

Note: The generation speed of the Int4/Int8 models above is measured with the autogptq library. Models loaded via AutoModelForCausalLM.from_pretrained are currently about 20% slower. We have reported this issue to the HuggingFace team and will update here once a solution is available.
GPU Memory Usage

We also profiled the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under different quantization levels. (GPU memory usage is similar with and without flash-attention.) The results are shown below:
Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
---|---|---|
BF16 | 16.99GB | 22.53GB |
Int8 | 11.20GB | 16.62GB |
Int4 | 8.21GB | 13.63GB |
The above speed and memory profiling were conducted using this script.
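If you want a rough, unofficial reproduction of these measurements, the sketch below times a single chat call and records peak GPU memory with standard PyTorch utilities; it is not the official profiling script linked above, and absolute numbers will vary with your hardware and software stack:

```python
# Rough profiling sketch (assumes `model` and `tokenizer` are loaded as in the
# Quickstart and that a CUDA GPU is available). Not the official script.
import time
import torch

torch.cuda.reset_peak_memory_stats()

start = time.time()
response, _ = model.chat(tokenizer, "你好", history=None)
elapsed = time.time() - start

n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.1f} tokens/s)")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```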
Model Details

The details of the model architecture of Qwen-7B-Chat (identical to the Qwen-7B pretrained model) are listed as follows:
Hyperparameter | Value |
---|---|
n_layers | 32 |
n_heads | 32 |
d_model | 4096 |
vocab size | 151851 |
sequence length | 8192 |
For position encoding, the FFN activation function, and normalization, we adopt the prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm (with optional flash-attention acceleration).

For tokenization, compared with current mainstream open-source models that mainly use Chinese and English vocabularies, Qwen-7B-Chat uses a vocabulary of about 150K tokens. Built on the cl100k_base BPE vocabulary used by GPT-4, it is optimized for Chinese and multiple other languages: on top of efficient encoding and decoding of Chinese, English, and code, it is also more friendly to several other languages, enabling users to enhance the capability for those languages directly without expanding the vocabulary. It splits numbers into single digits and uses the efficient tiktoken library for tokenization.
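If you want to verify these hyperparameters yourself, you can read them from the model's config. The sketch below is a minimal example; the attribute names queried are assumptions based on common conventions, so fall back to printing the whole config to see the authoritative field names:

```python
# Minimal sketch: inspect architecture hyperparameters from the config.
# The attribute names below are assumptions; print(config) shows the
# authoritative field names used by this model's remote-code config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
print(config)  # full config: layer count, hidden size, vocab size, etc.

for name in ("num_hidden_layers", "num_attention_heads", "hidden_size",
             "vocab_size", "seq_length"):
    print(name, getattr(config, name, "not present under this name"))
```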
Evaluation

For Qwen-7B-Chat, we report results on standard benchmarks for Chinese understanding (C-Eval), English understanding (MMLU), coding (HumanEval), and mathematics (GSM8K), as well as benchmarks for long-context understanding and tool usage. Since the aligned Qwen-7B-Chat has gained a strong ability to call external systems, we also evaluate its tool-use capability.

Note: Due to rounding errors caused by hardware and frameworks, differences in reproduced results are possible.
Chinese Evaluation

C-Eval

We report the 0-shot & 5-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
Model | Avg. Acc. |
---|---|
LLaMA2-7B-Chat | 31.9 |
LLaMA2-13B-Chat | 36.2 |
LLaMA2-70B-Chat | 44.3 |
ChatGLM2-6B-Chat | 52.6 |
InternLM-7B-Chat | 53.6 |
Baichuan2-7B-Chat | 55.6 |
Baichuan2-13B-Chat | 56.7 |
Qwen-7B-Chat (original) (0-shot) | 54.2 |
Qwen-7B-Chat (0-shot) | 59.7 |
Qwen-7B-Chat (5-shot) | 59.3 |
Qwen-14B-Chat (0-shot) | 69.8 |
Qwen-14B-Chat (5-shot) | 71.7 |
The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:
Model | Avg. | STEM | Social Sciences | Humanities | Others |
---|---|---|---|---|---|
Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
Qwen-7B-Chat | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
Qwen-14B-Chat | 69.1 | 65.1 | 80.9 | 71.2 | 63.4 |
Compared with other models of comparable size, the human-aligned Qwen-7B-Chat remains among the top performers in C-Eval accuracy.
English Evaluation

MMLU

The 0-shot & 5-shot accuracy of Qwen-7B-Chat on MMLU is provided below. Qwen-7B-Chat again performs near the top among aligned models of comparable size.
Model | Avg. Acc. |
---|---|
ChatGLM2-6B-Chat | 46.0 |
LLaMA2-7B-Chat | 46.2 |
InternLM-7B-Chat | 51.1 |
Baichuan2-7B-Chat | 52.9 |
LLaMA2-13B-Chat | 54.6 |
Baichuan2-13B-Chat | 57.3 |
LLaMA2-70B-Chat | 63.8 |
Qwen-7B-Chat (original) (0-shot) | 53.9 |
Qwen-7B-Chat (0-shot) | 55.8 |
Qwen-7B-Chat (5-shot) | 57.0 |
Qwen-14B-Chat (0-shot) | 64.6 |
Qwen-14B-Chat (5-shot) | 66.5 |
Coding Evaluation

The zero-shot Pass@1 of Qwen-7B-Chat on HumanEval is shown below:
Model | Pass@1 |
---|---|
ChatGLM2-6B-Chat | 11.0 |
LLaMA2-7B-Chat | 12.2 |
Baichuan2-7B-Chat | 13.4 |
InternLM-7B-Chat | 14.6 |
Baichuan2-13B-Chat | 17.7 |
LLaMA2-13B-Chat | 18.9 |
LLaMA2-70B-Chat | 32.3 |
Qwen-7B-Chat (original) | 24.4 |
Qwen-7B-Chat | 37.2 |
Qwen-14B-Chat | 43.9 |
Mathematics Evaluation

The accuracy of Qwen-7B-Chat on GSM8K, which evaluates mathematical ability, is shown below:
Model | Acc. |
---|---|
LLaMA2-7B-Chat | 26.3 |
ChatGLM2-6B-Chat | 28.8 |
Baichuan2-7B-Chat | 32.8 |
InternLM-7B-Chat | 33.0 |
LLaMA2-13B-Chat | 37.1 |
Baichuan2-13B-Chat | 55.3 |
LLaMA2-70B-Chat | 59.3 |
Qwen-7B-Chat (original) (0-shot) | 41.1 |
Qwen-7B-Chat (0-shot) | 50.3 |
Qwen-7B-Chat (8-shot) | 54.1 |
Qwen-14B-Chat (0-shot) | 60.1 |
Qwen-14B-Chat (8-shot) | 59.3 |
Long-Context Understanding

We use NTK-aware interpolation and LogN attention scaling to extend the context length of Qwen-7B-Chat. On VCSUM, a long-text summarization dataset whose average input length is around 15K tokens, the Rouge-L results of Qwen-7B-Chat are shown below.

(To enable these tricks, set use_dynamic_ntk and use_logn_attn to true in config.json; a sketch of doing this programmatically follows the results table below.)
Model | VCSUM (zh) |
---|---|
GPT-3.5-Turbo-16k | 16.0 |
LLaMA2-7B-Chat | 0.2 |
InternLM-7B-Chat | 13.0 |
ChatGLM2-6B-Chat | 16.3 |
Qwen-7B-Chat | 16.6 |
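The same flags can also be enabled programmatically instead of editing config.json by hand. Below is a minimal sketch, assuming the use_dynamic_ntk and use_logn_attn fields described above are exposed on the model's (remote-code) config object:

```python
# Minimal sketch: enable the long-context tricks via the config object.
# Assumes the use_dynamic_ntk / use_logn_attn fields described above exist on
# this model's remote-code config.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
config.use_dynamic_ntk = True   # NTK-aware interpolation
config.use_logn_attn = True     # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()
```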
Tool Usage

ReAct Prompting

Qwen-Chat supports calling plugins/tools/APIs through ReAct prompting. ReAct is also one of the main approaches used by the LangChain framework. On our open-source benchmark for evaluating tool-use capability, Qwen-Chat performs as follows:
Chinese Tool-Use Benchmark | |||
---|---|---|---|
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
GPT-4 | 95% | 0.90 | 15.0% |
GPT-3.5 | 85% | 0.88 | 75.0% |
Qwen-7B-Chat | 98% | 0.91 | 7.3% |
Qwen-14B-Chat | 98% | 0.93 | 2.4% |
The plugins that appear in the evaluation set do not appear in Qwen's training set. This benchmark evaluates the model's accuracy in selecting the correct plugin from multiple candidates, the validity of the parameters passed to the plugin, and the false positive rate. False positive: incorrectly invoking a plugin when responding to a query that should not require one.
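To give a concrete feel for what ReAct prompting looks like, below is a minimal, hand-rolled sketch of a ReAct-style prompt passed to model.chat. It is illustrative only and is not the exact prompt template used in the benchmark; see the GitHub repo for the official ReAct example:

```python
# Illustrative ReAct-style prompt (not the official Qwen template).
TOOL_DESC = "search: searches the web for a query. Input is a plain-text query."

react_prompt = f"""Answer the following question as best you can. You have access to the following tools:

{TOOL_DESC}

Use the following format:

Question: the input question
Thought: think about what to do next
Action: the tool to use, one of [search]
Action Input: the input to the tool
Observation: the result of the tool
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the question

Question: What is the weather like in Beijing today?"""

# In a real agent loop you would stop generation at "Observation:", run the
# tool, append its result, and continue; here we simply print the raw output.
response, history = model.chat(tokenizer, react_prompt, history=None)
print(response)
```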
Code Interpreter
ไธบไบ่ๅฏQwenไฝฟ็จPython Code Interpreterๅฎๆๆฐๅญฆ่งฃ้ขใๆฐๆฎๅฏ่งๅใๅๆไปถๅค็ไธ็ฌ่ซ็ญไปปๅก็่ฝๅ๏ผๆไปฌไธ้จๅปบ่ฎพๅนถๅผๆบไบไธไธช่ฏๆต่ฟๆน้ข่ฝๅ็่ฏๆตๅบๅใ
ๆไปฌๅ็ฐQwenๅจ็ๆไปฃ็ ็ๅฏๆง่ก็ใ็ปๆๆญฃ็กฎๆงไธๅ่กจ็ฐ่พๅฅฝ๏ผ
To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this link.
We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:
Executable Rate of Generated Code (%) | |||
---|---|---|---|
Model | Math↑ | Visualization↑ | General↑ |
GPT-4 | 91.9 | 85.9 | 82.8 |
GPT-3.5 | 89.2 | 65.0 | 74.1 |
LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
Accuracy of Code Execution Results (%) | |||
---|---|---|---|
Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
GPT-4 | 82.8 | 66.7 | 60.8 |
GPT-3.5 | 47.3 | 33.3 | 55.7 |
LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
HuggingFace Agent

Qwen-Chat can also serve as a HuggingFace Agent. Its performance on the run-mode benchmark provided by HuggingFace is as follows:
HuggingFace Agent Benchmark- Run Mode | |||
---|---|---|---|
Model | Tool Selection↑ | Tool Used↑ | Code↑ |
GPT-4 | 100 | 100 | 97.4 |
GPT-3.5 | 95.4 | 96.3 | 87.0 |
StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
StarCoder-15B | 87.0 | 88.0 | 68.9 |
Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
HuggingFace Agent Benchmark - Chat Mode | |||
---|---|---|---|
Model | Tool Selection↑ | Tool Used↑ | Code↑ |
GPT-4 | 97.9 | 97.9 | 98.5 |
GPT-3.5 | 97.3 | 96.8 | 89.6 |
StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
StarCoder-15B | 97.9 | 97.9 | 89.6 |
Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
x86 Platforms

When deploying the quantized models on Core™/Xeon® Scalable Processors or with an Arc™ GPU, we recommend using the OpenVINO™ Toolkit to make full use of the hardware and achieve better inference performance. You can install and run this example notebook. For related issues, you are welcome to file an issue in the OpenVINO repo.
FAQ
If you encounter problems, please consult the FAQ and the existing issues to look for a solution before filing a new issue.
Citation

If you find our work helpful, feel free to cite us:
@article{qwen,
title={Qwen Technical Report},
author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
journal={arXiv preprint arXiv:2309.16609},
year={2023}
}
License Agreement

Our code and model checkpoints are fully open for academic research and also allow commercial use. Check LICENSE for details of the license. For commercial use, please fill out the form to apply.
Contact Us

If you are interested in leaving a message for our research or product team, join our WeChat, DingTalk, or Discord groups! You can also reach us by email at [email protected].