Baichuan-7B
Baichuan-7B is an open-source large-scale pre-trained language model developed by Baichuan Intelligent Technology. Based on the Transformer architecture, it has 7 billion parameters and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English, with a context window length of 4096, and achieves the best results among models of its size on standard authoritative Chinese and English benchmarks (C-Eval/MMLU).
If you wish to use Baichuan-7B (for inference, finetuning, etc.), we recommend using the accompanying code repository Baichuan-7B.
Why use Baichuan-7B
- Among models of the same size, Baichuan-7B achieves the current state-of-the-art (SOTA) level, as shown by the MMLU results below.
- Baichuan-7B is trained on proprietary bilingual Chinese-English corpora, is optimized for Chinese, and achieves SOTA performance on C-Eval.
- Unlike LLaMA, whose license completely prohibits commercial use, Baichuan-7B uses a more permissive open-source license that allows commercial use.
How to Get Started with the Model
The following is a 1-shot inference task in Chinese using Baichuan-7B: given a literary work, the model should output its author. The correct output is "ๅค้จๅฏๅ->ๆๅ้".
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (trust_remote_code is required for Baichuan's custom model code).
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

# Build the 1-shot prompt ("work -> author") and move it to the GPU.
inputs = tokenizer('็ป้นณ้ๆฅผ->็ไนๆถฃ\nๅค้จๅฏๅ->', return_tensors='pt')
inputs = inputs.to('cuda:0')

# Generate the completion and decode it.
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
The following is the same 1-shot inference task in English: given a work, the model outputs its author. The correct output is "One Hundred Years of Solitude->Gabriel Garcia Marquez".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

# Same procedure as above, with an English 1-shot prompt.
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
Model Details
Model Description
- Developed by: ็พๅทๆบ่ฝ(Baichuan Intelligent Technology)
- Email: [email protected]
- Language(s) (NLP): Chinese/English
- License: Baichuan-7B License
Model Sources
The overall model is based on the standard Transformer architecture, and we have adopted the same model design as LLaMA:
- Position embedding: rotary position embedding (RoPE), the position encoding scheme adopted by most current models, which extrapolates well to longer contexts.
- Feed-forward layer: SwiGLU, with the feed-forward hidden size set to roughly 8/3 of the model dimension, i.e. 11008.
- Layer normalization: pre-normalization based on RMSNorm (a minimal sketch of the SwiGLU and RMSNorm components follows the parameter table below).
The specific parameters are as follows:
Hyperparameter | Value |
---|---|
n_parameters | 7000559616 |
n_layers | 32 |
n_heads | 32 |
d_model | 4096 |
vocab size | 64000 |
sequence length | 4096 |
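To make the design above concrete, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU feed-forward components, using the dimensions from the table (d_model = 4096, feed-forward size 11008). This is a simplified reconstruction for illustration, not the actual Baichuan-7B implementation; the class and attribute names are our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Pre-normalization without mean centering, as used in LLaMA-style models.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    # Feed-forward block with a SwiGLU gate: down(silu(gate(x)) * up(x)).
    def __init__(self, d_model=4096, d_ff=11008):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))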
Uses
Downstream Use
We have also open-sourced the training code that accompanies this model, allowing for efficient finetuning for downstream tasks. For more details, please refer to Baichuan-7B.
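For orientation only, a full-parameter finetuning run with the Hugging Face Trainer might look like the sketch below. This is not the official training code: the dataset file, sequence length, and hyperparameters are placeholders, and memory-saving techniques (DeepSpeed, LoRA, gradient checkpointing, etc.) that a 7B model typically requires are omitted.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)

# The tokenizer may not define a pad token; reuse EOS for padding if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder dataset: a plain-text file with one training example per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(
    output_dir="baichuan-7b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()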
Out-of-Scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
Bias, Risks, and Limitations
Baichuan-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information. Baichuan-7B was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Training Details
For specific training settings, please refer to Baichuan-7B.
Evaluation
Chinese Evaluation
C-Eval
C-Eval is a comprehensive Chinese evaluation dataset for foundation models, covering 52 subjects and four difficulty levels. We used the dev split of the dataset as the source of few-shot examples and ran a 5-shot test on the test split.
Model (5-shot) | Average | Avg (Hard) | STEM | Social Sciences | Humanities | Others |
---|---|---|---|---|---|---|
GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
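To illustrate the 5-shot setup described above, the following sketch assembles a prompt from five solved dev examples plus one unsolved test question. It is not the evaluation harness actually used; the field names ("question", "A"–"D", "answer") are assumptions about the data layout, and scoring of the model's answer is omitted.

def format_example(example, with_answer=True):
    # One question rendered as: question text, four lettered choices, then "Answer: X".
    prompt = example["question"] + "\n"
    for letter in ["A", "B", "C", "D"]:
        prompt += f"{letter}. {example[letter]}\n"
    prompt += "Answer:" + (f" {example['answer']}\n\n" if with_answer else "")
    return prompt

def build_5shot_prompt(dev_examples, test_example):
    # Five solved dev examples followed by the unsolved test question.
    prompt = "".join(format_example(ex) for ex in dev_examples[:5])
    return prompt + format_example(test_example, with_answer=False)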
Gaokao
Gaokao is a dataset that uses questions from the Chinese college entrance examination (Gaokao) to evaluate the language and logical reasoning abilities of large language models. We kept only the single-answer multiple-choice questions and ran a unified 5-shot test on all models.
The results are shown below.
Model | Average |
---|---|
Open-LLaMA-v2-pretrain | 21.41 |
Ziya-LLaMA-13B-pretrain | 23.17 |
Falcon-7B | 23.98 |
TigerBot-7B-base | 25.94 |
LLaMA-7B | 27.81 |
ChatGLM-6B | 21.41 |
BLOOM-7B | 26.96 |
BLOOMZ-7B | 28.72 |
Aquila-7B* | 24.39 |
Baichuan-7B | 36.24 |
AGIEval
AGIEval is designed to evaluate a model's general abilities on cognition- and problem-solving-related tasks. We kept only the four-option single-answer multiple-choice questions and, after random partitioning, ran a unified 5-shot test on all models.
Model | Average |
---|---|
Open-LLaMA-v2-pretrain | 23.49 |
Ziya-LLaMA-13B-pretrain | 27.64 |
Falcon-7B | 27.18 |
TigerBot-7B-base | 25.19 |
LLaMA-7B | 28.17 |
ChatGLM-6B | 23.49 |
BLOOM-7B | 26.55 |
BLOOMZ-7B | 30.27 |
Aquila-7B* | 25.58 |
Baichuan-7B | 34.44 |
* Results for the Aquila model are taken from the official website of the Beijing Academy of Artificial Intelligence (BAAI) and are provided for reference only.
English Leaderboard
In addition to Chinese, we also tested the model's performance in English.
MMLU
MMLU is an English evaluation dataset that includes 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, making it a mainstream LLM evaluation dataset.
We adopted an open-source evaluation scheme, and the final 5-shot results are as follows:
Model | Humanities | Social Sciences | STEM | Other | Average |
---|---|---|---|---|---|
LLaMA-7B² | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
Falcon-7B¹ | - | - | - | - | 35.0 |
mpt-7B¹ | - | - | - | - | 35.6 |
ChatGLM-6B⁰ | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
BLOOM 7B⁰ | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
BLOOMZ 7B⁰ | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
moss-moon-003-base (16B)⁰ | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
moss-moon-003-sft (16B)⁰ | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
Baichuan-7B⁰ | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
The superscript in the Model column indicates the source of the results:
⁰: reimplemented
¹: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
²: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
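As background, one common way such 5-shot multiple-choice scores are computed (an assumption about the open-source scheme referenced above, not a description of the exact harness used) is to compare the model's next-token logits for the four answer letters at the end of the prompt:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

def predict_choice(prompt, choices=("A", "B", "C", "D")):
    # Score each answer letter by the logit its first token receives as the
    # next token after the 5-shot prompt; return the highest-scoring letter.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer(c, add_special_tokens=False).input_ids[0] for c in choices]
    scores = [next_token_logits[i].item() for i in letter_ids]
    return choices[scores.index(max(scores))]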