Baichuan-7B

Baichuan-7B is an open-source large-scale pre-trained model developed by Baichuan Intelligent Technology. Based on the Transformer architecture, it is a 7-billion-parameter model trained on approximately 1.2 trillion tokens. It supports both Chinese and English, with a context window length of 4096, and achieves the best performance among models of its size on the standard authoritative Chinese and English benchmarks (C-Eval/MMLU).

If you wish to use Baichuan-7B (for inference, finetuning, etc.), we recommend using the accompanying code repository Baichuan-7B.
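The headline numbers above can be checked directly against the published checkpoint. The following is a minimal sketch, assuming the LLaMA-style configuration attribute names (vocab_size, max_position_embeddings) that this repository appears to use; the second step downloads the full weights.

from transformers import AutoConfig, AutoModelForCausalLM

# Read only the configuration (no weights); attribute names are assumed to follow
# the LLaMA-style config shipped with this repository.
config = AutoConfig.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
print(config.vocab_size, config.max_position_embeddings)  # expected: 64000 4096

# Counting parameters requires downloading the full checkpoint.
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
print(sum(p.numel() for p in model.parameters()))         # expected: 7000559616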

Why use Baichuan-7B

  • ๅœจๅŒๅฐบๅฏธๆจกๅž‹ไธญBaichuan-7B่พพๅˆฐไบ†็›ฎๅ‰SOTA็š„ๆฐดๅนณ๏ผŒๅ‚่€ƒไธ‹้ขMMLUๆŒ‡ๆ ‡

  • Baichuan-7Bไฝฟ็”จ่‡ชๆœ‰็š„ไธญ่‹ฑๆ–‡ๅŒ่ฏญ่ฏญๆ–™่ฟ›่กŒ่ฎญ็ปƒ๏ผŒๅœจไธญๆ–‡ไธŠ่ฟ›่กŒไผ˜ๅŒ–๏ผŒๅœจC-Eval่พพๅˆฐSOTAๆฐดๅนณ

  • ไธๅŒไบŽLLaMAๅฎŒๅ…จ็ฆๆญขๅ•†ไธšไฝฟ็”จ๏ผŒBaichuan-7Bไฝฟ็”จๆ›ดๅฎฝๆพ็š„ๅผ€ๆบๅ่ฎฎ๏ผŒๅ…่ฎธ็”จไบŽๅ•†ไธš็›ฎ็š„

  • Among models of the same size, Baichuan-7B has achieved the current state-of-the-art (SOTA) level, as evidenced by the following MMLU metrics.

  • Baichuan-7B is trained on proprietary bilingual Chinese-English corpora, optimized for Chinese, and achieves SOTA performance on C-Eval.

  • Unlike LLaMA, which completely prohibits commercial use, Baichuan-7B employs a more lenient open-source license, allowing for commercial purposes.

How to Get Started with the Model

The following is a 1-shot inference task using Baichuan-7B: given a work, the model should output the author's name, with the correct output being "ๅคœ้›จๅฏ„ๅŒ—->ๆŽๅ•†้š" (the poem "ๅคœ้›จๅฏ„ๅŒ—" and its author, Li Shangyin).

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub; trust_remote_code is
# required because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

# 1-shot prompt: one worked example (poem -> poet), then the query to complete.
inputs = tokenizer('็™ป้นณ้›€ๆฅผ->็Ž‹ไน‹ๆถฃ\nๅคœ้›จๅฏ„ๅŒ—->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

The following is a 1-shot inference task using Baichuan-7B: given a work, the model should output the author's name, with the correct output being "One Hundred Years of Solitude->Gabriel Garcia Marquez".

from transformers import AutoModelForCausalLM, AutoTokenizer

# Same setup as above, with an English 1-shot prompt (work -> author).
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Model Details

The overall model is based on the standard Transformer architecture, and we have adopted the same model design as LLaMA:

  • Position Embedding: rotary embeddings, the position-encoding scheme adopted by most models at this stage, with good extrapolation properties.
  • Feedforward Layer: SwiGLU, with the feedforward hidden size set to (8/3) times the model dimension, i.e. 11008 (see the sketch after the table below).
  • Layer Normalization: Pre-Normalization based on RMSNorm.

The specific parameters are as follows:

| Hyperparameter | Value |
| --- | --- |
| n_parameters | 7000559616 |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 64000 |
| sequence length | 4096 |
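To make the design choices above concrete, here is a minimal PyTorch sketch of a SwiGLU feedforward block with RMSNorm pre-normalization, assuming LLaMA-style conventions (gate/up/down projections without bias, hidden size rounded up to a multiple of 256). It is illustrative only, not the model's actual implementation.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by 1/RMS(x); no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down(silu(gate(x)) * up(x)); 11008 is (8/3)*4096 rounded up to a multiple of 256."""
    def __init__(self, d_model: int = 4096, hidden: int = 11008):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden, bias=False)
        self.up_proj = nn.Linear(d_model, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# Pre-normalization: the sub-layer sees RMS-normalized input, and its output is added residually.
x = torch.randn(1, 8, 4096)
y = x + SwiGLUFeedForward()(RMSNorm(4096)(x))
print(y.shape)  # torch.Size([1, 8, 4096])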

Uses

Downstream Use

ๆˆ‘ไปฌๅŒๆ—ถๅผ€ๆบๅ‡บไบ†ๅ’Œๆœฌๆจกๅž‹้…ๅฅ—็š„่ฎญ็ปƒไปฃ็ ๏ผŒๅ…่ฎธ่ฟ›่กŒ้ซ˜ๆ•ˆ็š„Finetune็”จไบŽไธ‹ๆธธไปปๅŠก๏ผŒๅ…ทไฝ“ๅ‚่งBaichuan-7Bใ€‚

We have also open-sourced the training code that accompanies this model, allowing for efficient finetuning for downstream tasks. For more details, please refer to Baichuan-7B.
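As a rough illustration of what such finetuning looks like with the standard Hugging Face Trainer API (not the official training scripts), the sketch below uses a placeholder text file and placeholder hyperparameters; in practice, full finetuning of a 7B model typically needs multiple GPUs or memory-efficient techniques such as DeepSpeed or LoRA.

from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)

# LLaMA-style tokenizers often lack a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: one training example per line in train.txt.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="baichuan-7b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,               # assumes a GPU with bfloat16 support
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()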

Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.

Bias, Risks, and Limitations

Baichuan-7Bๅฏ่ƒฝไผšไบง็”Ÿไบ‹ๅฎžไธŠไธๆญฃ็กฎ็š„่พ“ๅ‡บ๏ผŒไธๅบ”ไพ่ต–ๅฎƒไบง็”Ÿไบ‹ๅฎžไธŠๅ‡†็กฎ็š„ไฟกๆฏใ€‚Baichuan-7Bๆ˜ฏๅœจๅ„็งๅ…ฌๅ…ฑๆ•ฐๆฎ้›†ไธŠ่ฟ›่กŒ่ฎญ็ปƒ็š„ใ€‚ๅฐฝ็ฎกๆˆ‘ไปฌๅทฒ็ปๅšๅ‡บไบ†ๅทจๅคง็š„ๅŠชๅŠ›ๆฅๆธ…ๆด—้ข„่ฎญ็ปƒๆ•ฐๆฎ๏ผŒไฝ†่ฟ™ไธชๆจกๅž‹ๅฏ่ƒฝไผš็”Ÿๆˆๆทซ็งฝใ€ๅ่งๆˆ–ๅ…ถไป–ๅ†’็Šฏๆ€ง็š„่พ“ๅ‡บใ€‚

Baichuan-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information. Baichuan-7B was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Training Details

For the specific training setup, please refer to Baichuan-7B.

Evaluation

Chinese Evaluation

C-Eval

CEvalๆ•ฐๆฎ้›†ๆ˜ฏไธ€ไธชๅ…จ้ข็š„ไธญๆ–‡ๅŸบ็ก€ๆจกๅž‹่ฏ„ๆต‹ๆ•ฐๆฎ้›†๏ผŒๆถต็›–ไบ†52ไธชๅญฆ็ง‘ๅ’Œๅ››ไธช้šพๅบฆ็š„็บงๅˆซใ€‚ๆˆ‘ไปฌไฝฟ็”จ่ฏฅๆ•ฐๆฎ้›†็š„dev้›†ไฝœไธบfew-shot็š„ๆฅๆบ๏ผŒๅœจtest้›†ไธŠ่ฟ›่กŒไบ†5-shotๆต‹่ฏ•ใ€‚

| Model 5-shot | Average | Avg(Hard) | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
| Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
| LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
| ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
| Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
| TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
| Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
| BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
| BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
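The following is a minimal sketch of one common way to run such a 5-shot multiple-choice evaluation: build a prompt from five dev-set exemplars plus the test question, then pick the option letter with the highest next-token logit. The prompt template ("็ญ”ๆกˆ๏ผš", i.e. "Answer:") and the scoring rule are assumptions for illustration, not necessarily the exact protocol behind the numbers above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

def format_example(question, choices, answer=""):
    # One multiple-choice item rendered as the question, options A-D, and "็ญ”ๆกˆ๏ผš<letter>".
    lines = [question] + [f"{label}. {choice}" for label, choice in zip("ABCD", choices)]
    return "\n".join(lines) + "\n็ญ”ๆกˆ๏ผš" + answer

def predict_answer(few_shot_examples, question, choices):
    # few_shot_examples: list of (question, choices, answer_letter) tuples taken from the dev split.
    prompt = "\n\n".join(format_example(*example) for example in few_shot_examples)
    prompt += "\n\n" + format_example(question, choices)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    option_ids = [tokenizer(letter, add_special_tokens=False).input_ids[-1] for letter in "ABCD"]
    return "ABCD"[torch.argmax(next_token_logits[option_ids]).item()]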

Gaokao

Gaokao is a dataset built from questions of the Chinese college entrance examination (Gaokao), used to evaluate the language ability and logical reasoning of large language models. We kept only the single-answer multiple-choice questions and ran a unified 5-shot test on all models.

The results are as follows:

| Model | Average |
| --- | --- |
| Open-LLaMA-v2-pretrain | 21.41 |
| Ziya-LLaMA-13B-pretrain | 23.17 |
| Falcon-7B | 23.98 |
| TigerBot-7B-base | 25.94 |
| LLaMA-7B | 27.81 |
| ChatGLM-6B | 21.41 |
| BLOOM-7B | 26.96 |
| BLOOMZ-7B | 28.72 |
| Aquila-7B* | 24.39 |
| Baichuan-7B | 36.24 |

AGIEval

AGIEval aims to evaluate a model's general abilities on cognition- and problem-solving-related tasks. We kept only the four-option single-answer multiple-choice questions and, after a random split, ran a unified 5-shot test on all models.

| Model | Average |
| --- | --- |
| Open-LLaMA-v2-pretrain | 23.49 |
| Ziya-LLaMA-13B-pretrain | 27.64 |
| Falcon-7B | 27.18 |
| TigerBot-7B-base | 25.19 |
| LLaMA-7B | 28.17 |
| ChatGLM-6B | 23.49 |
| BLOOM-7B | 26.55 |
| BLOOMZ-7B | 30.27 |
| Aquila-7B* | 25.58 |
| Baichuan-7B | 34.44 |

*The Aquila model results are taken from the official website of BAAI (ๆ™บๆบ) and are provided for reference only.

English Leaderboard

In addition to Chinese, we also tested the model's performance in English.

MMLU

MMLU is an English evaluation dataset that includes 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, making it a mainstream LLM evaluation dataset.

We adopted an open-source evaluation scheme, and the final 5-shot results are as follows:

| Model | Humanities | Social Sciences | STEM | Other | Average |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7Bยฒ | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7Bยน | - | - | - | - | 35.0 |
| mpt-7Bยน | - | - | - | - | 35.6 |
| ChatGLM-6Bโฐ | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOM-7Bโฐ | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| BLOOMZ-7Bโฐ | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| moss-moon-003-base (16B)โฐ | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| moss-moon-003-sft (16B)โฐ | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| Baichuan-7Bโฐ | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |

The superscript in the Model column indicates the source of the results:

โฐ: reimplemented
ยน: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
ยฒ: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
