**XVERSE-MoE-A4.2B** is a multilingual Large Language Model independently developed by Shenzhen Yuanxiang Technology. It uses a Mixture-of-Experts (MoE) architecture with 25.8 billion total parameters, of which 4.2 billion are actually activated. The model released here is the base model **XVERSE-MoE-A4.2B**, whose main features are as follows:

- **Model architecture**: XVERSE-MoE-A4.2B uses a decoder-only Transformer architecture in which the FFN layers of a dense model are expanded into expert layers. Unlike conventional MoE designs where each expert matches the size of a standard FFN (e.g. Mixtral 8x7B), it uses finer-grained experts, each 1/4 the size of a standard FFN, split into two kinds: shared experts, which are always activated during computation, and non-shared experts, which are selectively activated by a router (a toy sketch of this layout is shown below, after the list).
- **Training data**: The model is fully trained on 2.7 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian and Spanish. By finely tuning the sampling ratios of different data types, the model performs excellently in Chinese and English while also covering other languages; training samples are 8K tokens long.
- **Training framework**: The expert-routing and weight-computation logic unique to MoE models was deeply customized and optimized, producing a set of efficient fused operators that improve computational efficiency. To tackle the large memory footprint and communication volume of MoE models, the framework overlaps computation, communication and CPU-Offload to raise overall throughput (a rough illustration of the overlap idea is also shown below).
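
The expert layout described in the **Model architecture** bullet can be made concrete with a minimal, single-device sketch. Everything below is illustrative only: the hidden sizes, expert counts and top-k value (`d_model=512`, `d_ff=2048`, `n_shared=2`, `n_routed=16`, `top_k=4`) are made-up numbers rather than the released configuration, and the loop-based routing is a readability device, not XVERSE's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """A plain feed-forward block used as one expert."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class FineGrainedMoE(nn.Module):
    """Expert layer with always-on shared experts and router-selected non-shared experts."""
    def __init__(self, d_model=512, d_ff=2048, n_shared=2, n_routed=16, top_k=4):
        super().__init__()
        d_expert = d_ff // 4   # each fine-grained expert is 1/4 the size of a standard FFN
        self.shared = nn.ModuleList(FFN(d_model, d_expert) for _ in range(n_shared))
        self.routed = nn.ModuleList(FFN(d_model, d_expert) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        out = sum(expert(x) for expert in self.shared)      # shared experts: always active
        probs = F.softmax(self.router(x), dim=-1)           # routing probabilities
        weight, idx = probs.topk(self.top_k, dim=-1)        # pick top-k non-shared experts per token
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                hit = idx[:, k] == e_id                      # tokens routed to this expert
                if hit.any():
                    out[hit] = out[hit] + weight[hit, k, None] * expert(x[hit])
        return out

tokens = torch.randn(8, 512)
print(FineGrainedMoE()(tokens).shape)   # -> torch.Size([8, 512])
```
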
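
The compute/communication/CPU-Offload overlap mentioned in the **Training framework** bullet relies on custom fused operators and multi-device scheduling that are not reproduced here. The toy below, assuming PyTorch and a CUDA device, only illustrates the basic single-GPU mechanic: an offloaded weight is copied back from pinned CPU memory on a side stream while unrelated computation keeps running on the default stream.

```python
import torch

def overlapped_step(x, weight_cpu):
    # Start copying the offloaded weight host->device on a side stream so the
    # transfer can proceed while the default stream keeps computing.
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        weight_gpu = weight_cpu.to("cuda", non_blocking=True)

    y = x @ x.T                                            # work that does not need the weight
    torch.cuda.current_stream().wait_stream(copy_stream)   # order the copy before first use
    return y @ weight_gpu

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, pin_memory=True)           # pinned memory enables async copies
    print(overlapped_step(x, w).shape)                     # -> torch.Size([1024, 1024])
```
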
The model size, architecture and learning rate of **XVERSE-MoE-A4.2B** are shown below:

To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations span multiple capabilities of the model, specifically Chinese question answering, English question answering, language comprehension, commonsense question answering, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:

| Dataset                  | XVERSE-MoE-A4.2B | XVERSE-13B-2 | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B | DeepSeek-MoE-16B |
| ------------------------ | :--------------: | :----------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: | :--------------: |
| C-Eval                   | 60.5             | 62.0         | 58.1          | 35.6       | 38.8       | 57.1      | 45.0        | 45.1       | 50.0     | 40.6             |
| CMMLU                    | 64.5             | 65.4         | 62.0          | 38.4       | 40.6       | 61.3      | 47.2        | 44.9       | 50.5     | 42.5             |
| Gaokao-Bench<sup>1</sup> | 60.3             | 65.3         | 54.3          | 35.4       | 38.9       | 61.7      | 35.4        | 40.2       | 42.3     | 29.1             |
| MMLU                     | 60.2             | 60.0         | 59.2          | 54.8       | 63.4       | 56.6      | 48.2        | 62.5       | 64.3     | 45               |
| AGIEval<sup>1</sup>      | 48.0             | 52.4         | 48.2          | 33.4       | 42.4       | 46.9      | 26.4        | 41.2       | 41.7     | 31.7             |
| RACE-M                   | 75.4             | 82.4         | 68.9          | 63.0       | 67.9       | 79.0      | 63.2        | 67.5       | 80.2     | 61.9             |
| CommonSenseQA            | 70.0             | 68.0         | 65.6          | 67.3       | 74.0       | 64.1      | 56.4        | 68.8       | 74.0     | 54.8             |
| PIQA                     | 81.4             | 79.8         | 78.5          | 80.5       | 82.8       | 76.7      | 79.2        | 82.2       | 81.2     | 80.2             |
| GSM8K                    | 51.2             | 52.7         | 52.7          | 28.7       | 50.9       | 19.3      | 17.4        | 35.4       | 46.4     | 18.8             |
| HumanEval                | 29.9             | 32.3         | 17.1          | 18.3       | 23.7       | 10.4      | 26.2        | 26.2       | 32.3     | 26.8             |

> <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>