nouamanetazi (HF staff) committed on
Commit b00a86a · 1 Parent(s): 5ee9a01
Files changed (1)
  1. src/index.html +2 -2
src/index.html CHANGED
@@ -80,7 +80,7 @@
  This open source book is here to change that. Starting from the basics, we'll walk you through the knowledge necessary to scale the training of large language models (LLMs) from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.
  </p>
 
- <p>As the size of the clusters used to train these models has grown, various techniques, such as data parallelism, tensor parallelism, pipeline parallelism, and context parallelism as well as ZeRO and kernel fusion, have been invented to make sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the most efficient use of this expensive hardware. <!-- RH: The following sentence doesn't quite make sense (it's missing something - what is it that happens or is the case "even more"?). Clarify, or leave it out? --> <!-- Even more, as the challenge of scaling up AI training goes beyond just building the initial models, and teams have found that fine-tuning large models on specialized data often produces the best results, generally involving the same distributed training techniques. --> In this book, we'll progressively go over all of these techniques – from the simplest to the most refined ones – while maintaining a single story line to help you understand where each method comes from.</p>
+ <p>As the size of the clusters used to train these models has grown, various techniques, such as data parallelism, tensor parallelism, pipeline parallelism, and context parallelism as well as ZeRO and kernel fusion, have been invented to make sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the most efficient use of this expensive hardware. These distributed training techniques are not only important for building initial models but have also become essential for fine-tuning large models on specialized data, which often produces the best results. In this book, we'll progressively go over all of these techniques – from the simplest to the most refined ones – while maintaining a single story line to help you understand where each method comes from.</p>
 
  <aside>If you have questions or remarks, open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>
 
@@ -254,7 +254,7 @@
 
  <!-- <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p> -->
 
- <p><strong>Real training efficiency benchmarks:</strong> How to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips used, interconnect, etc., so we can’t give a single unified recipe for this. What we will give you is a way to benchmark several setups <!-- RH: Here, do you mean "and the results on our cluster" or something like "...several setups. This is what we've done on our cluster."? This sentence doesn't quite make sense as written. --> and it is what we have done on our cluster. We ran over 4,100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
+ <p><strong>3. Real training efficiency benchmarks:</strong> How to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips used, interconnect, etc., so we can't give a single unified recipe for this. What we will give you is a way to benchmark several setups. This is what we've done on our cluster. We ran over 4,100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. </p>
 
  <!-- <iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" scrolling="no" frameborder="0" height="840" width="720"></iframe> -->
  <div id="fragment-benchmarks_interactive"></div>
 