nouamanetazi committed
Commit a9321ed · 1 Parent(s): b00a86a
Files changed (1)
  1. src/index.html +8 -8
src/index.html CHANGED
@@ -270,7 +270,7 @@
270
  <li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
271
  <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
272
  </ol>
273
- <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). <!-- RH: Do the previous changes make sense? --> Finding the right balance is key to scaling training.</p>
274
  <p>
275
  As this book covers a lot of ground, we've made a <a href="assets/images/ultra-cheatsheet.svg">cheatsheet</a> to help you navigate it and get the general takeaways. Keep it close by as you navigate these stormy waters!
276
  </p>
@@ -341,7 +341,7 @@
341
 
342
  <aside>For instance, during DeepSeek-V3/R1 training<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, the batch size is gradually increased from 3,072 input sequences to 15,360 in the training of the first 469B tokens, then kept at 15,360 input samples for the remaining training.</aside>
343
 
344
- <p>Batch size also affects the time it takes to train on a given text dataset: a small batch size will require more optimizer steps <!-- RH: Is it OK to switch between optimization step and optimizer step, or would it be better to stick with one term or the other throughout the book? --> to train on the same amount of samples. Optimizer steps are costly (in compute time), and the total time to train will thus increase compared to using a larger batch size. That being said, note that the batch size can often be adjusted quite widely around the optimal batch size without major impact on the performance of the model - that is, the sensitivity of final model performance to the exact batch size value is usually rather low around the optimal batch size.</p>
345
 
346
  <p>In the LLM pretraining community, batch sizes are commonly reported in terms of tokens rather than number of samples (<d-math>bst</d-math> = batch size tokens). This makes training numbers generally independent of the exact input sequence length used during the training.</p>
347
 
@@ -377,7 +377,7 @@
377
  You might think that you could compute the memory requirements for a model exactly, but there are a few additional memory occupants that make it hard to be precise:
378
  <ul>
379
  <li>CUDA kernels typically require 1-2 GB of GPU memory, which you can quickly verify by running <code>import torch; torch.ones((1, 1)).to("cuda")</code> and then checking the GPU memory with <code>nvidia-smi</code>.</li>
380
- <li><!-- RH: How about "Some memory is used for buffers and intermediate results, and..."? -->Some rest memory usage from buffers, intermediate results, and there's some memory that cant be used due to fragmentation.</li>
381
  </ul>
382
  We’ll neglect these last two contributors, as they are typically small and constant factors.
383
  </p></div>
@@ -434,7 +434,7 @@
434
  \end{aligned}
435
  </d-math>
436
 
437
- <p>Now, let’s have a look at how things change if we use a lower precision. For stability reasons (see the section on <a target="_self" href="#mixed_precision_training">mixed precision training</a> later in the book), we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations – requiring 2 bytes per parameter and gradient – as well as <!-- RH: as well as storing? Or OK as is (i.e., implying "as well as using")? --> an additional copy of the model weights and gradients in FP32, making 12 bytes per parameter in total. In addition to the parameters and gradients, we need to store the optimizer states; for the Adam optimizer, this requires the momentum and the variance, usually stored in FP32 for numerical stability, each using 4 bytes. </p>
438
 
439
  <aside>You'll see some more details below when we cover the ZeRO methods.</aside>
440
 
@@ -516,7 +516,7 @@
516
 
517
  <p>Here, <d-math>L</d-math> is the number of layers, <d-math>seq</d-math> the sequence length, <d-math>bs</d-math> the batch size in samples, <d-math>h</d-math> the hidden dimension of the model, and <d-math>n_{heads}</d-math> the number of heads.</p>
518
 
519
- <p>For the exact derivation of the numbers, you can follow this <!-- RH: the? --> original NVIDIA paper on recomputation <d-cite bibtex-key="korthikanti2022recomputation"></d-cite> - it essentially requires you to do some accounting of all the sizes of intermediate activations between each operation in a transformer layer.</p>
520
 
521
  <p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part that will blow up when we increase our batch size or train with longer sequences. We can use this equation to look at how memory usage changes for various sequence lengths, for example for Llama models (<code>bs=1</code>):</p>
522
 
@@ -551,7 +551,7 @@
551
  <li><strong>Selective:</strong> In general, we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of floating-point operations (FLOPs). It turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model, this means <strong>a 70% activation memory reduction at a 2.7% compute cost</strong>.</li>
552
  </ul>
553
 
554
- <aside>In recent models like DeepSeek-V3, selective checkpointing is performed, using so-called Multi-head Latent Attention (MLA) to optimize activation memory usage.</aside><!-- RH: Does that work? -->
555
 
556
  <p>Let’s see how drastically recomputation strategies can reduce the memory footprint in practice, and how selective recomputation strikes a nice balance between memory savings and recomputation cost:</p>
557
  <!-- RH: In this figure, change "optimizer" to "optimizer states" on the right. -->
@@ -563,9 +563,9 @@
563
  <p class="note-box-title">📝 Note</p>
564
  <div class="note-box-content">
565
  <p>
566
- When youre measuring how efficient your training setup is at using your GPU/TPU/accelerator, you usually want to take recomputation into account to compute total FLOPS and compare this to the theoretical maximum FLOPS of the GPU/TPU/accelerator. Taking recomputation into account when calculating FLOPS for a training step gives a value called hardware FLOPS,” which is the real number of operations performed on the accelerator. Dividing this number by the duration of the training step and the maximum accelerator FLOPS yields the <strong><em>hardware FLOPS utilization (HFU)</em></strong>.<!-- RH: Are you saying you first divide the hardware FLOPS value by the duration of the training step (if so, in what unit - seconds?), then divide the result by the maximum accelerator FLOPS? -->
567
  </p><p>
568
- However, what really matters at the end of the day is the total time needed to train a model on a given dataset. So, for example, when comparing various GPUs/TPUs/accelerators, if one of these provides enough memory to skip recomputation and thus performs less operations per second (lower HFU) but still trains faster, it should be rewarded, not punished. Thus, an alternative is to compute what is called <strong><em>model FLOPS utilization (MFU)</em></strong>, which, in contrast to HFU, only takes into account the required operations for the forward and backward passes through the model and does not include recomputation in the measured FLOPS. This value is thus more specific to the model than the training implementation.
569
  </p>
570
  </div>
571
  </div>
 
270
  <li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
271
  <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
272
  </ol>
273
+ <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
274
  <p>
275
  As this book covers a lot of ground, we've made a <a href="assets/images/ultra-cheatsheet.svg">cheatsheet</a> to help you navigate it and get the general takeaways. Keep it close by as you navigate these stormy waters!
276
  </p>
 
341
 
342
  <aside>For instance, during DeepSeek-V3/R1 training<d-cite bibtex-key="deepseekai2024deepseekv3technicalreport"></d-cite>, the batch size is gradually increased from 3,072 input sequences to 15,360 in the training of the first 469B tokens, then kept at 15,360 input samples for the remaining training.</aside>
343
 
344
+ <p>Batch size also affects the time it takes to train on a given text dataset: a small batch size will require more optimizer steps to train on the same number of samples. Optimizer steps are costly (in compute time), and the total time to train will thus increase compared to using a larger batch size. That being said, note that the batch size can often be adjusted quite widely around the optimal batch size without major impact on the performance of the model - that is, the sensitivity of final model performance to the exact batch size value is usually rather low around the optimal batch size.</p>
345
 
346
  <p>In the LLM pretraining community, batch sizes are commonly reported in terms of tokens rather than number of samples (<d-math>bst</d-math> = batch size tokens). This makes training numbers generally independent of the exact input sequence length used during the training.</p>
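To make the tokens-versus-samples bookkeeping concrete, here is a minimal sketch; the batch size and token budget come from the DeepSeek-V3 aside above, while the sequence length is an assumed, illustrative value.

```python
# Illustrative only: batch size in samples vs. in tokens (bst), and how the number
# of optimizer steps needed for a fixed token budget depends on it.
bs = 3_072             # batch size in samples (sequences), early DeepSeek-V3 value
seq = 4_096            # assumed sequence length in tokens (hypothetical)
bst = bs * seq         # batch size in tokens

token_budget = 469e9               # first training phase, from the aside above
steps = token_budget / bst         # larger bst -> fewer (costly) optimizer steps
print(f"bst = {bst:,} tokens, ~{steps:,.0f} optimizer steps for the 469B-token phase")
```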
347
 
 
377
  You might think that you could compute the memory requirements for a model exactly, but there are a few additional memory occupants that make it hard to be precise:
378
  <ul>
379
  <li>CUDA kernels typically require 1-2 GB of GPU memory, which you can quickly verify by running <code>import torch; torch.ones((1, 1)).to("cuda")</code> and then checking the GPU memory with <code>nvidia-smi</code>.</li>
380
+ <li>Some memory is used for buffers and intermediate results, and there's some memory that can't be used due to fragmentation.</li>
381
  </ul>
382
  We’ll neglect these last two contributors, as they are typically small and constant factors.
383
  </p></div>
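If you want to poke at these overheads yourself, a quick probe in the spirit of the snippet from the first bullet might look like the following sketch; the exact numbers vary with GPU, driver, and PyTorch version.

```python
import torch

# Touch the GPU once so the CUDA context and kernels get loaded.
torch.ones((1, 1)).to("cuda")

# Memory used by live tensors vs. memory held by PyTorch's caching allocator:
allocated_mb = torch.cuda.memory_allocated() / 1e6
reserved_mb = torch.cuda.memory_reserved() / 1e6
print(f"allocated: {allocated_mb:.1f} MB, reserved: {reserved_mb:.1f} MB")

# `nvidia-smi` will still report more than `reserved`, because the CUDA context
# and loaded kernels (the 1-2 GB mentioned above) live outside the allocator.
```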
 
434
  \end{aligned}
435
  </d-math>
436
 
437
+ <p>Now, let’s have a look at how things change if we use a lower precision. For stability reasons (see the section on <a target="_self" href="#mixed_precision_training">mixed precision training</a> later in the book), we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations – requiring 2 bytes per parameter and gradient – as well as storing an additional copy of the model weights and gradients in FP32, making 12 bytes per parameter in total. In addition to the parameters and gradients, we need to store the optimizer states; for the Adam optimizer, this requires the momentum and the variance, usually stored in FP32 for numerical stability, each using 4 bytes. </p>
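As a quick sanity check on these byte counts, here is a tiny helper (hypothetical, for illustration only) that simply adds them up for a given parameter count:

```python
# Bytes per parameter for BF16 mixed-precision training with Adam, as tallied above.
def weights_grads_optim_memory_gb(num_params: float) -> float:
    bytes_per_param = (
        2 + 2      # BF16 parameters and gradients
        + 4 + 4    # FP32 copies of the parameters and gradients (12 bytes so far)
        + 4 + 4    # Adam momentum and variance, kept in FP32
    )              # = 20 bytes per parameter, before counting activations
    return num_params * bytes_per_param / 1e9

print(f"{weights_grads_optim_memory_gb(8e9):.0f} GB for an 8B-parameter model")
```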
438
 
439
  <aside>You'll see some more details below when we cover the ZeRO methods.</aside>
440
 
 
516
 
517
  <p>Here, <d-math>L</d-math> is the number of layers, <d-math>seq</d-math> the sequence length, <d-math>bs</d-math> the batch size in samples, <d-math>h</d-math> the hidden dimension of the model, and <d-math>n_{heads}</d-math> the number of heads.</p>
518
 
519
+ <p>For the exact derivation of the numbers, you can follow the original NVIDIA paper on recomputation <d-cite bibtex-key="korthikanti2022recomputation"></d-cite> - it essentially requires you to do some accounting of all the sizes of intermediate activations between each operation in a transformer layer.</p>
520
 
521
  <p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part that will blow up when we increase our batch size or train with longer sequences. We can use this equation to look at how memory usage changes for various sequence lengths, for example for Llama models (<code>bs=1</code>):</p>
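The scaling behavior is easy to see by turning the estimate into a small function. The sketch below follows the per-layer accounting of the recomputation paper (roughly seq·bs·h·(34 + 5·n_heads·seq/h) bytes per layer with 16-bit activations and no recomputation); the constants should be treated as approximate, and the example configuration is assumed rather than taken from a specific Llama release.

```python
# Approximate activation memory without recomputation (16-bit activations);
# constants follow the recomputation paper's accounting and are approximate.
def activation_memory_gb(L, seq, bs, h, n_heads):
    per_layer_bytes = seq * bs * h * (34 + 5 * n_heads * seq / h)
    return L * per_layer_bytes / 1e9

# Assumed, Llama-8B-like configuration (illustrative):
base = activation_memory_gb(L=32, seq=4096, bs=1, h=4096, n_heads=32)
print(f"{base:.0f} GB at seq=4096, bs=1")
# Linear in batch size, quadratic in sequence length (via the attention-score term):
print(f"{activation_memory_gb(32, 4096, 2, 4096, 32) / base:.1f}x for 2x batch size")
print(f"{activation_memory_gb(32, 8192, 1, 4096, 32) / base:.1f}x for 2x sequence length")
```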
522
 
 
551
  <li><strong>Selective:</strong> In general, we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of floating-point operations (FLOPs). It turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model, this means <strong>a 70% activation memory reduction at a 2.7% compute cost</strong> (see the PyTorch sketch just after this list).</li>
552
  </ul>
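For reference, here is a minimal sketch of what activation checkpointing looks like mechanically in PyTorch; the wrapped block and the granularity are hypothetical, and training frameworks usually expose this behind their own configuration flags rather than requiring manual wrapping.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Wraps an arbitrary (hypothetical) sub-module so that its inner activations
    are not stored during the forward pass and are recomputed during backward."""
    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x):
        return checkpoint(self.block, x, use_reentrant=False)

# "Full" recomputation wraps every transformer layer like this; "selective"
# strategies apply it only to cheap-to-recompute pieces (e.g., the attention
# score/softmax activations) while keeping the rest in memory.
```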
553
 
554
+ <aside>In recent models like DeepSeek-V3, selective checkpointing is performed, and the so-called Multi-Head Latent Attention (MLA) mechanism further optimizes activation memory usage by storing an even smaller attention activation.</aside>
555
 
556
  <p>Let’s see how drastically recomputation strategies can reduce the memory footprint in practice, and how selective recomputation strikes a nice balance between memory savings and recomputation cost:</p>
557
  <!-- RH: In this figure, change "optimizer" to "optimizer states" on the right. -->
 
563
  <p class="note-box-title">📝 Note</p>
564
  <div class="note-box-content">
565
  <p>
566
+ When you're measuring how efficient your training setup is at using your GPU/TPU/accelerator, you usually want to take recomputation into account to compute total FLOPs (floating-point operations) and compare this to the theoretical maximum FLOPS (floating-point operations per second) of the GPU/TPU/accelerator. Taking recomputation into account when calculating FLOPs for a training step gives a value called "hardware FLOPs," which is the real number of operations performed on the accelerator. Dividing this hardware FLOPs value by the duration of the training step (in seconds) gives you the actual FLOPS achieved. Then, dividing this achieved FLOPS by the maximum accelerator FLOPS yields the <strong><em>hardware FLOPS utilization (HFU)</em></strong>.
567
  </p><p>
568
+ However, what really matters at the end of the day is the total time needed to train a model on a given dataset. So, for example, when comparing various GPUs/TPUs/accelerators, if one of these provides enough memory to skip recomputation and thus performs fewer total operations (lower hardware FLOPs) but still trains faster, it should be rewarded, not punished. Thus, an alternative is to compute what is called <strong><em>model FLOPS utilization (MFU)</em></strong>, which, in contrast to HFU, only takes into account the required operations for the forward and backward passes through the model and does not include recomputation in the measured FLOPs. This value is thus more specific to the model than the training implementation.
569
  </p>
570
  </div>
571
  </div>
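For concreteness, here is a small sketch of the HFU/MFU arithmetic described in this note; all inputs are hypothetical, and peak_flops is the accelerator's advertised peak in FLOPS.

```python
def hfu(hardware_flops_per_step: float, step_time_s: float, peak_flops: float) -> float:
    # Hardware FLOPs: every operation actually executed, including recomputation.
    return hardware_flops_per_step / step_time_s / peak_flops

def mfu(model_flops_per_step: float, step_time_s: float, peak_flops: float) -> float:
    # Model FLOPs: only the forward + backward of the model itself, no recomputation.
    return model_flops_per_step / step_time_s / peak_flops

# A common rule of thumb for model FLOPs per step is ~6 * num_params * tokens_per_step.
model_flops = 6 * 8e9 * 4e6   # assumed: 8B parameters, 4M tokens per optimizer step
peak = 64 * 989e12            # assumed: 64 accelerators at 989 TFLOPS dense BF16 peak each
print(f"MFU = {mfu(model_flops, step_time_s=7.6, peak_flops=peak):.1%}")  # ~40% here
```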