nouamanetazi committed on
Commit
835c7e8
·
1 Parent(s): a9321ed
Files changed (1)
  1. src/index.html +15 -15
src/index.html CHANGED
@@ -605,7 +605,7 @@
605
 
606
  <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
607
 
608
- <p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by computing only partial micro-batches. <!-- RH: computing gradients only for partial micro-batches? Or OK as is? --></p>
609
 
610
  <p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
611
 
@@ -634,20 +634,20 @@
634
  <p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
635
 
636
  <ul>
637
- <li>A CPU thread <!-- RH: Or "CPU threads"? If so, also change in the figure label below. --> launching kernels asynchronously on the GPU</li>
638
  <li>Multiple CUDA streams handling compute and communication in parallel</li>
639
  <li>Kernel execution times and memory allocation</li>
640
  </ul>
641
 
642
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
643
- <div class="figure-legend"><p>Example trace showing a CPU thread launching kernels asynchronously on the GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
644
 
645
  <p>The trace helps identify bottlenecks like:</p>
646
  <ul>
647
  <li>Sequential compute and communication that could be overlapped</li>
648
  <li>Idle GPU time waiting for data transfers</li>
649
- <li>Memory movement between CPU and GPU</li>
650
- <li>Kernel launch overhead from CPU <!-- RH: on the CPU? --></li>
651
  </ul>
652
 
653
  <p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
@@ -706,7 +706,7 @@
706
  if p.requires_grad is True:
707
  p.register_post_accumulate_grad_hook(hook)</d-code>
708
 
709
- <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass <!-- RH: backward passes? -->, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
710
 
711
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
712
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -770,7 +770,7 @@
770
 
771
  <p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
772
 
773
- <p>In practice, people tend to maximize the number of data-parallel nodes (DP) over gradient accumulation as much as possible since it's inherently parallel, unlike the sequential nature of gradient accumulation<!-- RH: Do you mean something like "people tend to focus on maximizing <d-math>dp</d-math> rather than <d-math>grad\_acc</d-math>, because gradient accumulation is sequential"? -->. Gradient accumulation is then added on top of data parallelism to achieve the target global batch size, when scaling data parallelism alone is not sufficient before you run out of GPUs.</p>
774
 
775
  <aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
776
  </aside>
@@ -782,7 +782,7 @@
782
 
783
  <ol>
784
  <li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
785
- <li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, 2-8k tokens works reliably well for the evaluations we have today <!-- RH: Is "evaluations" the right word to use there? I'm not sure what you mean by that. --> (we won’t dive into training recipes here, but teams usually increase the sequence <!-- RH: sequence length? Or OK as is? --> at the end of the training, adding some longer context data samples into the mix to reach the longer context sizes of today).</li>
786
  <li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
787
  <li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
788
  </ol>
@@ -792,7 +792,7 @@
792
 
793
  <p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
794
 
795
- <p>It's time to look at a concrete example. Lets say we want to train a recent model with a <d-math>gbs</d-math> of 4M tokens and a sequence length of 4k. Our batch size will thus be 1,024 samples (we pick the closest power of 2). Let's assume we observe that a single GPU can only fit <d-math>mbs</d-math>=2 in memory, and we have 128 GPUs available for training. This means with 4 gradient accumulation steps, well achieve our goal of 1,024 samples or 4M tokens per training step. Now, what if we suddenly have 512 GPUs available? We can achieve the same <d-math>gbs</d-math> and thus identical training by keeping <d-math>mbs</d-math>=2 and setting the number of gradient accumulation steps to 1 and achieve faster training!</p> <!-- RH: It seems odd to say you achieve identical training AND achieve faster training - is it identical, or is it faster? Maybe you should specify in what way it's identical, if it's both? -->
796
 
797
  <div class="note-box">
798
  <p class="note-box-title">📝 Note</p>
@@ -841,23 +841,23 @@
841
  <p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
842
 
843
 
844
- <h3>ZeRO</h3> <!-- RH: I'd spell it out either in the heading or the paragraph below; both seems like overkill. -->
845
 
846
- <p>In this section we will introduce DeepSpeed ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer), a memory optimization technology designed to reduce memory redundancy in LLM training.</p>
847
 
848
  <p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
849
 
850
- <aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce memory <!-- RH: reduce memory use / the memory footprint? --> while showing the trade-offs to take into account. You can find details on more ZeRO flavors in the <a href="https://www.deepspeed.ai/tutorials/zero/">DeepSpeed docs</a>.</aside>
851
 
852
  <p>This approach is organized into three possible optimization stages:</p>
853
 
854
  <ul>
855
  <li>ZeRO-1: optimizer state partitioning</li>
856
  <li>ZeRO-2: optimizer state + gradient partitioning</li>
857
- <li>ZeRO-3: optimizer state + gradient + parameter partitioning</li> <!-- RH: I'd avoid saying ZeRO-3 is "also called FSDP." You explain later that that's PyTorch's native implementation. -->
858
  </ul>
859
 
860
- <aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is part of data parallelism. <!-- RH: Maybe "is related to data parallelism" or "is a data-parallel method"? --> We’ll see later that we can partition along other axes as well.</aside>
861
 
862
  <p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
863
 
@@ -943,7 +943,7 @@
943
 
944
  <p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? We can go one step further. Here comes ZeRO-3!</p>
945
 
946
- <h4>ZeRO-3: Adding <strong>parameter partitioning</strong></h4>
947
 
948
  <p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
949
 
 
605
 
606
  <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
607
 
608
+ <p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by processing smaller micro-batches sequentially. Since only one micro-batch's worth of activations needs to be kept in memory at a time, the peak activation memory footprint is reduced.</p>
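  <p>To make the mechanics concrete, here is a minimal sketch of a gradient accumulation loop in plain PyTorch (the tiny model, optimizer, and random data below are placeholders for illustration, not part of our actual training code):</p>
  <d-code block language="python">import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a real model, optimizer, and dataloader
model = nn.Linear(32, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data_loader = [(torch.randn(4, 32), torch.randint(0, 8, (4,))) for _ in range(8)]

grad_acc_steps = 4  # micro-batches accumulated per optimizer step

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    logits = model(inputs)
    # Scale the loss so the accumulated gradient matches one big batch
    loss = F.cross_entropy(logits, targets) / grad_acc_steps
    loss.backward()  # gradients are summed into .grad; this micro-batch's
                     # activations are freed right after the backward pass
    if (step + 1) % grad_acc_steps == 0:
        optimizer.step()       # one parameter update per accumulation window
        optimizer.zero_grad()  # clear the gradient accumulation buffers</d-code>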
609
 
610
  <p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
611
 
 
634
  <p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
635
 
636
  <ul>
637
+ <li>CPU threads launching kernels asynchronously on the GPU</li>
638
  <li>Multiple CUDA streams handling compute and communication in parallel</li>
639
  <li>Kernel execution times and memory allocation</li>
640
  </ul>
641
 
642
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
643
+ <div class="figure-legend"><p>Example trace showing a CPU threads launching kernels asynchronously on the GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
644
 
645
  <p>The trace helps identify bottlenecks like:</p>
646
  <ul>
647
  <li>Sequential compute and communication that could be overlapped</li>
648
  <li>Idle GPU time waiting for data transfers</li>
649
+ <li>CUDA syncs and memory movement between the CPU and GPU</li>
650
+ <li>Kernel launch overhead from the CPU</li>
651
  </ul>
652
 
653
  <p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
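  <p>If you'd like to capture a similar trace yourself, here is one minimal way to do it with <code>torch.profiler</code> (the toy workload, schedule values, and output directory are arbitrary choices for illustration, and we assume a CUDA GPU is available):</p>
  <d-code block language="python">import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Tiny stand-in workload so the snippet runs on its own
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),  # skip 1 step, warm up 1, record 3
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    with_stack=True,
) as prof:
    for _ in range(5):
        optimizer.zero_grad()
        out = model(torch.randn(64, 1024, device="cuda"))
        out.sum().backward()
        optimizer.step()
        prof.step()  # tell the profiler that one training step has finished
# The .json traces written to ./profiler_logs open in TensorBoard's profiler
# plugin or in Chrome's trace viewer.</d-code>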
 
706
  if p.requires_grad is True:
707
  p.register_post_accumulate_grad_hook(hook)</d-code>
708
 
709
+ <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass within the same training step, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
710
 
711
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
712
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
770
 
771
  <p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
772
 
773
+ <p>In practice, people tend to maximize the data-parallel size (<d-math>dp</d-math>) over gradient accumulation (<d-math>grad\_acc</d-math>) as much as possible since data parallelism is inherently parallel, unlike the sequential nature of gradient accumulation. Gradient accumulation is then added on top of data parallelism to reach the target global batch size when scaling data parallelism alone is not sufficient, i.e., when you run out of GPUs.</p>
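  <p>As a rough sketch of this preference (the helper below is our own illustration, not code from any framework), we first use as many data-parallel ranks as the global batch and our hardware allow, and only then cover the remainder with gradient accumulation:</p>
  <d-code block language="python">def split_global_batch(gbs_samples, mbs, num_gpus):
    """Favor data parallelism first; use gradient accumulation for the rest."""
    # At most one micro-batch per rank per accumulation step, capped by the GPUs we have
    dp = min(num_gpus, gbs_samples // mbs)
    # Whatever remains of the global batch is processed sequentially on each rank
    grad_acc = max(1, gbs_samples // (mbs * dp))
    assert mbs * dp * grad_acc == gbs_samples, "pick gbs divisible by mbs * dp"
    return dp, grad_acc</d-code>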
774
 
775
  <aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
776
  </aside>
 
782
 
783
  <ol>
784
  <li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
785
+ <li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, 2-8k tokens works reliably well for the evaluation benchmarks we have today (we won’t dive into training recipes here, but teams usually increase the sequence length at the end of the training, adding some longer context data samples into the mix to reach the longer context sizes of today).</li>
786
  <li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
787
  <li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
788
  </ol>
 
792
 
793
  <p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
794
 
795
+ <p>It's time to look at a concrete example. Let's say we want to train a recent model with a <d-math>gbs</d-math> of 4M tokens and a sequence length of 4k. Our batch size will thus be 1,024 samples (we pick the closest power of 2). Let's assume we observe that a single GPU can only fit <d-math>mbs</d-math>=2 in memory, and we have 128 GPUs available for training. This means with 4 gradient accumulation steps, we'll achieve our goal of 1,024 samples or 4M tokens per training step. Now, what if we suddenly have 512 GPUs available? We can achieve the same <d-math>gbs</d-math> by keeping <d-math>mbs</d-math>=2 and setting the number of gradient accumulation steps to 1, which will result in faster training!</p>
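  <p>For the skeptical reader, here is the arithmetic of this example spelled out (all numbers are taken from the text above):</p>
  <d-code block language="python">gbs_tokens = 4_000_000  # target of roughly 4M tokens per step
seq_len    = 4096       # 4k sequence length
mbs        = 2          # micro-batch size that fits on one GPU

# 4_000_000 / 4096 is roughly 977 samples; we round to the closest power of 2
gbs_samples = 1024

for num_gpus in (128, 512):
    grad_acc = gbs_samples // (mbs * num_gpus)
    print(f"{num_gpus} GPUs -> {grad_acc} gradient accumulation step(s)")
# 128 GPUs -> 4 gradient accumulation step(s)
# 512 GPUs -> 1 gradient accumulation step(s)</d-code>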
796
 
797
  <div class="note-box">
798
  <p class="note-box-title">📝 Note</p>
 
841
  <p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
842
 
843
 
844
+ <h3>Zero Redundancy Optimizer (ZeRO)</h3>
845
 
846
+ <p>In this section we will introduce DeepSpeed ZeRO, a memory optimization technology designed to reduce memory redundancy in LLM training.</p>
847
 
848
  <p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
849
 
850
+ <aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce memory usage while showing the trade-offs to take into account. You can find details on more ZeRO flavors in the <a href="https://www.deepspeed.ai/tutorials/zero/">DeepSpeed docs</a>.</aside>
851
 
852
  <p>This approach is organized into three possible optimization stages:</p>
853
 
854
  <ul>
855
  <li>ZeRO-1: optimizer state partitioning</li>
856
  <li>ZeRO-2: optimizer state + gradient partitioning</li>
857
+ <li>ZeRO-3: optimizer state + gradient + parameter partitioning</li>
858
  </ul>
859
 
860
+ <aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is a data-parallel method. We’ll see later that we can partition along other axes as well.</aside>
861
 
862
  <p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
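  <p>To build some intuition for the savings before we dive into each stage, here is a back-of-the-envelope sketch of the per-GPU memory needed for the model states under each stage. We assume mixed-precision training with Adam (2 bytes per parameter for BF16 weights, 2 for BF16 gradients, and roughly 12 for the FP32 master weights plus the two optimizer moments); the 8B-parameter model and 64 DP ranks are arbitrary example values:</p>
  <d-code block language="python">def model_states_gb(num_params, dp_ranks, zero_stage):
    """Rough per-GPU memory (in GB) for parameters, gradients, and optimizer states."""
    params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if zero_stage >= 1:
        opt /= dp_ranks     # ZeRO-1: shard the optimizer states
    if zero_stage >= 2:
        grads /= dp_ranks   # ZeRO-2: also shard the gradients
    if zero_stage >= 3:
        params /= dp_ranks  # ZeRO-3: also shard the parameters
    return (params + grads + opt) / 1e9

for stage in (0, 1, 2, 3):
    label = "plain DP" if stage == 0 else f"ZeRO-{stage}"
    print(f"{label}: {model_states_gb(8e9, dp_ranks=64, zero_stage=stage):.1f} GB per GPU")
# plain DP: 128.0 GB, ZeRO-1: 33.5 GB, ZeRO-2: 17.8 GB, ZeRO-3: 2.0 GB per GPU</d-code>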
863
 
 
943
 
944
  <p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? We can go one step further. Here comes ZeRO-3!</p>
945
 
946
+ <h4>ZeRO-3: Adding <strong>parameter partitioning</strong> (FSDP)</h4>
947
 
948
  <p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
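  <p>Before we walk through how this works under the hood, here is a glimpse of what parameter partitioning looks like from the user's side with PyTorch's native FSDP implementation (a minimal sketch: we assume a launch via torchrun on CUDA GPUs, keep the default wrapping policy, and the toy model is just for illustration):</p>
  <d-code block language="python">import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # assumes env vars set by torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)).cuda()
# Parameters (and with them gradients and optimizer states) are sharded across
# DP ranks; full weights are all-gathered on the fly for each forward/backward
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # sees only local shards</d-code>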
949