Commit · 835c7e8
1 Parent(s): a9321ed
moar
src/index.html +15 -15
src/index.html
CHANGED
@@ -605,7 +605,7 @@
|
|
605 |
|
606 |
<aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
|
607 |
|
608 |
-
<p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by
|
609 |
|
610 |
<p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
|
611 |
|
@@ -634,20 +634,20 @@
|
|
634 |
<p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
|
635 |
|
636 |
<ul>
|
637 |
-
<li>A CPU
|
638 |
<li>Multiple CUDA streams handling compute and communication in parallel</li>
|
639 |
<li>Kernel execution times and memory allocation</li>
|
640 |
</ul>
|
641 |
|
642 |
<p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
|
643 |
-
<div class="figure-legend"><p>Example trace showing a CPU
|
644 |
|
645 |
<p>The trace helps identify bottlenecks like:</p>
|
646 |
<ul>
|
647 |
<li>Sequential compute and communication that could be overlapped</li>
|
648 |
<li>Idle GPU time waiting for data transfers</li>
|
649 |
-
<li>
|
650 |
-
<li>Kernel launch overhead
|
651 |
</ul>
|
652 |
|
653 |
<p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
|
@@ -706,7 +706,7 @@
|
|
706 |
if p.requires_grad is True:
|
707 |
p.register_post_accumulate_grad_hook(hook)</d-code>
|
708 |
|
709 |
-
<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass
|
710 |
|
711 |
<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
|
712 |
<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
|
@@ -770,7 +770,7 @@
|
|
770 |
|
771 |
<p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
|
772 |
|
773 |
-
<p>In practice, people tend to maximize the
|
774 |
|
775 |
<aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
|
776 |
</aside>
|
@@ -782,7 +782,7 @@
|
|
782 |
|
783 |
<ol>
|
784 |
<li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
|
785 |
-
<li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, 2-8k tokens works reliably well for the
|
786 |
<li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
|
787 |
<li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
|
788 |
</ol>
|
@@ -792,7 +792,7 @@
|
|
792 |
|
793 |
<p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
|
794 |
|
795 |
-
<p>It's time to look at a concrete example. Let
|
796 |
|
797 |
<div class="note-box">
|
798 |
<p class="note-box-title">📝 Note</p>
|
@@ -841,23 +841,23 @@
|
|
841 |
<p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
|
842 |
|
843 |
|
844 |
-
<h3>ZeRO</h3>
|
845 |
|
846 |
-
<p>In this section we will introduce DeepSpeed ZeRO
|
847 |
|
848 |
<p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
|
849 |
|
850 |
-
<aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce
|
851 |
|
852 |
<p>This approach is organized into three possible optimization stages:</p>
|
853 |
|
854 |
<ul>
|
855 |
<li>ZeRO-1: optimizer state partitioning</li>
|
856 |
<li>ZeRO-2: optimizer state + gradient partitioning</li>
|
857 |
-
<li>ZeRO-3: optimizer state + gradient + parameter partitioning</li>
|
858 |
</ul>
|
859 |
|
860 |
-
<aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is
|
861 |
|
862 |
<p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
|
863 |
|
@@ -943,7 +943,7 @@
|
|
943 |
|
944 |
<p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? Well, sort of. Here comes ZeRO-3!</p>
|
945 |
|
946 |
-
<h4>ZeRO-3: Adding <strong>parameter partitioning</strong
|
947 |
|
948 |
<p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
|
949 |
|
|
|
605 |
|
606 |
<aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
|
607 |
|
608 |
+
<p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by processing smaller micro-batches sequentially: only one micro-batch's worth of activations needs to be kept in memory at a time, which reduces the overall activation memory footprint.</p>
|
609 |
|
610 |
<p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
|
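<p>To make this concrete, here is a minimal, self-contained sketch of gradient accumulation (the tiny linear model, random data, and <code>grad_acc_steps=4</code> are purely illustrative stand-ins, not the book's training loop):</p>
<d-code block language="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a real model, optimizer, and dataloader.
model = nn.Linear(128, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batches = [(torch.randn(8, 128), torch.randn(8, 1)) for _ in range(16)]
grad_acc_steps = 4  # micro-batches accumulated per optimizer step

for step, (x, y) in enumerate(micro_batches):
    # Forward/backward on a small micro-batch: only its activations are alive at once.
    loss = F.mse_loss(model(x), y) / grad_acc_steps  # scale so the accumulated gradient is an average
    loss.backward()  # gradients accumulate into the persistent .grad buffers

    # Update weights (and free the gradient buffers) once every grad_acc_steps micro-batches.
    if (step + 1) % grad_acc_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
</d-code>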
611 |
|
|
|
634 |
<p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
|
635 |
|
636 |
<ul>
|
637 |
+
<li>CPU threads launching kernels asynchronously on the GPU</li>
|
638 |
<li>Multiple CUDA streams handling compute and communication in parallel</li>
|
639 |
<li>Kernel execution times and memory allocation</li>
|
640 |
</ul>
|
641 |
|
642 |
<p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
|
643 |
+
<div class="figure-legend"><p>Example trace showing a CPU threads launching kernels asynchronously on the GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
|
644 |
|
645 |
<p>The trace helps identify bottlenecks like:</p>
|
646 |
<ul>
|
647 |
<li>Sequential compute and communication that could be overlapped</li>
|
648 |
<li>Idle GPU time waiting for data transfers</li>
|
649 |
+
<li>CUDA synchronization points and memory movement between the CPU and GPU</li>
|
650 |
+
<li>Kernel launch overhead on the GPU</li>
|
651 |
</ul>
|
652 |
|
653 |
<p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
|
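<p>For reference, a trace like the one above can be produced with <code>torch.profiler</code>; the schedule, output directory, and the toy <code>train_step</code> below are illustrative choices rather than the exact configuration used for this figure:</p>
<d-code block language="python">
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

def train_step():
    # Stand-in for a real forward/backward/optimizer step.
    torch.randn(1024, 1024) @ torch.randn(1024, 1024)

# Record both CPU-side kernel launches and GPU activity for a few steps.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),  # open in TensorBoard
    with_stack=True,
) as prof:
    for _ in range(5):
        train_step()
        prof.step()  # tell the profiler a training step has finished
</d-code>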
|
|
706 |
if p.requires_grad is True:
|
707 |
p.register_post_accumulate_grad_hook(hook)</d-code>
|
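<p>For context, here is a minimal sketch of what the <code>hook</code> registered above could do, assuming <code>torch.distributed</code> is already initialized: launch an asynchronous all-reduce as soon as a parameter's gradient has been accumulated, with a small helper to wait on the handles before the optimizer step. This is an illustration, not the exact hook from the full implementation below.</p>
<d-code block language="python">
import torch
import torch.distributed as dist

handles = []  # pending asynchronous all-reduce handles for this backward pass

def hook(param: torch.nn.Parameter) -> None:
    # Called once this parameter's gradient is fully accumulated: average it
    # across DP ranks without blocking the rest of the backward pass.
    param.grad /= dist.get_world_size()
    handles.append(dist.all_reduce(param.grad, async_op=True))  # default op: SUM

def wait_for_grad_sync() -> None:
    # Call before optimizer.step() to make sure all gradients have been synced.
    for handle in handles:
        handle.wait()
    handles.clear()
</d-code>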
708 |
|
709 |
+
<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass within the same training step, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
|
710 |
|
711 |
<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
|
712 |
<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
|
|
|
770 |
|
771 |
<p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
|
772 |
|
773 |
+
<p>In practice, people tend to maximize the data-parallel size (<d-math>dp</d-math>) over gradient accumulation (<d-math>grad\_acc</d-math>) as much as possible, since data parallelism is inherently parallel, unlike the sequential nature of gradient accumulation. Gradient accumulation is then added on top of data parallelism to achieve the target global batch size when scaling data parallelism alone is not sufficient, i.e., when you run out of GPUs.</p>
|
774 |
|
775 |
<aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
|
776 |
</aside>
|
|
|
782 |
|
783 |
<ol>
|
784 |
<li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
|
785 |
+
<li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, a sequence length of 2-8k tokens works reliably well for the evaluation benchmarks we have today (we won’t dive into training recipes here, but teams usually increase the sequence length toward the end of training, adding some longer-context data samples into the mix to reach today's longer context sizes).</li>
|
786 |
<li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
|
787 |
<li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
|
788 |
</ol>
|
|
|
792 |
|
793 |
<p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
|
794 |
|
795 |
+
<p>It's time to look at a concrete example. Let's say we want to train a recent model with a <d-math>gbs</d-math> of 4M tokens and a sequence length of 4k. Our batch size will thus be 1,024 samples (we pick the closest power of 2). Let's assume we observe that a single GPU can only fit <d-math>mbs</d-math>=2 in memory, and we have 128 GPUs available for training. This means with 4 gradient accumulation steps, we'll achieve our goal of 1,024 samples or 4M tokens per training step. Now, what if we suddenly have 512 GPUs available? We can achieve the same <d-math>gbs</d-math> by keeping <d-math>mbs</d-math>=2 and setting the number of gradient accumulation steps to 1, which will result in faster training!</p>
|
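<p>As a sanity check, the arithmetic from this example can be written out directly (a small illustrative helper, not part of any training framework):</p>
<d-code block language="python">
def grad_acc_steps(gbs_tokens: int, seq_len: int, mbs: int, dp: int) -> int:
    """Gradient accumulation steps needed to reach a target global batch size."""
    gbs_samples = gbs_tokens // seq_len  # global batch size in samples
    return gbs_samples // (mbs * dp)     # since gbs = mbs * grad_acc * dp

# 4M tokens at a 4k sequence length is roughly 1,024 samples per step.
print(grad_acc_steps(gbs_tokens=4 * 1024**2, seq_len=4 * 1024, mbs=2, dp=128))  # -> 4
print(grad_acc_steps(gbs_tokens=4 * 1024**2, seq_len=4 * 1024, mbs=2, dp=512))  # -> 1
</d-code>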
796 |
|
797 |
<div class="note-box">
|
798 |
<p class="note-box-title">📝 Note</p>
|
|
|
841 |
<p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
|
842 |
|
843 |
|
844 |
+
<h3>Zero Redundancy Optimizer (ZeRO)</h3>
|
845 |
|
846 |
+
<p>In this section we will introduce DeepSpeed ZeRO, a memory optimization technology designed to reduce memory redundancy in LLM training.</p>
|
847 |
|
848 |
<p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
|
849 |
|
850 |
+
<aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce memory usage while showing the trade-offs to take into account. You can find details on more ZeRO flavors in the <a href="https://www.deepspeed.ai/tutorials/zero/">DeepSpeed docs</a>.</aside>
|
851 |
|
852 |
<p>This approach is organized into three possible optimization stages:</p>
|
853 |
|
854 |
<ul>
|
855 |
<li>ZeRO-1: optimizer state partitioning</li>
|
856 |
<li>ZeRO-2: optimizer state + gradient partitioning</li>
|
857 |
+
<li>ZeRO-3: optimizer state + gradient + parameter partitioning</li>
|
858 |
</ul>
|
859 |
|
860 |
+
<aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is a data-parallel method. We’ll see later that we can partition along other axes as well.</aside>
|
861 |
|
862 |
<p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
|
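<p>To get a feel for what each stage buys us, here is a rough, illustrative estimate of per-GPU memory for parameters, gradients, and optimizer states, assuming the usual mixed-precision Adam accounting (2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 12 for the fp32 master weights, momentum, and variance); the 8B-parameter model and <d-math>dp=64</d-math> are example numbers only:</p>
<d-code block language="python">
def zero_memory_per_gpu_gb(num_params: float, dp: int, stage: int) -> float:
    """Rough per-GPU memory (GB) for params + grads + optimizer states under ZeRO.

    Assumes bf16 params (2 bytes) and grads (2 bytes) plus fp32 master weights,
    momentum, and variance for Adam (4 + 4 + 4 = 12 bytes per parameter).
    """
    params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        opt /= dp      # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= dp    # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= dp   # ZeRO-3: also shard parameters
    return (params + grads + opt) / 1e9

# Example: an 8B-parameter model across 64 data-parallel ranks (illustrative numbers).
for stage in (0, 1, 2, 3):
    print(stage, round(zero_memory_per_gpu_gb(8e9, dp=64, stage=stage), 1))
</d-code>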
863 |
|
|
|
943 |
|
944 |
<p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? Well, sort of. Here comes ZeRO-3!</p>
|
945 |
|
946 |
+
<h4>ZeRO-3: Adding <strong>parameter partitioning</strong> (FSDP)</h4>
|
947 |
|
948 |
<p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
|
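<p>ZeRO-3-style parameter sharding is what PyTorch exposes as FSDP. Below is a rough sketch of how a model might be wrapped (<code>build_model()</code> is a placeholder for your own module, and real configurations typically also set an auto-wrap policy and mixed-precision options):</p>
<d-code block language="python">
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, launched e.g. with torchrun; NCCL backend assumed.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()  # build_model() is a placeholder for your nn.Module
model = FSDP(model)           # parameters, gradients, and optimizer states are sharded across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Training then proceeds as usual: FSDP all-gathers parameters around each
# forward/backward pass and reduce-scatters gradients, mirroring ZeRO-3.
</d-code>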
949 |
|