nouamanetazi committed on
Commit
835c7e8
·
1 Parent(s): a9321ed
Files changed (1)
  1. src/index.html +15 -15
src/index.html CHANGED
@@ -605,7 +605,7 @@
605
 
606
  <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
607
 
608
- <p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by computing only partial micro-batches. <!-- RH: computing gradients only for partial micro-batches? Or OK as is? --></p>
609
 
610
  <p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
611
 
@@ -634,20 +634,20 @@
634
  <p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
635
 
636
  <ul>
637
- <li>A CPU thread <!-- RH: Or "CPU threads"? If so, also change in the figure label below. --> launching kernels asynchronously on the GPU</li>
638
  <li>Multiple CUDA streams handling compute and communication in parallel</li>
639
  <li>Kernel execution times and memory allocation</li>
640
  </ul>
641
 
642
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
643
- <div class="figure-legend"><p>Example trace showing a CPU thread launching kernels asynchronously on the GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
644
 
645
  <p>The trace helps identify bottlenecks like:</p>
646
  <ul>
647
  <li>Sequential compute and communication that could be overlapped</li>
648
  <li>Idle GPU time waiting for data transfers</li>
649
- <li>Memory movement between CPU and GPU</li>
650
- <li>Kernel launch overhead from CPU <!-- RH: on the CPU? --></li>
651
  </ul>
652
 
653
  <p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
@@ -706,7 +706,7 @@
706
  if p.requires_grad is True:
707
  p.register_post_accumulate_grad_hook(hook)</d-code>
708
 
709
- <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass <!-- RH: backward passes? -->, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
710
 
711
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
712
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
@@ -770,7 +770,7 @@
770
 
771
  <p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
772
 
773
- <p>In practice, people tend to maximize the number of data-parallel nodes (DP) over gradient accumulation as much as possible since it's inherently parallel, unlike the sequential nature of gradient accumulation<!-- RH: Do you mean something like "people tend to focus on maximizing <d-math>dp</d-math> rather than <d-math>grad\_acc</d-math>, because gradient accumulation is sequential"? -->. Gradient accumulation is then added on top of data parallelism to achieve the target global batch size, when scaling data parallelism alone is not sufficient before you run out of GPUs.</p>
774
 
775
  <aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
776
  </aside>
@@ -782,7 +782,7 @@
782
 
783
  <ol>
784
  <li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
785
- <li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, 2-8k tokens works reliably well for the evaluations we have today <!-- RH: Is "evaluations" the right word to use there? I'm not sure what you mean by that. --> (we won’t dive into training recipes here, but teams usually increase the sequence <!-- RH: sequence length? Or OK as is? --> at the end of the training, adding some longer context data samples into the mix to reach the longer context sizes of today).</li>
786
  <li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
787
  <li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
788
  </ol>
@@ -792,7 +792,7 @@
792
 
793
  <p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
794
 
795
- <p>It's time to look at a concrete example. Lets say we want to train a recent model with a <d-math>gbs</d-math> of 4M tokens and a sequence length of 4k. Our batch size will thus be 1,024 samples (we pick the closest power of 2). Let's assume we observe that a single GPU can only fit <d-math>mbs</d-math>=2 in memory, and we have 128 GPUs available for training. This means with 4 gradient accumulation steps, well achieve our goal of 1,024 samples or 4M tokens per training step. Now, what if we suddenly have 512 GPUs available? We can achieve the same <d-math>gbs</d-math> and thus identical training by keeping <d-math>mbs</d-math>=2 and setting the number of gradient accumulation steps to 1 and achieve faster training!</p> <!-- RH: It seems odd to say you achieve identical training AND achieve faster training - is it identical, or is it faster? Maybe you should specify in what way it's identical, if it's both? -->
796
 
797
  <div class="note-box">
798
  <p class="note-box-title">📝 Note</p>
@@ -841,23 +841,23 @@
841
  <p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
842
 
843
 
844
- <h3>ZeRO</h3> <!-- RH: I'd spell it out either in the heading or the paragraph below; both seems like overkill. -->
845
 
846
- <p>In this section we will introduce DeepSpeed ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer), a memory optimization technology designed to reduce memory redundancy in LLM training.</p>
847
 
848
  <p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
849
 
850
- <aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce memory <!-- RH: reduce memory use / the memory footprint? --> while showing the trade-offs to take into account. You can find details on more ZeRO flavors in the <a href="https://www.deepspeed.ai/tutorials/zero/">DeepSpeed docs</a>.</aside>
851
 
852
  <p>This approach is organized into three possible optimization stages:</p>
853
 
854
  <ul>
855
  <li>ZeRO-1: optimizer state partitioning</li>
856
  <li>ZeRO-2: optimizer state + gradient partitioning</li>
857
- <li>ZeRO-3: optimizer state + gradient + parameter partitioning</li> <!-- RH: I'd avoid saying ZeRO-3 is "also called FSDP." You explain later that that's PyTorch's native implementation. -->
858
  </ul>
859
 
860
- <aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is part of data parallelism. <!-- RH: Maybe "is related to data parallelism" or "is a data-parallel method"? --> We’ll see later that we can partition along other axes as well.</aside>
861
 
862
  <p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
863
 
@@ -943,7 +943,7 @@
943
 
944
  <p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? We can go one step further. Here comes ZeRO-3!</p>
945
 
946
- <h4>ZeRO-3: Adding <strong>parameter partitioning</strong></h4>
947
 
948
  <p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
949
 
 
605
 
606
  <aside>Using gradient accumulation means we need to keep buffers where we accumulate gradients that persist throughout a training step, whereas without gradient accumulation, in the backward pass gradients are computed while freeing the activation memory, which means lower peak memory use.</aside>
607
 
608
+ <p>Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by processing smaller micro-batches sequentially. Since only one micro-batch's worth of activations needs to be kept in memory at a time, the peak activation memory footprint is reduced.</p>
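  <p>To make the mechanics concrete, here is a minimal sketch of a gradient accumulation loop in plain PyTorch (the tiny model, optimizer, and random data below are placeholders for illustration, not part of our actual training code):</p>
  <d-code block language="python">import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a real model, optimizer, and dataloader
model = nn.Linear(32, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data_loader = [(torch.randn(4, 32), torch.randint(0, 8, (4,))) for _ in range(8)]

grad_acc_steps = 4  # micro-batches accumulated per optimizer step

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    logits = model(inputs)
    # Scale the loss so the accumulated gradient matches one big batch
    loss = F.cross_entropy(logits, targets) / grad_acc_steps
    loss.backward()  # gradients are summed into .grad; this micro-batch's
                     # activations are freed right after the backward pass
    if (step + 1) % grad_acc_steps == 0:
        optimizer.step()       # one parameter update per accumulation window
        optimizer.zero_grad()  # clear the gradient accumulation buffers</d-code>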
609
 
610
  <p>One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training. No free lunch!</p>
611
 
 
634
  <p>This generates a trace that we can visualize in TensorBoard or Chrome's trace viewer. The trace shows:</p>
635
 
636
  <ul>
637
+ <li>CPU threads launching kernels asynchronously on the GPU</li>
638
  <li>Multiple CUDA streams handling compute and communication in parallel</li>
639
  <li>Kernel execution times and memory allocation</li>
640
  </ul>
641
 
642
  <p><img alt="profile_trace_annotated.png" src="/assets/images/profile_trace_annotated.png" /></p>
643
+ <div class="figure-legend"><p>Example trace showing a CPU threads launching kernels asynchronously on the GPU, with compute kernels and communication happening in parallel across different CUDA streams</p></div>
644
 
645
  <p>The trace helps identify bottlenecks like:</p>
646
  <ul>
647
  <li>Sequential compute and communication that could be overlapped</li>
648
  <li>Idle GPU time waiting for data transfers</li>
649
+ <li>CUDA syncs and memory movement between the CPU and GPU</li>
650
+ <li>Kernel launch overhead from the CPU</li>
651
  </ul>
652
 
653
  <p>Understanding these patterns is crucial for optimizing distributed training performance. For example, the trace will clearly show if gradient synchronization is properly overlapped with backward computation, as we'll discuss later.</p>
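  <p>If you'd like to capture a similar trace yourself, here is one minimal way to do it with <code>torch.profiler</code> (the toy workload, schedule values, and output directory are arbitrary choices for illustration, and we assume a CUDA GPU is available):</p>
  <d-code block language="python">import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Tiny stand-in workload so the snippet runs on its own
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),  # skip 1 step, warm up 1, record 3
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    with_stack=True,
) as prof:
    for _ in range(5):
        optimizer.zero_grad()
        out = model(torch.randn(64, 1024, device="cuda"))
        out.sum().backward()
        optimizer.step()
        prof.step()  # tell the profiler that one training step has finished
# The .json traces written to ./profiler_logs open in TensorBoard's profiler
# plugin or in Chrome's trace viewer.</d-code>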
 
706
  if p.requires_grad is True:
707
  p.register_post_accumulate_grad_hook(hook)</d-code>
708
 
709
+ <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass within the same training step, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
710
 
711
  <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
712
  <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
 
770
 
771
  <p>Given a targeted global batch size, we can thus trade gradient accumulation steps for data-parallel processes to speed up training.</p>
772
 
773
+ <p>In practice, people tend to maximize the data-parallel size (<d-math>dp</d-math>) over gradient accumulation (<d-math>grad\_acc</d-math>) as much as possible since data parallelism is inherently parallel, unlike the sequential nature of gradient accumulation. Gradient accumulation is then added on top of data parallelism to reach the target global batch size when scaling data parallelism alone is not sufficient, i.e., when you run out of GPUs.</p>
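  <p>As a rough sketch of this preference (the helper below is our own illustration, not code from any framework), we first use as many data-parallel ranks as the global batch and our hardware allow, and only then cover the remainder with gradient accumulation:</p>
  <d-code block language="python">def split_global_batch(gbs_samples, mbs, num_gpus):
    """Favor data parallelism first; use gradient accumulation for the rest."""
    # At most one micro-batch per rank per accumulation step, capped by the GPUs we have
    dp = min(num_gpus, gbs_samples // mbs)
    # Whatever remains of the global batch is processed sequentially on each rank
    grad_acc = max(1, gbs_samples // (mbs * dp))
    assert mbs * dp * grad_acc == gbs_samples, "pick gbs divisible by mbs * dp"
    return dp, grad_acc</d-code>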
774
 
775
  <aside>A good resource for further reading on data parallelism is <a href="https://siboehm.com/articles/22/data-parallel-training">https://siboehm.com/articles/22/data-parallel-training</a>.
776
  </aside>
 
782
 
783
  <ol>
784
  <li>We first determine the best (global) batch size in tokens, either by consulting the literature or by running experiments measuring model convergence.</li>
785
+ <li>We then select a sequence length for training, again by either consulting the literature or running experiments. Generally, 2-8k tokens works reliably well for the evaluation benchmarks we have today (we won’t dive into training recipes here, but teams usually increase the sequence length at the end of the training, adding some longer context data samples into the mix to reach the longer context sizes of today).</li>
786
  <li>We now know the batch size (<d-math>gbs</d-math>). We can find the maximum local batch size (<d-math>mbs</d-math>) on a single GPU by increasing the local batch size until we run out of memory.</li>
787
  <li>Finally, we determine the number of available GPUs for our target <d-math>dp</d-math>. The ratio of <d-math>gbs</d-math> to <d-math>dp</d-math> gives us the remaining number of gradient accumulation steps needed for the desired <d-math>gbs</d-math>. </li>
788
  </ol>
 
792
 
793
  <p>If the gradient accumulation ratio is lower than 1 - i.e., we have too many GPUs/are GPU-rich 🤑 (!) - we can either choose not to use all our GPUs, explore a larger <d-math>gbs</d-math>, or test if a lower <d-math>mbs</d-math> will speed up training. In the latter case we’ll end up prioritizing throughput over individual GPU compute efficiency, using a smaller <d-math>mbs</d-math> than possible in order to speed up training.</p>
794
 
795
+ <p>It's time to look at a concrete example. Let's say we want to train a recent model with a <d-math>gbs</d-math> of 4M tokens and a sequence length of 4k. Our batch size will thus be 1,024 samples (we pick the closest power of 2). Let's assume we observe that a single GPU can only fit <d-math>mbs</d-math>=2 in memory, and we have 128 GPUs available for training. This means with 4 gradient accumulation steps, we'll achieve our goal of 1,024 samples or 4M tokens per training step. Now, what if we suddenly have 512 GPUs available? We can achieve the same <d-math>gbs</d-math> by keeping <d-math>mbs</d-math>=2 and setting the number of gradient accumulation steps to 1, which will result in faster training!</p>
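  <p>For the skeptical reader, here is the arithmetic of this example spelled out (all numbers are taken from the text above):</p>
  <d-code block language="python">gbs_tokens = 4_000_000  # target of roughly 4M tokens per step
seq_len    = 4096       # 4k sequence length
mbs        = 2          # micro-batch size that fits on one GPU

# 4_000_000 / 4096 is roughly 977 samples; we round to the closest power of 2
gbs_samples = 1024

for num_gpus in (128, 512):
    grad_acc = gbs_samples // (mbs * num_gpus)
    print(f"{num_gpus} GPUs -> {grad_acc} gradient accumulation step(s)")
# 128 GPUs -> 4 gradient accumulation step(s)
# 512 GPUs -> 1 gradient accumulation step(s)</d-code>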
796
 
797
  <div class="note-box">
798
  <p class="note-box-title">📝 Note</p>
 
841
  <p>The sharding paradigm is closely related to DP, so we’ll have a look at it first by investigating the ZeRO method.</p>
842
 
843
 
844
+ <h3>Zero Redundancy Optimizer (ZeRO)</h3>
845
 
846
+ <p>In this section we will introduce DeepSpeed ZeRO, a memory optimization technology designed to reduce memory redundancy in LLM training.</p>
847
 
848
  <p>While data parallelism is an efficient way to scale training, the naive replication of optimizer states, gradients, and parameters across each DP rank introduces significant memory redundancy. ZeRO eliminates this by partitioning the optimizer states, gradients, and parameters across the data parallel dimension, while still allowing computation with the full set of parameters. This sometimes requires more communications between DP ranks, which may or may not be fully overlapped, as we’ll see next!</p>
849
 
850
+ <aside>We’ll focus on ZeRO-1 to ZeRO-3 in this book, as this should give a broad view of how this technology helps reduce memory usage while showing the trade-offs to take into account. You can find details on more ZeRO flavors in the <a href="https://www.deepspeed.ai/tutorials/zero/">DeepSpeed docs</a>.</aside>
851
 
852
  <p>This approach is organized into three possible optimization stages:</p>
853
 
854
  <ul>
855
  <li>ZeRO-1: optimizer state partitioning</li>
856
  <li>ZeRO-2: optimizer state + gradient partitioning</li>
857
+ <li>ZeRO-3: optimizer state + gradient + parameter partitioning</li>
858
  </ul>
859
 
860
+ <aside>When we say "partitioning" here, it means along the DP axis, as ZeRO is a data-parallel method. We’ll see later that we can partition along other axes as well.</aside>
861
 
862
  <p>You might have noticed that activations are missing from the list of things we can shard. Since each DP replica of the model receives a different micro-batch, the activations on each DP rank also differ, so they are not duplicated and thus can’t be sharded!</p>
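  <p>To build some intuition for the savings before we dive into each stage, here is a back-of-the-envelope sketch of the per-GPU memory needed for the model states under each stage. We assume mixed-precision training with Adam (2 bytes per parameter for BF16 weights, 2 for BF16 gradients, and roughly 12 for the FP32 master weights plus the two optimizer moments); the 8B-parameter model and 64 DP ranks are arbitrary example values:</p>
  <d-code block language="python">def model_states_gb(num_params, dp_ranks, zero_stage):
    """Rough per-GPU memory (in GB) for parameters, gradients, and optimizer states."""
    params, grads, opt = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if zero_stage >= 1:
        opt /= dp_ranks     # ZeRO-1: shard the optimizer states
    if zero_stage >= 2:
        grads /= dp_ranks   # ZeRO-2: also shard the gradients
    if zero_stage >= 3:
        params /= dp_ranks  # ZeRO-3: also shard the parameters
    return (params + grads + opt) / 1e9

for stage in (0, 1, 2, 3):
    label = "plain DP" if stage == 0 else f"ZeRO-{stage}"
    print(f"{label}: {model_states_gb(8e9, dp_ranks=64, zero_stage=stage):.1f} GB per GPU")
# plain DP: 128.0 GB, ZeRO-1: 33.5 GB, ZeRO-2: 17.8 GB, ZeRO-3: 2.0 GB per GPU</d-code>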
863
 
 
943
 
944
  <p>Now that we’ve sharded gradients as well, are we done, or can we keep making improvements? We can go one step further. Here comes ZeRO-3!</p>
945
 
946
+ <h4>ZeRO-3: Adding <strong>parameter partitioning</strong> (FSDP)</h4>
947
 
948
  <p>For stage 3, we extend the above approach of sharding optimizer states and gradients over DP replicas to sharding the model’s parameters.</p>
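  <p>Before we walk through how this works under the hood, here is a glimpse of what parameter partitioning looks like from the user's side with PyTorch's native FSDP implementation (a minimal sketch: we assume a launch via torchrun on CUDA GPUs, keep the default wrapping policy, and the toy model is just for illustration):</p>
  <d-code block language="python">import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # assumes env vars set by torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)).cuda()
# Parameters (and with them gradients and optimizer states) are sharded across
# DP ranks; full weights are all-gathered on the fly for each forward/backward
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # sees only local shards</d-code>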
949