nouamanetazi committed
Commit 68cc8e2 · 1 Parent(s): e87aa99

some few more changes

Files changed (2)
  1. dist/index.html +17 -3
  2. src/index.html +17 -3
dist/index.html CHANGED
@@ -1484,7 +1484,14 @@
   </script> -->
   <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->

- <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This means we don't save any activation memory with this approach.</p>
+
+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+ </div>
+ </div>

   <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
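As a quick sanity check of the note's accounting (a rough sketch, not part of the committed files; the per-micro-batch activation size and the PP values are illustrative assumptions):

# Assume the activations of one micro-batch through the *full* model take `activs` bytes.
activs = 8 * 1024**3               # 8 GiB for one micro-batch across all layers (assumption)
for PP in (1, 2, 4, 8):
    per_stage = activs / PP        # each GPU only holds 1/PP of the layers...
    in_flight = PP                 # ...but keeps PP micro-batches in flight before the first backward
    peak = in_flight * per_stage
    print(f"PP={PP}: peak activation memory per GPU ~ {peak / 1024**3:.1f} GiB")
# Prints ~8 GiB for every PP: pipeline parallelism alone does not reduce activation memory.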
@@ -3557,7 +3564,7 @@

   <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>

- <li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \cdot h^2</d-math>), plus master weights in FP32 (<d-math>h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 \cdot h^2</d-math>) per weight matrix.</li>
+ <li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>

   <li><strong>Total model parameters:</strong> Each transformer block will store:
   <ul>
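The corrected count in the new line adds up if the FP32 quantities are measured in half-precision (2-byte) parameter equivalents; a rough byte-level check under that reading, with an illustrative hidden size (not part of the committed files):

# "6 h^2" optimizer-state count for one h x h weight matrix under mixed-precision Adam.
h = 4096                                   # illustrative hidden size
bf16, fp32 = 2, 4                          # bytes per element
momentum = h * h * fp32                    # Adam first moment, FP32
variance = h * h * fp32                    # Adam second moment, FP32
master   = h * h * fp32                    # FP32 master copy of the weights
opt_states_bytes = momentum + variance + master
print(opt_states_bytes / (h * h * bf16))   # 6.0 -> "6 h^2" in BF16-parameter equivalents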
@@ -3586,6 +3593,13 @@
   </li>

   <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math>, which accounts for the quadratic scaling of the attention operations with sequence length, but to simplify the math we assume that <d-math>seq\_len \ll h</d-math>, so the quadratic term can be neglected.</p>
+ </div>
+ </div>
   </ul>

   <h3>A3: Math for Compute/Communication Overlap</h3>
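To gauge how much the neglected quadratic term matters, a quick comparison (not part of the committed files) under illustrative, roughly Llama-8B-like shape assumptions:

# Compare the dense (linear) FLOP term with the quadratic attention term from the note.
num_layers, h, seq_len = 32, 4096, 4096       # illustrative model shape
num_params  = num_layers * 16 * h**2          # dense-layer parameters, as counted in this appendix
dense_flops = 6 * seq_len * num_params        # forward + backward, linear in seq_len
attn_flops  = 12 * num_layers * h * seq_len**2
print(f"attention term / dense term = {attn_flops / dense_flops:.1%}")   # ~12.5% here
# The quadratic term only becomes dominant once seq_len approaches the order of h.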
@@ -3654,7 +3668,7 @@

   <p>The computation time for the forward pass of one decoder layer is:</p>
   <d-math block>
- t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+ t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
   </d-math>

   <p>For effective overlap between computation and communication, we need:</p>
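Plugging illustrative numbers into the expanded formula (a sketch, not part of the committed files; the hidden size, sequence length, and peak-FLOPs figure are assumptions):

# Evaluate t_compute for one decoder layer with the factors made explicit in the new formula.
seq_len, mbs, h = 4096, 1, 4096
peak_flops = 989e12                                   # ~989 TFLOPs BF16 peak, H100-class GPU (assumption)
flops_per_layer = 2 * seq_len * mbs * (16 * h**2)     # 2 FLOPs per parameter per token, 16 h^2 params/layer
t_compute = flops_per_layer / peak_flops
print(f"t_compute ~ {t_compute * 1e3:.2f} ms per decoder layer")   # ~2.2 ms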
 
src/index.html CHANGED
@@ -1484,7 +1484,14 @@
   </script> -->
   <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->

- <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers will be sent to the next GPU to continue the forward pass.</p>
+ <p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This means we don't save any activation memory with this approach.</p>
+
+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+ </div>
+ </div>

   <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
@@ -3557,7 +3564,7 @@

   <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>

- <li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \cdot h^2</d-math>), plus master weights in FP32 (<d-math>h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 \cdot h^2</d-math>) per weight matrix.</li>
+ <li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>

   <li><strong>Total model parameters:</strong> Each transformer block will store:
   <ul>
@@ -3586,6 +3593,13 @@
   </li>

   <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+ <div class="note-box">
+ <p class="note-box-title">📝 Note</p>
+ <div class="note-box-content">
+ <p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math>, which accounts for the quadratic scaling of the attention operations with sequence length, but to simplify the math we assume that <d-math>seq\_len \ll h</d-math>, so the quadratic term can be neglected.</p>
+ </div>
+ </div>
   </ul>

   <h3>A3: Math for Compute/Communication Overlap</h3>
@@ -3654,7 +3668,7 @@

   <p>The computation time for the forward pass of one decoder layer is:</p>
   <d-math block>
- t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+ t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
   </d-math>

   <p>For effective overlap between computation and communication, we need:</p>
 