Commit 68cc8e2
Parent(s): e87aa99

some few more changes

Files changed:
- dist/index.html (+17 -3)
- src/index.html (+17 -3)
dist/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 
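The note box added in this hunk makes a quantitative claim that is easy to check numerically. Below is a minimal sketch of my own (not part of the commit), assuming a crude per-layer activation footprint of seq_len * mbs * h elements and a schedule in which each stage holds activations for PP in-flight micro-batches before its first backward pass:

```python
# Back-of-envelope activation accounting for pipeline parallelism (illustrative).
# Assumption: each layer keeps ~seq_len * mbs * h activation elements per micro-batch.

def activation_elems_per_gpu(num_layers, seq_len, mbs, h, pp=1):
    per_layer_per_microbatch = seq_len * mbs * h
    layers_per_stage = num_layers // pp     # each GPU holds 1/PP of the layers
    in_flight_microbatches = pp             # PP forward passes before the first backward
    return layers_per_stage * per_layer_per_microbatch * in_flight_microbatches

baseline = activation_elems_per_gpu(num_layers=32, seq_len=4096, mbs=1, h=4096, pp=1)
with_pp  = activation_elems_per_gpu(num_layers=32, seq_len=4096, mbs=1, h=4096, pp=8)
print(with_pp / baseline)  # 1.0 -> PP * (activs / PP) ≈ activs, as the note states
```

Schedules such as 1F1B still keep up to roughly PP micro-batches in flight on the first stage, which is why the note says the requirement stays "roughly the same" rather than exactly the same.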
@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>
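For readers who prefer plain byte counts to the h^2 bookkeeping in the updated bullet, here is a small sketch of my own (not from the commit). It assumes bf16 weights and gradients with FP32 Adam states and FP32 master weights, and reads the article's 2h^2 / 6h^2 figures as counts of 2-byte (bf16-sized) elements:

```python
# Mixed-precision Adam memory per h*h weight matrix (illustrative accounting).
BYTES_BF16, BYTES_FP32 = 2, 4

def per_matrix_bytes(h):
    n = h * h                   # elements in one weight matrix
    weights  = n * BYTES_BF16   # bf16 working weights
    grads    = n * BYTES_BF16   # bf16 gradients
    momentum = n * BYTES_FP32   # Adam first moment (FP32)
    variance = n * BYTES_FP32   # Adam second moment (FP32)
    master   = n * BYTES_FP32   # FP32 master copy of the weights
    return weights + grads + momentum + variance + master

h = 4096
print(per_matrix_bytes(h) / (h * h))  # 16.0 bytes per parameter in total
# Optimizer states + master weights account for 12 of those bytes,
# i.e. 6 bf16-sized elements per parameter, matching the ~6 h^2 figure in the text.
```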
@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math>, which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>
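To see how much the attention term in the new note actually matters, here is a quick comparison sketch; the model shape and the crude parameter count are assumptions of mine, chosen only for illustration:

```python
# Rough vs. attention-aware FLOPs estimate for one forward+backward pass
# over a single sequence (illustrative shapes, not a real model).

def flops_rough(num_tokens, num_params):
    return 6 * num_tokens * num_params  # 2x in the forward pass, 4x in the backward

def flops_attention_aware(seq_len, num_params, num_layers, h):
    return 6 * seq_len * num_params + 12 * num_layers * h * seq_len**2

num_layers, h, seq_len = 32, 4096, 4096
num_params = 12 * num_layers * h**2     # crude transformer parameter count

rough = flops_rough(seq_len, num_params)
aware = flops_attention_aware(seq_len, num_params, num_layers, h)
print(aware / rough)  # ~1.17: under these assumptions the attention term adds ~seq_len / (6h)
```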
@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>
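Plugging numbers into the rewritten t_compute expression is a one-liner; the sketch below uses shapes and a peak-FLOPs figure that are my own assumptions, not values from the article:

```python
# Forward-pass compute time for one decoder layer, using the formula from the hunk above.

def t_compute(seq_len, mbs, h, peak_flops):
    # 2 * seq_len * mbs * (16 * h^2) / peak_flops == 32 * seq_len * mbs * h^2 / peak_flops
    return 32 * seq_len * mbs * h**2 / peak_flops

seq_len, mbs, h = 4096, 1, 4096
peak_flops = 989e12  # assumed accelerator peak (~989 TFLOPs in bf16)
print(f"{t_compute(seq_len, mbs, h, peak_flops) * 1e3:.2f} ms")  # ~2.22 ms per layer
```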
src/index.html  CHANGED

@@ -1484,7 +1484,14 @@
 </script> -->
 <!-- <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p> -->
 
-<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU!
+<p>Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This means we don't save any activation memory with this approach.</p>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>This is because each GPU needs to perform PP forward passes before starting the first backward pass. Since each GPU handles 1/PP of the layers but needs to process PP micro-batches before the first backward, it ends up storing <d-math>PP \times (activs / PP) \approx activs</d-math>, which means the activation memory requirement remains roughly the same as without pipeline parallelism.</p>
+</div>
+</div>
 
 <p>This introduces a new type of communication pattern: instead of communicating parameters like we did with ZeRO-3 in data parallelism, we're now passing activation tensors sequentially between GPUs in a "pipeline." While conceptually simple, efficiently implementing this technique is quite tricky. Let's dive right into the details!</p>
 

@@ -3557,7 +3564,7 @@
 
 <li><strong>Model weights and gradients:</strong> Each weight matrix in your model (e.g. linear layer) contains about <d-math>h^2</d-math> elements. Gradients have the same size as weights.</li>
 
-<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \
+<li><strong>Optimizer states:</strong> For each weight matrix (of <d-math>h^2</d-math> elements), an optimizer like Adam with mixed precision training will keep momentum and variance states in FP32 precision (<d-math>2 \times 2 h^2</d-math>), plus master weights in FP32 (<d-math>2 h^2</d-math>). So, the total number of optimizer states will be around (<d-math>6 h^2</d-math>) per weight matrix.</li>
 
 <li><strong>Total model parameters:</strong> Each transformer block will store:
 <ul>

@@ -3586,6 +3593,13 @@
 </li>
 
 <li><strong>Forward and backward pass compute (FLOPS):</strong> A very rough estimate for the FLOPS in a forward pass is <d-math>2 \cdot num\_tokens \cdot num\_params</d-math>. The backward pass compute is twice that: <d-math>4 \cdot num\_tokens \cdot num\_params</d-math>.</li>
+
+<div class="note-box">
+<p class="note-box-title">📝 Note</p>
+<div class="note-box-content">
+<p>A more accurate FLOPs formula for a forward+backward pass would be <d-math>6 \cdot seq\_len \cdot num\_params + 12 \cdot num\_layers \cdot h \cdot seq\_len^2</d-math>, which accounts for the quadratic scaling from attention operations across the entire sequence, but to simplify the math, we assume that <d-math>seq\_len^2 << h</d-math>.</p>
+</div>
+</div>
 </ul>
 
 <h3>A3: Math for Compute/Communication Overlap</h3>

@@ -3654,7 +3668,7 @@
 
 <p>The computation time for the forward pass of one decoder layer is:</p>
 <d-math block>
-t_{compute} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
+t_{compute} = \frac{2 \cdot seq\_len \cdot mbs \cdot (16 \cdot h^2)}{peak\_flops} = \frac{32 \cdot seq\_len \cdot mbs \cdot h^2}{peak\_flops}
 </d-math>
 
 <p>For effective overlap between computation and communication, we need:</p>