lvwerra (HF Staff) committed

Commit: f59aa30 · Parent: fe400ed

minor fixes

dist/assets/images/memorycoalescing.png CHANGED

Git LFS Details (before)

  • SHA256: 1094fe9aeb953c743791445ee6d7e73a5a89fa85fe60f4312266d1265e7c591a
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB

Git LFS Details (after)

  • SHA256: 088cd848100ab26abbffdcc7c0e8f18a83facd0a8637c460e3ac88d483b04b46
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB
dist/index.html CHANGED
@@ -1465,7 +1465,7 @@
  </script> -->
 
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
- <div class="figure-legend"><p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather, and ReduceScatter operations</p></div> <!-- RH: Should the figure and legend use all-reduce, all-gather, and reduce-scatter instead of AllReduce, AllGather, and ReduceScatter, to match the rest of the text? -->
+ <div class="figure-legend"><p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for all-reduce, all-gather, and reduce-scatter operations</p></div> <!-- RH: Should the figure and legend use all-reduce, all-gather, and reduce-scatter instead of AllReduce, AllGather, and ReduceScatter, to match the rest of the text? -->
 
  <p>Sequence and context parallelism can help for long sequences, but they don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+ parameters), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning another parallelism dimension: <em>pipeline parallelism</em> (PP).</p>
 
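
Aside: the legend above names the three collectives used throughout the post. As a rough illustration of what they do (a minimal CPU-only PyTorch sketch, not code from the post; the port and world size are arbitrary):

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Toy CPU-only setup with the gloo backend; real training would use NCCL on GPUs.
        dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                                rank=rank, world_size=world_size)

        # all-reduce: every rank ends up with the elementwise sum over all ranks.
        x = torch.full((4,), float(rank + 1))
        dist.all_reduce(x, op=dist.ReduceOp.SUM)

        # all-gather: every rank receives a copy of every rank's tensor.
        shards = [torch.empty(4) for _ in range(world_size)]
        dist.all_gather(shards, torch.full((4,), float(rank + 1)))

        # reduce-scatter combines the two ideas: the reduced (summed) result is split
        # so that each rank keeps only its own shard; it is exposed as
        # dist.reduce_scatter on backends that support it (e.g., NCCL).

        if rank == 0:
            print("all_reduce:", x)
            print("all_gather:", shards)
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)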
 
@@ -1754,7 +1754,7 @@
 
  <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between them will depend on whether you decide to focus communication on transferring weights or activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a trade-off between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize as much as possible unnecessary communication overhead.</p>
 
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with pipeline parallelism and are complementary to it. These combinations don't raise any particular new challenges. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic) <!-- RH: Why "(sic)"? Safe to remove that? -->.</p>
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with pipeline parallelism and are complementary to it. These combinations don't raise any particular new challenges. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1.</p>
 
  <p><strong>Tensor parallelism</strong> (with <strong>sequence parallelism</strong>) is naturally complementary to and can be combined with both pipeline parallelism and ZeRO-3, as it relies on the distributive property of matrix multiplications, which allows weights and activations to be sharded and computed independently before being combined.</p>
 
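
Aside: the tensor parallelism paragraph above relies on the distributive property of matrix multiplication. A tiny self-contained sketch of that property (plain PyTorch on a single device, not the post's actual implementation): splitting the weight matrix column-wise, computing each shard independently, and concatenating the results reproduces the full matmul.

    import torch

    torch.manual_seed(0)
    X = torch.randn(4, 8)            # activations
    W = torch.randn(8, 6)            # full weight matrix
    W0, W1 = W.chunk(2, dim=1)       # column-wise shards, one per "GPU"

    full = X @ W                                   # unsharded result
    sharded = torch.cat([X @ W0, X @ W1], dim=1)   # shards computed independently, then combined

    assert torch.allclose(full, sharded, atol=1e-6)

Row-wise sharding of W works analogously, except the partial results are summed (an all-reduce) rather than concatenated.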
 
@@ -2092,11 +2092,11 @@
 
 
  <h3>A primer on GPUs</h3>
-
- <p>Generally, GPUs have a very hierarchical organization. In this primer, we’ll keep the discussion at the conceptual level that is necessary for the rest of our presentation.</p> <!-- RH: Should the second sentence here be an aside? If that works, also remove the paragraph break after "organization." Or, if you don't want to do that, swap the order of these two sentences? -->
-
- <p>On the compute side, a GPU consists of an array of compute units called <strong><em>streaming multiprocessors (SMs)</em></strong>. Each SM contains and controls a set of streaming processors, also known as <em>cores</em>. For example, an NVIDIA H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">the docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
+
+ <p>Generally, GPUs have a very hierarchical organization. On the compute side, a GPU consists of an array of compute units called <strong><em>streaming multiprocessors (SMs)</em></strong>. Each SM contains and controls a set of streaming processors, also known as <em>cores</em>. For example, an NVIDIA H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">the docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
 
+ <aside> In this primer, we’ll keep the discussion at the conceptual level that is necessary for the rest of our presentation.</aside>
+
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
  <div class="figure-legend"><p>Source: https://blog.codingconfessions.com/p/gpu-computing</p></div>
 
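
Aside: the merged paragraph above quotes 132 SMs × 128 cores per SM = 16,896 cores for an H100. A small sanity-check sketch (assumes a CUDA-capable PyTorch install; the cores-per-SM figure is taken from the paragraph, since PyTorch only reports the SM count):

    import torch

    props = torch.cuda.get_device_properties(0)
    cores_per_sm = 128  # FP32 cores per SM on H100, from the text above (not queried from the driver)

    print(f"Device:       {props.name}")
    print(f"SM count:     {props.multi_processor_count}")
    print(f"Approx cores: {props.multi_processor_count * cores_per_sm}")
    # On an H100 this should report 132 SMs, i.e. 132 * 128 = 16,896 cores.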
 
 
src/index.html CHANGED
@@ -1465,7 +1465,7 @@
  </script> -->
 
  <!-- <p><img alt="pp_comm_bandwidth.svg" src="/assets/images/pp_comm_bandwidth.svg" /></p> -->
- <div class="figure-legend"><p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for AllReduce, AllGather, and ReduceScatter operations</p></div> <!-- RH: Should the figure and legend use all-reduce, all-gather, and reduce-scatter instead of AllReduce, AllGather, and ReduceScatter, to match the rest of the text? -->
+ <div class="figure-legend"><p>Inter-node communication bandwidth measurements across different node counts, showing median (lines) and 5th-95th percentile ranges (shaded areas) for all-reduce, all-gather, and reduce-scatter operations</p></div> <!-- RH: Should the figure and legend use all-reduce, all-gather, and reduce-scatter instead of AllReduce, AllGather, and ReduceScatter, to match the rest of the text? -->
 
  <p>Sequence and context parallelism can help for long sequences, but they don’t help much if the root cause of our memory issues is not the sequence length but rather the size of the model itself. For large models (70B+ parameters), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning another parallelism dimension: <em>pipeline parallelism</em> (PP).</p>
 
@@ -1754,7 +1754,7 @@
 
  <p>As you can see, ZeRO-3 and PP solve the same challenge but involve different approaches, and the choice between them will depend on whether you decide to focus communication on transferring weights or activations. While they can be combined, it's not often done in practice as doing so requires increasing the global batch size significantly to amortize the communication costs, creating a trade-off between global batch size, model size, network bandwidth, and training efficiency. If you decide to combine them, ZeRO-3 should be configured to keep the weights in memory during the series of PP micro-batches to minimize as much as possible unnecessary communication overhead.</p>
 
- <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with pipeline parallelism and are complementary to it. These combinations don't raise any particular new challenges. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1 (sic) <!-- RH: Why "(sic)"? Safe to remove that? -->.</p>
+ <p>On the other hand, ZeRO-1 and ZeRO-2, which focus on optimizer states and gradients, can be easily combined with pipeline parallelism and are complementary to it. These combinations don't raise any particular new challenges. For instance, the training of DeepSeek-v3 used PP combined with ZeRO-1.</p>
 
  <p><strong>Tensor parallelism</strong> (with <strong>sequence parallelism</strong>) is naturally complementary to and can be combined with both pipeline parallelism and ZeRO-3, as it relies on the distributive property of matrix multiplications, which allows weights and activations to be sharded and computed independently before being combined.</p>
 
@@ -2092,11 +2092,11 @@
 
 
  <h3>A primer on GPUs</h3>
-
- <p>Generally, GPUs have a very hierarchical organization. In this primer, we’ll keep the discussion at the conceptual level that is necessary for the rest of our presentation.</p> <!-- RH: Should the second sentence here be an aside? If that works, also remove the paragraph break after "organization." Or, if you don't want to do that, swap the order of these two sentences? -->
-
- <p>On the compute side, a GPU consists of an array of compute units called <strong><em>streaming multiprocessors (SMs)</em></strong>. Each SM contains and controls a set of streaming processors, also known as <em>cores</em>. For example, an NVIDIA H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">the docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
+
+ <p>Generally, GPUs have a very hierarchical organization. On the compute side, a GPU consists of an array of compute units called <strong><em>streaming multiprocessors (SMs)</em></strong>. Each SM contains and controls a set of streaming processors, also known as <em>cores</em>. For example, an NVIDIA H100 GPU has 132 SMs with 128 cores per SM, resulting in a total of 16,896 cores (see <a href="https://resources.nvidia.com/en-us-tensor-core">the docs for tensor cores</a> for details), each capable of handling multiple threads simultaneously.</p>
 
+ <aside> In this primer, we’ll keep the discussion at the conceptual level that is necessary for the rest of our presentation.</aside>
+
  <p><img alt="image.png" src="/assets/images/diving_primergpu.svg" /></p>
  <div class="figure-legend"><p>Source: https://blog.codingconfessions.com/p/gpu-computing</p></div>
 
 
 