embed predict_memory #47
opened by nouamanetazi
assets/images/predict_memory_tool.png ADDED

Git LFS Details

  • SHA256: b079e5968c1ddfff6f0f663db43f6fb9715240e92dd455875194575eb4c98313
  • Pointer size: 130 Bytes
  • Size of remote file: 94.3 kB
assets/images/profile_trace_annotated.png CHANGED

Git LFS Details (before)

  • SHA256: e1806f717e427febe26bfa45135d45d76adc9808c8a92553f7f7e0bb9faa80ae
  • Pointer size: 131 Bytes
  • Size of remote file: 995 kB

Git LFS Details (after)

  • SHA256: 7359ca99eff4eaa53952bfba0dd562ab6bb9b109033f283d296aef2471e642bc
  • Pointer size: 131 Bytes
  • Size of remote file: 995 kB
dist/assets/images/memorycoalescing.png CHANGED

Git LFS Details (before)

  • SHA256: 088cd848100ab26abbffdcc7c0e8f18a83facd0a8637c460e3ac88d483b04b46
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB

Git LFS Details (after)

  • SHA256: 1094fe9aeb953c743791445ee6d7e73a5a89fa85fe60f4312266d1265e7c591a
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB
dist/assets/images/predict_memory_tool.png ADDED

Git LFS Details

  • SHA256: a61828c60b0e39e57c3d050474889dc51c87f5fdaa6d5afa4ae7b55e329678b2
  • Pointer size: 130 Bytes
  • Size of remote file: 26.5 kB
dist/index.html CHANGED

@@ -193,19 +193,12 @@
     </div>
     </div>
 
-    <p>While this widget gives a theoretical breakdown the following tool can be used to predict the memory usage:</p>
-    <ul>
-        <li>
-            <p>
-                <a href="https://huggingface.co/spaces/nanotron/predict_memory">predict_memory</a>
-            </p>
-        </li>
-        <li>
-            <p>
-                <a href="https://pytorch.org/docs/stable/torch_cuda_memory.html">torch_cuda_memory</a>
-            </p>
-        </li>
-    </ul>
+    <p>While this widget gives a theoretical breakdown we also made the <a href="https://huggingface.co/spaces/nanotron/predict_memory">following tool</a> that can be used to predict the memory usage during a training run:</p>
+
+    <a href="https://huggingface.co/spaces/nanotron/predict_memory">
+        <img src="/assets/images/predict_memory_tool.png" alt="Predict Memory Tool" />
+    </a>
+
 
     <p><strong>Clear code implementations:</strong> theory is one thing, but we discover all kinds of edge cases and important details when we implement something. That’s why we link to implementation references where possible. Depending on the case, we’ll use two code references:</p>
 
@@ -1874,7 +1867,7 @@
 
     <p>Clearly, none of these techniques is a silver bullet for magical scaling and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point to choose among –and combine– them? This will be the topic of our next section.</p>
 
-    <h2>How to Find the Best Training Configuration</h2>
+    <h2>Finding the Best Training Configuration</h2>
 
     <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models as well as how and why they can be combined together. There remain a general question: which ones should we choose in the end and how to decide on a specific combination?</p>
 
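The predict_memory Space embedded above automates the kind of static memory accounting sketched below. This is a minimal back-of-the-envelope illustration rather than the Space's actual implementation: the 16 bytes per parameter assume mixed-precision training with the Adam optimizer, and estimate_static_memory_gb is an invented helper name.

def estimate_static_memory_gb(n_params: float) -> float:
    """Static (non-activation) training memory for mixed-precision Adam."""
    bytes_per_param = (
        2    # bf16 parameters
        + 2  # bf16 gradients
        + 4  # fp32 master copy of the parameters
        + 4  # fp32 Adam first moment (momentum)
        + 4  # fp32 Adam second moment (variance)
    )
    return n_params * bytes_per_param / 1e9

# An 8B-parameter model needs ~128 GB before counting activations,
# more than a single 80 GB GPU holds unless optimizer states are sharded.
print(f"{estimate_static_memory_gb(8e9):.0f} GB")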
src/index.html CHANGED

@@ -193,19 +193,12 @@
     </div>
     </div>
 
-    <p>While this widget gives a theoretical breakdown the following tool can be used to predict the memory usage:</p>
-    <ul>
-        <li>
-            <p>
-                <a href="https://huggingface.co/spaces/nanotron/predict_memory">predict_memory</a>
-            </p>
-        </li>
-        <li>
-            <p>
-                <a href="https://pytorch.org/docs/stable/torch_cuda_memory.html">torch_cuda_memory</a>
-            </p>
-        </li>
-    </ul>
+    <p>While this widget gives a theoretical breakdown we also made the <a href="https://huggingface.co/spaces/nanotron/predict_memory">following tool</a> that can be used to predict the memory usage during a training run:</p>
+
+    <a href="https://huggingface.co/spaces/nanotron/predict_memory">
+        <img src="/assets/images/predict_memory_tool.png" alt="Predict Memory Tool" />
+    </a>
+
 
     <p><strong>Clear code implementations:</strong> theory is one thing, but we discover all kinds of edge cases and important details when we implement something. That’s why we link to implementation references where possible. Depending on the case, we’ll use two code references:</p>
 
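The bullet list removed in both files also pointed to PyTorch's torch_cuda_memory documentation, the empirical counterpart to the predictor above. Below is a minimal sketch of how that snapshot API is typically used; the model and training loop are placeholders, not code from this repository.

# Capture an empirical CUDA memory trace with PyTorch's snapshot API
# (described in the torch_cuda_memory docs linked in the removed list).
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder model
optimizer = torch.optim.Adam(model.parameters())

for _ in range(3):  # a few steps to populate the allocation history
    x = torch.randn(64, 4096, device="cuda")
    loss = model(x).square().mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Dump a snapshot that can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording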