embed predict_memory #47
by nouamanetazi (HF staff) - opened
assets/images/predict_memory_tool.png
ADDED
(binary file stored with Git LFS)
assets/images/profile_trace_annotated.png
CHANGED
(binary file stored with Git LFS)
dist/assets/images/memorycoalescing.png
CHANGED
(binary file stored with Git LFS)
dist/assets/images/predict_memory_tool.png
ADDED
(binary file stored with Git LFS)
dist/index.html
CHANGED
@@ -193,19 +193,12 @@
 </div>
 </div>
 
-<p>While this widget gives a theoretical breakdown the following tool can be used to predict the memory usage:</p>
-
-
-<
-
-
-</li>
-<li>
-<p>
-<a href="https://pytorch.org/docs/stable/torch_cuda_memory.html">torch_cuda_memory</a>
-</p>
-</li>
-</ul>
+<p>While this widget gives a theoretical breakdown we also made the <a href="https://huggingface.co/spaces/nanotron/predict_memory">following tool</a> that can be used to predict the memory usage during a training run:</p>
+
+<a href="https://huggingface.co/spaces/nanotron/predict_memory">
+<img src="/assets/images/predict_memory_tool.png" alt="Predict Memory Tool" />
+</a>
+
 
 <p><strong>Clear code implementations:</strong> theory is one thing, but we discover all kinds of edge cases and important details when we implement something. That’s why we link to implementation references where possible. Depending on the case, we’ll use two code references:</p>
 
@@ -1874,7 +1867,7 @@
 
 <p>Clearly, none of these techniques is a silver bullet for magical scaling and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point to choose among –and combine– them? This will be the topic of our next section.</p>
 
-<h2>
+<h2>Finding the Best Training Configuration</h2>
 
 <p>We’ve now covered all the parallelism techniques that are actually used to distribute and training larger models as well as how and why they can be combined together. There remain a general question: which ones should we choose in the end and how to decide on a specific combination?</p>
 
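The paragraph added above promises a tool that predicts memory usage during a training run. As a rough illustration of what such a predictor computes, here is a minimal back-of-envelope sketch; the byte counts assume a standard mixed-precision Adam layout (bf16 parameters and gradients, fp32 master weights and optimizer moments) and ignore activations, so this is not the exact model behind the linked Space:

```python
# Back-of-envelope training-memory estimate (illustrative only; the real
# predict_memory Space accounts for much more, e.g. activations and parallelism).
def estimate_training_memory_gib(num_params: float) -> dict[str, float]:
    # Assumed mixed-precision Adam layout: 2 + 2 + 4 + 4 + 4 = 16 bytes/param.
    bytes_per_param = {
        "bf16 parameters": 2,
        "bf16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam momentum": 4,
        "fp32 Adam variance": 4,
    }
    gib = 1024**3
    breakdown = {name: num_params * b / gib for name, b in bytes_per_param.items()}
    breakdown["total (excluding activations)"] = sum(breakdown.values())
    return breakdown

# Example: a 7B-parameter model needs roughly 7e9 * 16 bytes ~ 104 GiB
# before activations, which is why it cannot be trained on a single GPU.
for name, gib in estimate_training_memory_gib(7e9).items():
    print(f"{name:30s} {gib:7.1f} GiB")
```

Even this crude sum shows why the breakdown matters: optimizer state and master weights dwarf the bf16 model weights themselves.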
src/index.html
CHANGED
@@ -193,19 +193,12 @@
 </div>
 </div>
 
-<p>While this widget gives a theoretical breakdown the following tool can be used to predict the memory usage:</p>
-
-
-<
-
-
-</li>
-<li>
-<p>
-<a href="https://pytorch.org/docs/stable/torch_cuda_memory.html">torch_cuda_memory</a>
-</p>
-</li>
-</ul>
+<p>While this widget gives a theoretical breakdown we also made the <a href="https://huggingface.co/spaces/nanotron/predict_memory">following tool</a> that can be used to predict the memory usage during a training run:</p>
+
+<a href="https://huggingface.co/spaces/nanotron/predict_memory">
+<img src="/assets/images/predict_memory_tool.png" alt="Predict Memory Tool" />
+</a>
+
 
 <p><strong>Clear code implementations:</strong> theory is one thing, but we discover all kinds of edge cases and important details when we implement something. That’s why we link to implementation references where possible. Depending on the case, we’ll use two code references:</p>
 
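Side note: the torch_cuda_memory page removed in these hunks documents PyTorch's CUDA memory snapshot tooling, the empirical counterpart to the predictor above. A minimal usage sketch (using the underscore-prefixed APIs from that page; requires a CUDA build of PyTorch, and the toy model/steps here are just placeholders):

```python
import torch
from torch import nn

# Record allocator events, run a few optimizer steps, then dump a snapshot
# that can be inspected at https://pytorch.org/memory_viz.
torch.cuda.memory._record_memory_history(max_entries=100_000)

model = nn.Linear(4096, 4096).cuda()
opt = torch.optim.Adam(model.parameters())
for _ in range(3):
    loss = model(torch.randn(64, 4096, device="cuda")).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```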