FSDP Training with Mistral-Small-3.1-24B-Instruct-2503 Model and DecoderLayer
#62 · by ian00000
Hi Mistral team,
I am working on training the mistralai/Mistral-Small-3.1-24B-Instruct-2503 model with Fully Sharded Data Parallel (FSDP), but I have run into an issue: the `transformers.models.mistral3.modeling_mistral3` module for this model does not define its own decoder layer class that I can target with FSDP's auto-wrap policy.
Given this, I wanted to ask whether it would be possible to target `MistralDecoderLayer` from the plain mistral model (`transformers.models.mistral.modeling_mistral`) instead, or whether that could cause compatibility or performance issues. Would that layer class be suitable for wrapping with FSDP, or should I look for a different approach? A sketch of what I have in mind follows.
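For context, here is a minimal sketch of the setup I am considering. It assumes the language tower of Mistral3 is built from `MistralDecoderLayer` and that `AutoModelForImageTextToText` is the right auto class for this checkpoint; please correct me if either assumption is wrong:

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForImageTextToText
from transformers.models.mistral.modeling_mistral import MistralDecoderLayer

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).

# Load the multimodal checkpoint; my understanding is that its language
# tower is built from MistralDecoderLayer, which would make that class
# the natural unit for FSDP's auto-wrap policy.
model = AutoModelForImageTextToText.from_pretrained(
    "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    torch_dtype=torch.bfloat16,
)

# Shard at the granularity of each decoder layer.
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={MistralDecoderLayer},
)

fsdp_model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    device_id=torch.cuda.current_device(),
)
```

If the `Trainer` route is preferable, I assume the equivalent would be passing `fsdp_config={"transformer_layer_cls_to_wrap": ["MistralDecoderLayer"]}`, possibly with the vision encoder's layer class added to the set as well, but I am not sure whether the vision tower needs its own wrap class.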
Thank you for your assistance. I look forward to your guidance.