Questions on how routed experts are merged

#1
by chuhac

Thank you all for this great work, but I have questions about:

custom merge of R1s and V3s routed experts

After carefully checking the weights, I found that the actual custom merge is that R1T reuses the expert routing gate and the shared experts from the V3-0324 model while taking all routed experts from the R1 model.
I would like to discuss the inductive bias behind this design, since the R1T model appears to behave very well despite the mismatch between weights from different base models.

It appears to be as smart as R1 but much faster, using 40% fewer output tokens.

If anyone is interested in the details of the claims above, I'm glad to share the code I used to examine the expert merge. A few takeaways are listed below, with a rough sketch of the implied assembly recipe right after the list:

  • Embedding: Reused from the V3 model.
  • Attention: Reused from the V3 model.
  • Dense Blocks
    • The first three dense blocks are completely reused from the V3 model.
  • MoE Blocks
    • Expert Parameters
      • Shared Experts: Directly reused from the V3 model.
      • Routed Experts: All routed experts are taken from the R1 model.
    • Expert Router: Reused from the V3 model.
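
To make the layout above concrete, here is a rough sketch of the assembly recipe it implies, assuming the parents' weights have been loaded into plain v3_sd / r1_sd state dicts (the dict names are illustrative, and this is only my reconstruction, not the authors' actual pipeline):

# Selection-style merge: every tensor comes from V3-0324 except the routed
# experts, which are taken from R1. Shared experts (.mlp.shared_experts.) and
# the router (.mlp.gate.weight) do not match the pattern and therefore stay V3.
def comes_from_r1(name: str) -> bool:
    return ".mlp.experts." in name

merged_sd = {
    name: (r1_sd[name] if comes_from_r1(name) else v3_sd[name])
    for name in v3_sd
}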

Well, it's a merge, so of course they are reusing modules from both models; I fail to see your point.
They never claimed to have "created" the ultimate model. They merged the best of two worlds, which is an amazing feat of its own, tbh.

@Daemontatox My point is that the claimed custom merge sounds as if it involved a genuine blending of the routed experts from the R1 and V3 models, with the routing gate also handled in some more sophisticated way.

However, the R1T model simply takes all routed experts from the R1 model while reusing the routing gate from the V3 model. The inductive bias behind this is what interests me here.

TNG Technology Consulting GmbH org

Dear Junda,

thanks for your work, your very cool analysis, and your thoughts. Your analysis of the released model and its construction blueprint is exactly right - bravo!

We consider the released model a custom merge, as the MoE layers consist of selected parts of both parent models, as you pointed out. Of course, the zoo of possible combinations is much bigger. For example, we also created versions in which the parameters of the routed experts themselves are linearly mixed between V3-0324 and R1.

Pretty much all mixing settings that we tried resulted in workable models, so the construction process seems robust.

Nonetheless, the weight-mixing dimension has interesting properties. For example, if you mix the routed experts' weights 50%-50% between both parent models, the reasoning traces visible in pure R1 output are not there. However, if you increase R1's share in the mix to 75%, the visible reasoning traces are there (!), and they have a subjectively nice, slightly more compact style compared to R1 itself.
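
In code, such a mix is conceptually just a per-tensor linear interpolation of the routed experts. A minimal sketch, with assumed v3_sd / r1_sd state dicts of torch tensors (illustrative only, not our production merge tooling), could look like this:

def mix_routed_experts(v3_sd, r1_sd, r1_share=0.75):
    # Interpolate only the routed experts; all other tensors stay pure V3-0324.
    # r1_share=1.0 corresponds to the selection-style merge of the released model.
    merged = {}
    for name, v3_w in v3_sd.items():
        if ".mlp.experts." in name:  # routed experts of the MoE blocks
            merged[name] = (1.0 - r1_share) * v3_w.float() + r1_share * r1_sd[name].float()
        else:
            merged[name] = v3_w
    return merged

With r1_share=0.5 the visible reasoning traces disappear as described above, while r1_share=0.75 keeps them.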

If you would like to share your code, e.g. here on HF, we'd love to take a look, and surely others would as well - thank you :-).

Cheers!

@TNGHK
Dear HK from TNG Team,

I appreciate your swift feedback and thank you for sharing these insights into the weight merge. As you explain, expert-level merging is indeed more straightforward and better suited to the architecture of MoE models.

if you mix the routed experts' weights 50%-50% of both parent models, the reasoning traces visible in pure R1 output are not there. However, if you increase R1's share to 75% in the mix, the visible reasoning traces are there

The provided details really help the community. These phenomena also strengthen my sense of what is possible with weight merging in MoE models, and they make me wonder whether the FFN weights, i.e., the experts, are exactly the ones responsible for reasoning capabilities.

Overall, thanks for your team's great work, and I hope you can share more insights on weight merging between reasoning and non-reasoning models in future work.

Here I also share a relatively naive (yet straightforward) implementation of the weight examination, to help the community:

# Before the loop: load the three checkpoints into plain {name: tensor} state
# dicts v3_tensors, r1_tensors and r1t_tensors (e.g. via safetensors).
import torch

def same(a, b):
    # Near-equality check in float32 with a small tolerance.
    return torch.allclose(a.to(torch.float32), b.to(torch.float32), atol=1e-5)

def report(key):
    # Print which parent model(s) the R1T tensor matches.
    print("v3 r1t", same(v3_tensors[key], r1t_tensors[key]))
    print("v3 r1 ", same(v3_tensors[key], r1_tensors[key]))
    print("r1 r1t", same(r1_tensors[key], r1t_tensors[key]))

for li in range(61):          # DeepSeek-V3 has 61 layers; the first 3 are dense
    for i in range(-2, 256):  # -2: router gate, -1: shared experts, 0..255: routed experts
        print('=' * 100)
        print(f"Expert: {i}")
        if i == -2:
            K = f"model.layers.{li}.mlp.gate.weight"
            if K in r1t_tensors:       # dense blocks have no router
                print(K)
                report(K)
            continue
        for mlp_name in ("up", "gate", "down"):
            if i == -1:
                K = f"model.layers.{li}.mlp.shared_experts.{mlp_name}_proj.weight"
                print(f"Shared MLP Name: {mlp_name}")
            else:
                K = f"model.layers.{li}.mlp.experts.{i}.{mlp_name}_proj.weight"
                print(f"Expert MLP Name: {mlp_name}")
            if K not in r1t_tensors:   # skip the first three dense blocks
                continue
            report(K)
Okay, so I was curious whether this could be done directly on the quantized GGUF models (since merging the actual FP8 model, converting to BF16, then quantizing it takes absolutely forever, especially since you need to create a new imatrix). I think it should work as long as we only pick A/B weights instead of merging them directly.

Here's a rough script that tries to match the logic outlined above, using the Unsloth dynamic GGUF quants for R1 / V3 as the source: deepseek_r1t_chimera_diy_gguf.py.

Unsure if the logic/mapping is 100% correct but it does appear to work (i.e. no thinking block in the output). I guess the routed R1 experts might be somewhat suboptimal since they would've been quantized with the input/activations from the rest of the R1 model in mind instead of the V3 parts.
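
For reference, the core selection rule at the GGUF level is roughly the following (tensor names assume llama.cpp's usual conventions for DeepSeek-style MoE layers, where the routed experts are stored stacked as blk.{i}.ffn_{up,gate,down}_exps.weight; double-check against the names gguf.GGUFReader actually reports):

from gguf import GGUFReader  # pip install gguf

def source_for(name: str) -> str:
    # Routed experts ("..._exps.weight") come from R1; the router (ffn_gate_inp),
    # shared experts (ffn_*_shexp) and everything else come from V3-0324.
    return "R1" if "_exps." in name else "V3"

reader = GGUFReader("DeepSeek-R1-Q4_K_M.gguf")  # illustrative path
for t in reader.tensors:
    print(t.name, "->", source_for(t.name))

The heavy lifting of actually writing out the merged GGUF (metadata plus quantized tensor data) is left to the linked script.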
