Complete-muE — Tune Dense Once, Transfer to All MoE Configurations

TL;DR

If you train Mixture-of-Experts (MoE) transformers, you don't need to re-sweep learning rate, weight decay, or init for every architecture. Tune them once on a small dense FFN proxy, plug the values into a deterministic rule, and they transfer near-optimally to any large MoE configuration — different numbers of activated experts, different total expert counts, different granularities, shared experts, group-balanced routing, width, depth, and batch size.

The whole point is the asymmetry: small dense sweep, large MoE deployment. Our large-scale runs tune at proxy width $d_\star = 128$ and deploy at $d = 1024$ ($8\times$ wider). At that scale the recipe gives $\sim\!4.5\times$ convergence speedup for $240$P $5$s video diffusion and $\sim\!5.3\text{-}5.5\times$ for LLM training versus a dense baseline at the same hyperparameters.

👤 If you're new to MoE, the next two sections build the intuition.
🛠️ If you train models for a living, jump to the recipe.
🔬 If you want the math, the "under the hood" boxes throughout have the technical details.

Why MoE tuning is so painful

A modern Mixture-of-Experts transformer block has many knobs you can turn:

How wide each expert is ($h$)
How many experts the router activates per token ($a$, "activated experts")
How many experts exist in total ($N$, "capacity")
Granularity (more, smaller experts vs. fewer, larger ones)
Shared experts that always fire
Group-balanced routing
...all on top of the usual width, depth, batch size, and training duration

The catch: each of these choices changes two things at once.

1. The architecture changes

Replacing a dense FFN with an MoE block, or making the experts smaller, alters the parameterized network. Signals propagate differently. Update sizes shift.

2. The data each expert sees changes

An expert only trains on tokens that the router sends to it. With $a$ activated experts out of $N$, each expert sees roughly $a/N$ of the tokens. Change $a$ or $N$ and you've changed each expert's effective batch size and total training duration.
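As a quick sanity check on the $a/N$ figure, the sketch below simulates a stand-in router that picks $a$ distinct experts uniformly at random for each token. The uniform router and the name `expert_token_share` are illustrative assumptions, not the routing used in the paper; a trained, load-balanced router would only approximate this behavior.

```python
import random

def expert_token_share(num_tokens, total_experts, active_experts, seed=0):
    """Fraction of tokens each expert receives under uniform random routing."""
    rng = random.Random(seed)
    counts = [0] * total_experts
    for _ in range(num_tokens):
        # Stand-in router: choose `active_experts` distinct experts uniformly.
        for expert in rng.sample(range(total_experts), active_experts):
            counts[expert] += 1
    return [c / num_tokens for c in counts]

shares = expert_token_share(num_tokens=20_000, total_experts=16, active_experts=4)
mean_share = sum(shares) / len(shares)
print(f"expected a/N = {4 / 16:.3f}, mean observed share = {mean_share:.3f}")
print(f"min share = {min(shares):.3f}, max share = {max(shares):.3f}")
```

Doubling `active_experts` or halving `total_experts` doubles every expert's share, which is the per-expert batch change the text describes.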

This double-change is exactly what existing transfer rules cannot handle:

$\mu$P (Yang et al., 2022, Tensor Programs V, arXiv:2203.03466) handles changes to a fixed parameterized model very well: it gives you zero-shot width transfer. Subsequent work extends it to depth, practical training, and diffusion transformers (Dey et al., 2025, arXiv:2505.01618; Mlodozeniec et al., 2025, arXiv:2512.22382; Zheng et al., 2025, arXiv:2505.15270). But plain $\mu$P has no slot for the per-expert workload.
SDE-based batch/duration rules (Malladi et al., 2022, arXiv:2205.10287) handle changes in tokens-per-step for a fixed architecture. They don't apply when the architecture itself changes.
Recent MoE-specific transfer studies (Małaśnicki et al., 2025, arXiv:2506.16962; Jiang et al., 2026) preserve learning-rate ranges under expert-count sweeps but stop short of one compositional rule for the full MoE design space.

So when you go from dense FFN to MoE, or change the MoE capacity, none of these tools alone tells you what learning rate to use. Practitioners end up doing what they always did: another hyperparameter sweep. That gets expensive fast.

The core insight: active width + expert workload

Complete-muE's main observation is that you can take the messy MoE design space and decompose it into two clean, well-understood steps. Each step changes one thing.

Step 1 — Active width. A sparse MoE block that uses $a$ experts of width $h$ behaves, to first order, like a dense FFN with hidden width $H_a = a \cdot h$. This single quantity — the active width — governs everything that classical $\mu$P knows how to handle.

Step 2 — Expert workload. Once active width is held fixed, the only remaining thing that changes when you flip between dense and sparse, or vary the activated-expert count, is how many tokens each expert sees per step. That's exactly the SDE setting — but applied to the per-expert training process rather than the whole model.

Decomposing the problem this way is the entire trick. Each step is a known problem with a known answer. The hard MoE-specific cases — total expert count, granularity, hybrids with shared experts and group-balanced routing — turn out to be compositions of these two steps. Nothing new needs to be invented for them.

Two bridges from dense to any MoE

Bridge I: Dense FFN ↔ Dense MoE

The first bridge says: a dense FFN of width $H$ and a "Dense MoE" with $N$ experts of width $h = H/N$ (every expert active for every token) carry the same forward signal and the same per-step update size, as long as you do two small things:

Use an output multiplier $A = d/H$ on the down-projection (where $d$ is the residual width). This is the standard $\mu$P factor for a width-$H$ FFN.
Use a route scale equal to the number of activated experts. Normalized routing averages expert outputs, which would otherwise shrink the update by $1/a$; the route scale undoes that.

Under the hood

For a unit-expansion dense companion at backbone width $d$, the matching rule for the FFN/MoE output branch is

$$A(H) = \frac{d}{H}, \qquad \sigma_{\mathrm{down}}(d, H) = \left(\frac{H}{d}\right)^{1/2} \sigma_{\mathrm{down}}^{(1)}(d), \qquad \eta_{\mathrm{down}}(d, H) = \eta_{\mathrm{down}}^{(1)}(d).$$

This is essentially $\mu$P applied to the active width $H$, plus a normalized-route correction $r_a = a$ (the route scale, written $R$ in the recipe table).
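The Bridge I claim can be checked numerically on a toy block. The sketch below is a minimal illustration under several assumptions that are not from the paper: a ReLU FFN instead of SwiGLU, a router that assigns uniform normalized weights $1/N$, a hypothetical base std `sigma_down_unit`, and a $1/\sqrt{d}$ up-projection init. It slices one set of dense weights into $N$ experts and compares the two forward passes.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_ffn(x, W_up, W_down, out_mult):
    # W_up has shape (H, d); W_down has shape (d, H).
    return [out_mult * y for y in matvec(W_down, relu(matvec(W_up, x)))]

def dense_moe(x, W_up, W_down, out_mult, num_experts):
    # Slice the same weights into num_experts blocks of width H / num_experts.
    H, d = len(W_up), len(x)
    h = H // num_experts
    route_weight = 1.0 / num_experts  # uniform normalized routing (assumption)
    route_scale = num_experts         # R = a, and a = N when every expert is active
    out = [0.0] * d
    for e in range(num_experts):
        cols = range(e * h, (e + 1) * h)
        up_e = [W_up[r] for r in cols]
        down_e = [[W_down[i][c] for c in cols] for i in range(d)]
        y_e = matvec(down_e, relu(matvec(up_e, x)))
        for i in range(d):
            out[i] += route_scale * route_weight * out_mult * y_e[i]
    return out

rng = random.Random(0)
d, H, N = 8, 32, 4
sigma_down_unit = 0.02                           # hypothetical sigma_down^(1)(d)
sigma_down = math.sqrt(H / d) * sigma_down_unit  # down-projection init rule
W_up = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(H)]
W_down = [[rng.gauss(0.0, sigma_down) for _ in range(H)] for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

A = d / H  # output multiplier
y_dense = dense_ffn(x, W_up, W_down, A)
y_moe = dense_moe(x, W_up, W_down, A, N)
max_gap = max(abs(p - q) for p, q in zip(y_dense, y_moe))
print(f"max |dense - dense_moe| = {max_gap:.2e}")
```

Setting `route_scale = 1` in `dense_moe` shrinks the output by a factor of $N$, which is the averaging effect the route scale is meant to undo.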

Bridge II: Dense MoE ↔ Sparse MoE

The second bridge handles the move from a Dense MoE (all $N$ experts active) to a sparse MoE (only $a < N$ active). Once Bridge I is applied at the active width $H_a = ah$, the only remaining change is the per-expert data exposure.

Here's the elegant cancellation: when you switch from $a$ activated experts to $a'$ at fixed global batch size and steps, each expert's batch changes by the same factor as its training duration. Both go up or down together by $a'/a$. The standard SDE correction cancels:

Under the hood

$$\rho_B^{\mathrm{exp}} = \frac{B_{\mathrm{exp}}(a')}{B_{\mathrm{exp}}(a)} = \frac{a'}{a}, \qquad \rho_D^{\mathrm{exp}} = \frac{D_{\mathrm{exp}}(a')}{D_{\mathrm{exp}}(a)} = \frac{a'}{a}.$$

So the dense-style correction $\eta' \approx \eta\sqrt{\rho_B^{\mathrm{exp}}/\rho_D^{\mathrm{exp}}} = \eta$ — no first-order learning rate or weight decay multiplier is needed.
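The cancellation is simple enough to spell out in a few lines. This is a sketch under the stated setting (fixed global batch, fixed steps, fixed $N$, balanced routing); the function names are illustrative, not from the paper.

```python
import math

def expert_workload_ratios(a_old, a_new):
    """Per-expert batch and duration ratios at fixed global batch, steps, and N."""
    rho_b_exp = a_new / a_old  # tokens per expert per step
    rho_d_exp = a_new / a_old  # tokens per expert over the whole run
    return rho_b_exp, rho_d_exp

def sde_lr_multiplier(rho_b, rho_d):
    """Dense-style SDE correction: eta' = eta * sqrt(rho_B / rho_D)."""
    return math.sqrt(rho_b / rho_d)

multipliers = []
for a_new in (1, 2, 4, 8, 16):
    rho_b, rho_d = expert_workload_ratios(a_old=2, a_new=a_new)
    mult = sde_lr_multiplier(rho_b, rho_d)
    multipliers.append(mult)
    print(f"a: 2 -> {a_new:2d}  rho_B = {rho_b:4.1f}  rho_D = {rho_d:4.1f}  LR x {mult}")
```

Whatever the new activated-expert count, the two ratios are equal, so the multiplier stays at 1.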

What does change is the expert-side noise-to-signal level $\sigma_0(a) = \eta \cdot \sigma_{\mathrm{exp}}(a)$. Making the MoE denser (larger $a$) lowers the per-step gradient noise, which is why denser MoEs can reach lower loss at the same hyperparameters. We'll return to this.

Everything else is composition

That's it for new primitives. Every other MoE setting — total capacity, granularity, hybrid blocks with shared experts and grouped routing — is built by composing Bridge I and Bridge II:

Total capacity ($N \to N'$): first widen the Dense-MoE companion (Bridge I), then re-sparsify back to $a$ active experts (Bridge II). The two factors cancel exactly.
Granularity at fixed density ($s = a/N$): the active width is preserved by construction, so the same sparse-layer rule applies.
Hybrid blocks (shared + grouped + routed): use one common active-width multiplier over the total active width $H_{\mathrm{tot}}$ and apply route scale only to routed branches.
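For the hybrid case, a minimal sketch of the bookkeeping might look like the following. It assumes, as an interpretation rather than something stated above, that $H_{\mathrm{tot}}$ is the sum of the routed active width $a \cdot h$ and the widths of the always-on shared experts; grouped branches are left out, and `hybrid_rule` is a hypothetical helper name.

```python
def hybrid_rule(d, routed_active, routed_width, shared_count, shared_width):
    """Common output multiplier and per-branch route scales for a hybrid block."""
    # Assumption: total active width = routed active width + shared width.
    h_tot = routed_active * routed_width + shared_count * shared_width
    return {
        "H_tot": h_tot,
        "A": d / h_tot,                       # one multiplier for all branches
        "route_scale_routed": routed_active,  # R = a on the normalized routed sum
        "route_scale_shared": 1,              # no route scale on shared experts
    }

rule = hybrid_rule(d=1024, routed_active=8, routed_width=128,
                   shared_count=1, shared_width=128)
print(rule)
```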

The recipe in three steps

1. Tune once on a small dense FFN proxy. Pick a small backbone width (we use $d_\star = 128$), batch size, and a short training duration. Sweep learning rate, weight decay, and initialization on this small dense reference. This is the one and only hyperparameter sweep you do; it stays cheap regardless of how large the target MoE will be.
2. Read the layer-level rule from Table 1 of the paper for your target MoE block. Compute the active width, look up the output multiplier $A$, the route scale $R$, and the down-projection initialization, then apply them.
3. Compose the global AdamW factors from Table 2 for any width / depth / batch / duration changes. For MoE-specific changes (different $a$, $N$, granularity, shared experts, group-balanced routing), no extra multiplier is needed, because Table 1 already absorbed those.

Tables 1 and 2 fit on one page. You do not need to re-sweep for any MoE setting once the dense reference is tuned.

The actual rule is even simpler than it sounds. There is a single active-width $H_{\mathrm{act}}$ that drives every MoE-specific adjustment, plus one routing choice that keeps things clean:

Knob	Rule	Notes
Active width $H_{\mathrm{act}}$	$H$ (dense); $ah$ (sparse MoE); $H_{\mathrm{tot}}$ (hybrid)	The single quantity that drives every rule below.
Output multiplier $A$	$A = d / H_{\mathrm{act}}$	Apply to the down-projection of the FFN/MoE branch.
Route scale $R$	$R = a$ (routed sums); $R = 1$ (dense / shared)	Only on the normalized routed sum, not on dense/shared branches.
Down-projection init std	$\sigma_{\mathrm{down}} = \sqrt{H_{\mathrm{act}}/d}\,\sigma_{\mathrm{down}}^{(1)}$	Scales relative to the unit-expansion dense companion.
Up / gate projections	Standard $\mu$P factors	Controlled by backbone width $d$ — no MoE-specific change.
Router readout	Standard $\mu$P factors	Treated like any $d$-controlled projection.
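Read as code, the per-layer table reduces to a couple of small functions. The sketch below covers the dense and sparse-MoE rows only; the function names and the example base std `sigma_down_unit` (standing in for $\sigma_{\mathrm{down}}^{(1)}$) are assumptions for illustration, and the up, gate, and router projections are left to whatever standard $\mu$P implementation you already use.

```python
import math

def dense_rule(d, hidden_width, sigma_down_unit):
    """Layer rule for a dense FFN branch with hidden width H."""
    return {
        "active_width": hidden_width,
        "output_multiplier": d / hidden_width,
        "route_scale": 1,
        "down_init_std": math.sqrt(hidden_width / d) * sigma_down_unit,
    }

def sparse_moe_rule(d, active_experts, expert_width, sigma_down_unit):
    """Layer rule for a routed MoE branch with `a` active experts of width h."""
    h_act = active_experts * expert_width
    return {
        "active_width": h_act,
        "output_multiplier": d / h_act,
        "route_scale": active_experts,
        "down_init_std": math.sqrt(h_act / d) * sigma_down_unit,
    }

dense = dense_rule(d=1024, hidden_width=4096, sigma_down_unit=0.01)
sparse = sparse_moe_rule(d=1024, active_experts=8, expert_width=512,
                         sigma_down_unit=0.01)
print(dense)
print(sparse)
```

The total expert count $N$ does not appear in `sparse_moe_rule`, which matches the statement that capacity changes need no extra multiplier. In this example the dense and sparse blocks share an active width of 4096, so they differ only in route scale.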

If the target differs from the dense calibration in batch size $B$ or total iterations $T$, apply the CompleteP multipliers below on top of the per-layer rule. The general form (from the paper's Table 2) is

$\eta, \lambda \propto \sqrt{\rho_B/\rho_D}, \qquad \epsilon \propto \sqrt{\rho_D/\rho_B}, \qquad 1{-}\beta_{1,2} \propto \rho_B/\rho_D,$

with $\rho_B = B'/B$ and $\rho_D = (B'T')/(BT)$. Three common instantiations:

Scenario	$\rho_B$	$\rho_D$	$\eta, \lambda$ ×	$\epsilon$ ×	$1{-}\beta_{1,2}$ ×	Transfer type
Fixed total tokens (batch ↑ $\kappa_B\!\times$, steps ↓ $\kappa_B\!\times$)	$\kappa_B$	$1$	$\sqrt{\kappa_B}$	$1/\sqrt{\kappa_B}$	$\kappa_B$	Exact. All three SDE objects ($\sigma_0$, $\widetilde\lambda$, $H_{\mathrm{SDE}}$) preserved.
Fixed iterations (batch ↑ $\kappa_B\!\times$, token budget ↑ $\kappa_B\!\times$)	$\kappa_B$	$\kappa_B$	$1$	$1$	$1$	Approximate. $\sigma_0$ shifts by $1/\sqrt{\kappa_B}$; optimal LR remains stable empirically.
Fixed batch size (steps ↑ $\kappa_D\!\times$, token budget ↑ $\kappa_D\!\times$)	$1$	$\kappa_D$	$1/\sqrt{\kappa_D}$	$\sqrt{\kappa_D}$	$1/\kappa_D$	Approximate. Iso-horizon: $H_{\mathrm{SDE}}$ and $\widetilde\lambda$ preserved; $\sigma_0$ shifts by $1/\sqrt{\kappa_D}$.
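The general form and the three rows above can be reproduced with a short helper. This is a sketch, not the paper's code: the function names are made up, the base LR and weight decay are the LM values quoted later in this post, and the `eps` and beta values are placeholder assumptions.

```python
import math

def adamw_transfer_multipliers(batch, steps, new_batch, new_steps):
    """Multipliers for AdamW hyperparameters when batch size and/or steps change."""
    rho_b = new_batch / batch
    rho_d = (new_batch * new_steps) / (batch * steps)
    ratio = rho_b / rho_d
    return {
        "lr": math.sqrt(ratio),
        "weight_decay": math.sqrt(ratio),
        "eps": 1.0 / math.sqrt(ratio),
        "one_minus_beta": ratio,
    }

def transfer_adamw(hparams, mult):
    """Apply the multipliers; betas are rescaled through 1 - beta."""
    out = {
        "lr": hparams["lr"] * mult["lr"],
        "weight_decay": hparams["weight_decay"] * mult["weight_decay"],
        "eps": hparams["eps"] * mult["eps"],
    }
    for key in ("beta1", "beta2"):
        beta = 1.0 - (1.0 - hparams[key]) * mult["one_minus_beta"]
        if not 0.0 <= beta < 1.0:
            raise ValueError(f"{key} = {beta} is outside [0, 1); change is too large")
        out[key] = beta
    return out

# eps and betas are placeholder assumptions, not values from the paper.
base = {"lr": 1e-3, "weight_decay": 0.1, "eps": 1e-8, "beta1": 0.9, "beta2": 0.95}

fixed_tokens = adamw_transfer_multipliers(256, 100_000, 1024, 25_000)
fixed_iters = adamw_transfer_multipliers(256, 100_000, 1024, 100_000)
fixed_batch = adamw_transfer_multipliers(256, 100_000, 256, 400_000)
print(transfer_adamw(base, fixed_tokens))
print(fixed_iters)
print(fixed_batch)
```

The guard in `transfer_adamw` is a design choice of this sketch: a large enough batch increase at fixed tokens would push $1 - \beta$ past 1, and failing loudly seems safer than clamping silently.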

A small honest caveat about drift

The first-order SDE correction in Bridge II cancels exactly, but there's a second-order effect we want to be upfront about: the per-step expert-side noise $\sigma_0(a)$ shifts when $a$ changes. The change is bounded:

$$\sigma_0(a') = \frac{\sigma_0(a)}{\sqrt{\rho_B^{\mathrm{exp}}}}.$$

This means Bridge II is not a strict SDE invariance — it's a relatively stable transfer with some mild residual drift in the optimal hyperparameters. The same kind of bounded drift shows up in two other places in the paper:

Capacity scaling (varying $N$ at fixed $a, h$), because the composition that builds it uses Bridge II as a sub-step.
Batch-size transfer at fixed training iterations, where the gradient noise $\sigma_0$ shifts even though the optimization horizon $H_{\mathrm{SDE}} = T\eta^2$ is preserved.
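To get a feel for the size of this drift, the sketch below evaluates the $1/\sqrt{\cdot}$ shift for a sweep over activated experts and for a batch-size change at fixed iterations. The helper name `sigma0_shift` and the chosen sweep values are illustrative assumptions.

```python
import math

def sigma0_shift(workload_ratio):
    """Multiplicative change in sigma_0 when the relevant batch grows by this ratio."""
    return 1.0 / math.sqrt(workload_ratio)

# Bridge II: the per-expert batch ratio is a' / a.
a_ref = 2
for a_new in (1, 2, 4, 8, 16):
    print(f"a: {a_ref} -> {a_new:2d}  sigma_0 x {sigma0_shift(a_new / a_ref):.3f}")

# Batch-size transfer at fixed iterations: the ratio is kappa_B.
kappa_b = 4
print(f"batch x{kappa_b} at fixed iterations: sigma_0 x {sigma0_shift(kappa_b):.3f}")
```

The shift grows only with the square root of the workload change, so an $8\times$ change in $a$ moves $\sigma_0$ by less than $3\times$.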

The empirical question is whether this drift is small enough to ignore in practice. The answer the paper carefully argues — and then demonstrates twice over — is yes.

Does it actually work?

Small-scale: optima align across MoE settings

In controlled language-model and diffusion-transformer proxy sweeps, the loss vs. learning rate curves stay tightly aligned across activated-expert counts — with only the minor drift the theory predicts.

Figure: Learning-rate sweeps across activated-expert counts at fixed total experts and per-expert width, for language-model (left) and diffusion-transformer (right) proxies. The optimal LR region is broad and relatively stable across `a`, with only the minor drift the theory predicts.

Small-scale: fixed-LR scaling works across MoE axes

The more direct test: fix all AdamW hyperparameters at the dense-tuned values (LM: $\text{LR}=10^{-3}$, init std $=10^{-2}$, $\text{WD}=0.1$; Diffusion: $\text{LR}=1.6\times10^{-3}$, init std $=2\times10^{-2}$, $\text{WD}=10^{-2}$), then scale only the MoE architecture along four axes. Activated experts, total capacity, granularity, and depth all give consistent loss reductions for both modalities — no per-setting retuning.

Figure: Fixed-hyperparameter loss scaling across MoE axes for the language-model (LM) and diffusion (DF) proxies. Four columns of paired panels (activated experts, total capacity, granularity, layer depth) at one fixed AdamW setting per modality. Each axis gives consistent loss improvement without per-setting retuning. This is the most direct empirical evidence for the "tune dense once, transfer to all" recipe at small scale.

One pattern stands out across the four panels: capacity scaling (more total experts at fixed activated count) drives substantially lower loss than granularity scaling at the same fixed hyperparameters. Among the MoE quality knobs, total-expert count is the strongest lever per unit of training compute — which, conveniently, is also the cheapest one to scale, as we'll see in the systems benchmark below.

Large-scale: small dense calibration → large MoE deployment, five training regimes

The point of the recipe is the asymmetry: tuning is done on a small dense proxy, deployment is at large MoE scale. All sweeps that produced the calibrated hyperparameters were run at proxy backbone width $d_\star = 128$ on short runs (25k–100k steps). The large-scale runs then use Complete-muE to transfer to backbone width $d = 1024$ (8× wider, ~6.3B total MoE parameters with ~0.62B active) — a setting that would be enormously expensive to sweep directly.

At this scale, all four multimodal diffusion regimes — $256$P images, $512$P images, $240$P key frames, and $240$P $5$s videos — share one hyperparameter setting from the small-dense calibration ($\text{LR} = 2.26 \times 10^{-3}$, $\text{WD} = 0.01$). The LM training run uses one other ($\text{LR} = 5\times 10^{-4}$, $\text{WD} = 0.05$). Each MoE configuration is compared to its dense baseline at those same hyperparameters.

Every setting delivers consistent MoE-over-dense convergence speedup, with no per-setting retuning:

$\sim\!2.5\times$ on $256$P images
$\sim\!4.5\times$ on $240$P $5$s videos
$\sim\!5.3\times$–$5.5\times$ on LLM training (100k iterations)

This is the strongest form of empirical evidence the paper offers: cheap small-dense sweep → expensive large MoE training, across five very different landscapes, all delivering consistent gains.

Figure: Large-scale 240P 5s video diffusion training, dense vs. MoE. Left: training loss curves. Right: convergence speedup, measured as dense steps divided by MoE steps needed to reach the same loss during the stable-LR phase of WSD. The MoE variants reach roughly 4.5× speedup at large scale, using the same dense-tuned hyperparameter setting that was also used for the 256P, 512P, and 240P key-frame regimes.
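The speedup metric in the caption (dense steps divided by MoE steps to reach the same loss) is easy to compute from two logged loss curves. The sketch below uses synthetic curves purely to exercise the function; they are not the paper's data, and a real measurement would first smooth the curves and restrict them to the stable-LR phase.

```python
def steps_to_reach(losses, target):
    """First 1-based step at which the loss is at or below target, else None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None

def convergence_speedup(dense_losses, moe_losses, target):
    """Dense steps divided by MoE steps needed to reach the same target loss."""
    dense_steps = steps_to_reach(dense_losses, target)
    moe_steps = steps_to_reach(moe_losses, target)
    if dense_steps is None or moe_steps is None:
        return None  # one of the runs never reaches the target
    return dense_steps / moe_steps

# Synthetic, monotonically decreasing curves (illustration only).
dense_curve = [1000.0 / (100 + t) for t in range(1, 501)]
moe_curve = [1000.0 / (100 + 4 * t) for t in range(1, 501)]

speedup = convergence_speedup(dense_curve, moe_curve, target=6.0)
print(f"speedup at loss 6.0: {speedup:.2f}x")
```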

Figure: Large-scale LLM training (100k iterations), dense vs. MoE variants `128e8a4g1s` and `128e8a1s` at one shared hyperparameter setting (LR = $5\times10^{-4}$, WD = $0.05$). Left: smoothed training loss. Middle: validation loss on C4. Right: convergence speedup vs. dense baseline. The MoE variants reach 5.3–5.5× convergence speedup over dense.

Downstream benchmarks confirm the loss gain

The training-loss advantage translates into downstream-task quality. After the same 100k-step LM training, both MoE variants substantially improve on the dense baseline's average across 13 LM benchmarks:

Large-scale dense vs. MoE benchmark evaluation (5-shot, percentages). **Bold** = best per column.
Model	SVAMP	MMLU	ARC-Easy	ARC-Chall.	COPA	PIQA	HellaSwag
Dense	7.0	25.4	64.6	33.6	70.0	72.6	56.5
MoE `128e8a4g1s`	9.7	25.9	71.2	**43.6**	80.0	77.7	69.3
MoE `128e8a1s`	**10.0**	**26.2**	**72.9**	43.4	**85.0**	**77.8**	**69.4**
Model	WinoGrande	LAMBADA	BoolQ	AGIEval-RC	AGIEval-LR	AGIEval-SAT	Average
Dense	59.0	50.5	62.6	20.9	**27.3**	26.2	44.3
MoE `128e8a4g1s`	65.2	61.9	63.4	22.8	27.1	20.9	49.1
MoE `128e8a1s`	**66.5**	**63.4**	**63.9**	**27.2**	24.3	**27.2**	**50.6**

Average score: 44.3 → 49.1 for the group-balanced variant and 44.3 → 50.6 for the non-grouped variant — a +4.8 to +6.3 point improvement on average, again at the dense-tuned hyperparameters.
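The averages and gains quoted above can be recomputed directly from the table. In this sketch the scores are copied from the two table halves in column order, and the gains are taken between the rounded averages, which is how the text's +4.8 and +6.3 are obtained.

```python
# Per-benchmark scores in table order: SVAMP ... HellaSwag, WinoGrande ... AGIEval-SAT.
scores = {
    "Dense": [7.0, 25.4, 64.6, 33.6, 70.0, 72.6, 56.5,
              59.0, 50.5, 62.6, 20.9, 27.3, 26.2],
    "MoE 128e8a4g1s": [9.7, 25.9, 71.2, 43.6, 80.0, 77.7, 69.3,
                       65.2, 61.9, 63.4, 22.8, 27.1, 20.9],
    "MoE 128e8a1s": [10.0, 26.2, 72.9, 43.4, 85.0, 77.8, 69.4,
                     66.5, 63.4, 63.9, 27.2, 24.3, 27.2],
}

# Average over the 13 benchmarks, rounded to one decimal as in the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Gains are differences between the rounded averages.
gains = {name: round(avg - averages["Dense"], 1)
         for name, avg in averages.items() if name != "Dense"}

print(averages)
print(gains)
```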

Bonus: large MoEs train at near-dense cost

The other half of the practical win is that MoE training is cheap, especially along the capacity axis that gave the biggest loss gain above. On a single H100 80GB GPU, scaling from $8$ to $256$ total experts (at a fixed $8$ activated) raises step latency only from $87.8$ to $97.0$ ms/step — just $1.08\text{-}1.20\times$ the $81.0$ ms dense SwiGLU baseline. So you can grow MoE parameter count by an order of magnitude while keeping training cost essentially flat.

For context, reaching comparable parameter count by widening a dense model instead would push step latency from $81.0$ to $667.6$ ms ($8\times$ slower) before larger sizes simply run out of memory. The same comparison goes for granularity: fine-grained partitioning ($k = 2 \to 64$) raises latency from $84.1$ to $135.0$ ms/step — not free, but still far cheaper than dense width scaling.
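The ratios quoted in this section follow from the raw latencies. A small sketch, using only the millisecond figures given above (the variable names are illustrative):

```python
dense_ms = 81.0  # dense SwiGLU baseline, ms/step

# Capacity axis: 8 -> 256 total experts at 8 activated.
capacity_ms = {"8 experts": 87.8, "256 experts": 97.0}
capacity_ratio = {k: round(v / dense_ms, 2) for k, v in capacity_ms.items()}

# Dense width scaling to a comparable parameter count.
wide_dense_ratio = round(667.6 / dense_ms, 2)

# Granularity axis: k = 2 -> 64.
granularity_ratio = round(135.0 / 84.1, 2)

print(capacity_ratio)
print(wide_dense_ratio)
print(granularity_ratio)
```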

Bottom line: large MoEs trained via capacity scaling are nearly free relative to dense, and Complete-muE makes them tunable for free, too. Both halves of the win compound.

Takeaways

Small dense sweep, large MoE deployment. A single small-dense-proxy sweep is enough — no architecture-specific re-tuning at large scale.
Two bridges, everything else by composition. Bridge I (active-width $\mu$P) + Bridge II (per-expert workload) cover the whole MoE design space without new primitives.
The math is honest about its limits. The SDE cancellation is first-order, with a bounded second-order $\sigma_0$ drift. Small-scale sweeps and large-scale runs both confirm this drift is small enough in practice.
Practical wins are real and measurable. $4.5\times$ on $240$P $5$s video and $5.3\text{-}5.5\times$ on LLM, all at the dense baseline's hyperparameters.

Why this matters

For teams training large MoE models, the cost of "what learning rate should I use for this new MoE configuration?" has historically been a hyperparameter sweep at scale. Complete-muE replaces that sweep with one table lookup. The savings compound every time a team tries a new sparsity, granularity, or hybrid routing variant.

For researchers, the contribution is conceptual: capacity, granularity, and hybrid routing aren't new primitive hyperparameter-transfer rules — they're compositions of two well-understood bridge cases. That decomposition is the part most likely to outlive any specific architecture trend.

The paper

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models — Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang (Adobe Research). Public link forthcoming.

Citation

Please cite this work as:

Peng, Hongwu and Dibua, Ohiremen and Xiong, Yuanjun and Gong, Yifan and Zhang, Jianming and Kang, Yan, "Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models", Adobe Research, 2026.

Or use the BibTeX citation:

@article{peng2026completemue,
  title={Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models},
  author={Peng, Hongwu and Dibua, Ohiremen and Xiong, Yuanjun and Gong, Yifan and Zhang, Jianming and Kang, Yan},
  journal={arXiv preprint},
  year={2026}
}

Questions? Email hongwup@adobe.com.

References

Foundational works Complete-muE builds on: $\mu$P-style transfer, SDE-based optimizer scaling, MoE-specific transfer, and the broader sparse-MoE literature. The full list follows.

Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., & Gao, J. (2022). Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. NeurIPS. arXiv:2203.03466.
Dey, N. S., Zhang, B. C., Noci, L., Li, M., Bordelon, B., Bergsma, S., Pehlevan, C., Hanin, B., & Hestness, J. (2025). Don't be lazy: CompleteP enables compute-efficient deep transformers. NeurIPS. arXiv:2505.01618.
Mlodozeniec, B., Ablin, P., Béthune, L., Busbridge, D., Klein, M., Ramapuram, J., & Cuturi, M. (2025). Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration. arXiv:2512.22382.
Malladi, S., Lyu, K., Panigrahi, A., & Arora, S. (2022). On the SDEs and scaling rules for adaptive gradient algorithms. NeurIPS. arXiv:2205.10287.
Małaśnicki, J., Ciebiera, K., Boruń, M., Pióro, M., Ludziejewski, J., Stefaniak, M., Krutul, M., Jaszczur, S., Cygan, M., Adamczewski, K., et al. (2025). Mu-Parametrization for Mixture of Experts. ES-FoMo III Workshop. arXiv:2506.16962.
Jiang, T., Bordelon, B., Pehlevan, C., & Hanin, B. (2026). Hyperparameter Transfer with Mixture-of-Expert Layers. arXiv preprint.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR. arXiv:1701.06538.
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR. arXiv:2101.03961.
Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. ACL. arXiv:2401.06066.
Zheng, C., Lou, A., Liu, C., Wei, X., Liu, Z., & Ermon, S. (2025). Scaling Diffusion Transformers Efficiently via $\mu$P. NeurIPS. arXiv:2505.15270.
DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
Orvieto, A., De, S., Gulcehre, C., Pascanu, R., & Smith, S. L. (2025). In Search of Adam's Secret Sauce. NeurIPS. arXiv:2510.08198.