Muon Optimizer Significantly Accelerates Grokking in Transformers: Microsoft Researchers Explore the Impact of Optimizers on Delayed Generalization

Revisiting the Grokking Challenge
In recent years, grokking, the phenomenon in which deep learning models show a sudden, delayed transition from memorization to generalization, has prompted renewed study of training dynamics. First observed on small algorithmic tasks such as modular arithmetic, grokking describes models that reach near-perfect training accuracy while validation performance remains poor for a long time, until the model, often abruptly, begins to generalize. Understanding what controls this transition matters not only for interpretability but also for training efficiency in deep networks. Previous studies have highlighted the roles of weight decay and regularization; however, the specific influence of the optimizer on this process has remained underexplored.
Investigating the Impact of Optimizers on Grokking
This AI paper from Microsoft examines the impact of optimizer choice on grokking behavior. Specifically, it compares the widely adopted AdamW optimizer with Muon, a newer optimization algorithm that combines spectral-norm constraints with second-order information. The study investigates whether these characteristics enable Muon to accelerate the onset of generalization.
The experiments cover seven algorithmic tasks (mainly modular arithmetic operations and parity classification) using a modern transformer architecture. Each task is designed to reliably exhibit grokking under appropriate training conditions. The study also includes a comparative analysis of softmax variants (standard Softmax, Stablemax, and Sparsemax) to evaluate whether output normalization plays a secondary role in shaping training dynamics. The core investigation, however, centers on the optimizer.
Architecture and Optimizer Design
The base model adopts standard transformer components implemented in PyTorch, including multi-head self-attention, rotary positional embeddings (RoPE), RMS normalization, SiLU activations, and dropout-based regularization. Input tokens (digits or operators) are encoded with a simple identity embedding.
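The paper's exact implementation is not reproduced here, but a minimal PyTorch sketch of the described components (RMSNorm, a pre-norm attention block, a SiLU feed-forward layer, and dropout) could look like the following. The module layout, dimensions, and dropout rate are illustrative assumptions, and the rotary embeddings applied to queries and keys are omitted for brevity.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean-centering, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class TransformerBlock(nn.Module):
    """Pre-norm block: RMSNorm -> self-attention -> RMSNorm -> SiLU MLP.

    RoPE would normally be applied to queries and keys inside the attention
    call; that detail is omitted in this sketch.
    """
    def __init__(self, dim=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * dim, dim),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(attn_out)
        x = x + self.drop(self.mlp(self.norm2(x)))
        return x
```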
The key difference lies in the behavior of the two optimizers:
- AdamW, the baseline in contemporary deep learning workflows, uses adaptive per-parameter learning rates together with decoupled weight decay.
- Muon, in contrast, applies orthogonalized gradient updates, enforces spectral-norm constraints to stabilize training, and approximates second-order curvature to obtain more informative updates.
These mechanisms are intended to promote broader exploration during optimization, mitigate instabilities such as "Softmax collapse," and keep learning progress synchronized across layers. Muon's ability to scale update magnitudes to layer dimensions is particularly important for avoiding inefficient memorization pathways.
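To make the mechanism concrete, here is a heavily simplified sketch of a Muon-style update for a single 2-D weight matrix: momentum is accumulated on the gradient, the momentum matrix is approximately orthogonalized with a Newton-Schulz iteration, and the step is rescaled by the layer's aspect ratio. The coefficients and hyperparameters follow public Muon implementations and should be read as assumptions rather than the authors' exact settings.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix via a Newton-Schulz iteration.

    The quintic coefficients below are taken from public Muon implementations
    and are illustrative constants, not values confirmed by the paper.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + eps)          # bring the spectral norm near/below 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_style_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style update for a 2-D weight matrix.

    Momentum is accumulated on the raw gradient, the momentum matrix is
    orthogonalized, and the step is rescaled by the layer's aspect ratio so
    that layers of different shapes make comparable progress.
    """
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.data.add_(update, alpha=-lr * scale)
    return momentum_buf
```

In full Muon implementations, an update of this kind is typically applied only to 2-D weight matrices, while embeddings and 1-D parameters fall back to AdamW-style updates.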
Three softmax configurations (Softmax, Stablemax, and Sparsemax) are included to evaluate whether numerical stability or sparsity of the output distribution affects grokking. This helps ensure that the observed effects derive mainly from optimizer dynamics rather than from nuances of the output activation.
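For reference, the three output normalizations could be sketched as follows. The standard softmax and Sparsemax follow their published definitions; the Stablemax form shown here (a piecewise replacement for the exponential) is an assumption and may differ from the variant used in the study.

```python
import torch

def stable_softmax(z):
    """Standard softmax over the last dim, with max-subtraction for stability."""
    z = z - z.max(dim=-1, keepdim=True).values
    e = z.exp()
    return e / e.sum(dim=-1, keepdim=True)

def stablemax(z):
    """Stablemax-style variant (assumed form: exp replaced by a function that
    is linear for non-negative logits and 1/(1 - x) for negative ones)."""
    s = torch.where(z >= 0, z + 1.0, 1.0 / (1.0 - z))
    return s / s.sum(dim=-1, keepdim=True)

def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): projects the logits onto the
    probability simplex and can assign exactly zero probability mass."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(-1)
    support = (1 + k * z_sorted) > cumsum        # sorted logits that stay active
    k_max = support.sum(dim=-1, keepdim=True).to(z.dtype)
    tau = (cumsum.gather(-1, k_max.long() - 1) - 1) / k_max
    return torch.clamp(z - tau, min=0)
```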
Empirical Evaluation and Results
The empirical protocol was designed methodically. Each optimizer-softmax-task combination is evaluated across multiple seeds to ensure statistical robustness. Grokking is operationally defined as the first epoch at which validation accuracy exceeds 95%, after training accuracy has already stabilized.
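A hypothetical helper that applies this definition to per-epoch accuracy logs might look as follows; the 99% training-accuracy threshold used to decide that training has "stabilized" is an illustrative assumption, not a value taken from the paper.

```python
def grokking_epoch(train_acc, val_acc, train_threshold=0.99, val_threshold=0.95):
    """Return the first epoch (0-indexed) at which validation accuracy exceeds
    `val_threshold` after training accuracy has already reached
    `train_threshold`; returns None if the model never groks.

    Both thresholds are illustrative; the paper's exact criterion may differ.
    """
    train_stable = False
    for epoch, (tr, va) in enumerate(zip(train_acc, val_acc)):
        train_stable = train_stable or tr >= train_threshold
        if train_stable and va >= val_threshold:
            return epoch
    return None
```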
The results show a consistent and statistically significant advantage for Muon. On average, Muon reached the grokking threshold at epoch 102.89, versus 153.09 for AdamW. The difference is not only numerically large but also statistically strong (t = 5.0175, p ≈ 6.33e-8). Moreover, the distribution of grokking epochs under Muon is tighter across all conditions, suggesting more predictable training trajectories.
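The reported significance corresponds to a standard two-sample t-test over per-seed grokking epochs; a sketch of how such a comparison would be run with SciPy is shown below. The epoch lists are hypothetical placeholders, not the paper's raw data.

```python
from scipy import stats

# Hypothetical per-seed grokking epochs for each optimizer
# (placeholders for illustration, not the paper's measurements).
muon_epochs  = [98, 101, 105, 110, 100]
adamw_epochs = [149, 152, 158, 160, 147]

t_stat, p_value = stats.ttest_ind(muon_epochs, adamw_epochs)
print(f"t = {t_stat:.4f}, p = {p_value:.2e}")
```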
All experiments were run on NVIDIA H100 GPUs using a single unified codebase and standardized configurations. Tasks include modular addition, multiplication, division, exponentiation, GCD, and a 10-bit parity task. Dataset sizes range from 1,024 to 9,409 examples, with the train/validation split adjusted per task to maintain consistency.
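As an illustration (an assumption about the construction, not a detail taken from the paper), a modular-arithmetic dataset over a prime modulus p = 97 yields exactly 97² = 9,409 example pairs, matching the largest dataset size quoted above, while 10-bit parity yields 2¹⁰ = 1,024 inputs. A sketch of such a dataset builder:

```python
import random

def modular_addition_dataset(p=97, train_frac=0.5, seed=0):
    """Build all (a, b) -> (a + b) mod p examples and split them into
    train/validation sets; p=97 and the 50/50 split are illustrative choices."""
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]

train_set, val_set = modular_addition_dataset()
print(len(train_set), len(val_set))   # 4704 4705 for p=97
```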
Conclusion
These findings provide strong evidence that optimizer geometry significantly affects the emergence of generalization in over-parameterized models. By steering the optimization path through second-order-aware updates and spectral-norm constraints, Muon appears to help the model discover the underlying data structure more directly, bypassing the extended overfitting phase.
The study highlights a broader need to treat optimization strategy as a first-class factor in neural network training design. While earlier work emphasized data and regularization, these results suggest that the design of the optimizer itself can play a decisive role in shaping training dynamics.
Check out the paper for further details.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
