MARS-M: When Variance Reduction Meets Matrices
Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs).
On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction.
In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon.
Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of \(\tilde{\mathcal{O}}(T^{-1/3})\), which improves upon the \(\tilde{\mathcal{O}}(T^{-1/4})\) rate attained by Muon.
Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks.
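To make the combination concrete, the sketch below illustrates one plausible shape of a MARS-M step: a MARS-style variance-reduced gradient correction feeds the momentum buffer, and the momentum matrix is orthogonalized via Muon's Newton-Schulz iteration before the weight update. This is a minimal illustration under stated assumptions, not the paper's implementation; the function names, the hyperparameter values, and the exact placement of the correction are assumptions (the quintic coefficients are those used in the public Muon implementation).

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    # Approximate the orthogonal factor of M via the quintic
    # Newton-Schulz iteration used by Muon; coefficients from the
    # public Muon implementation. M is normalized first so its
    # singular values lie in the iteration's basin of attraction.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def mars_m_step(W, g, g_prev, m, lr=0.02, beta=0.95, gamma=0.025):
    # Hypothetical MARS-M update (a sketch, not the paper's algorithm):
    # 1) MARS-style variance-reduced correction of the raw gradient,
    # 2) standard momentum accumulation,
    # 3) Muon-style orthogonalized weight update.
    c = g + gamma * (beta / (1.0 - beta)) * (g - g_prev)  # corrected gradient
    m = beta * m + (1.0 - beta) * c                       # momentum buffer
    W = W - lr * newton_schulz_orthogonalize(m)           # orthogonalized step
    return W, m
```

In this reading, the variance reduction acts on the gradient estimate before momentum, so the matrix-level machinery of Muon (momentum plus orthogonalization) is left unchanged; only its input is replaced by the corrected gradient.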