MARS
MARS (Make vAriance Reduction Shine) is a unified optimization framework designed to address the inherent challenges of training large models.
Traditional adaptive gradient methods like Adam and AdamW often suffer from high stochastic gradient variance, while variance reduction techniques have struggled to gain practical impact in deep learning.
At its core, MARS comprises two major components: (1) a scaled stochastic recursive momentum, which provides a variance-reduced estimator of the full gradient for better gradient complexity; and (2) a preconditioned update, which approximates second-order Newton's method for better per-iteration complexity.
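As a sketch of component (1), with notation that is my own shorthand rather than quoted from the source: writing the stochastic gradient of the loss f on minibatch ξ_t as ∇f(x_t, ξ_t), the scaled stochastic recursive momentum corrects the current gradient with a scaled difference of gradients taken at consecutive iterates on the same minibatch,

```latex
c_t = \nabla f(x_t, \xi_t)
      + \gamma_t \, \frac{\beta_1}{1 - \beta_1} \,
        \bigl( \nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t) \bigr),
```

where γ_t controls the strength of the variance-reduction correction. The corrected estimate c_t then replaces the raw gradient inside the usual momentum and preconditioner buffers.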
By combining preconditioned gradient methods with variance reduction, MARS achieves the best of both worlds, accelerating convergence to critical points.
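To make the two components concrete, below is a minimal sketch of a single MARS-style parameter update in PyTorch, assuming an AdamW-like diagonal preconditioner for component (2). The function name `mars_adamw_step`, the hyperparameter defaults, and the buffer layout are all illustrative assumptions, not the project's actual API. Note that the exact estimator re-evaluates the previous iterate on the current minibatch; this sketch reuses the stored previous gradient as an approximation.

```python
import torch

def mars_adamw_step(param, grad, prev_grad, m, v, step,
                    lr=3e-3, beta1=0.95, beta2=0.99,
                    gamma=0.025, eps=1e-8, weight_decay=0.1):
    """One MARS-style update for a single parameter tensor (illustrative sketch).

    Combines (1) the scaled stochastic recursive momentum and
    (2) an AdamW-like diagonal preconditioner.
    """
    # (1) Variance-reduced gradient estimator:
    #     c_t = g_t + gamma * beta1 / (1 - beta1) * (g_t - g_{t-1}),
    #     approximating g_{t-1} by the gradient stored from the previous step.
    c = grad + gamma * (beta1 / (1.0 - beta1)) * (grad - prev_grad)

    # Rescale the corrected gradient to unit norm if it grows too large,
    # which keeps the variance-reduction correction from destabilizing training.
    c_norm = torch.linalg.vector_norm(c)
    if c_norm > 1.0:
        c = c / c_norm

    # Exponential moving averages of c and c^2, as in Adam,
    # but driven by the corrected gradient instead of the raw one.
    m.mul_(beta1).add_(c, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(c, c, value=1.0 - beta2)

    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # (2) Preconditioned update with decoupled weight decay (AdamW-style).
    param.mul_(1.0 - lr * weight_decay)
    param.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
```

In usage, the caller would keep `prev_grad` between steps (e.g., `prev_grad = grad.clone()` after each call). The key design point is that the clipped, variance-reduced `c` replaces the raw gradient everywhere the ordinary AdamW moments would consume it, so the preconditioning machinery is untouched while the estimator it sees has lower variance.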