weight decay

Traditional gradient descent
$$
\mathbf{x}_{t+1}=\mathbf{x}_{t}-\alpha \nabla f_{t}(\mathbf{x}_t)
$$
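
As a minimal sketch of one such step (NumPy; the quadratic toy loss is an illustrative assumption, not from the original):

```python
import numpy as np

def gd_step(x, grad, lr):
    """Plain gradient descent: x_{t+1} = x_t - alpha * grad_f(x_t)."""
    return x - lr * grad

# Toy example: f(x) = 0.5 * ||x||^2, so grad_f(x) = x.
x = np.array([1.0, -2.0])
for _ in range(50):
    x = gd_step(x, grad=x, lr=0.1)
print(x)  # converges toward the minimizer [0, 0]
```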

A drawback is that nothing in this update constrains the magnitude of the weights, so the network can overfit the training data. Weight decay addresses this by shrinking the weights toward zero at every step.

weight decay

In the weight decay described by Hanson & Pratt (1988),
the weights $\mathbf{x}$ decay exponentially as
$$
\mathbf{x}_{t+1}=(1-w) \mathbf{x}_{t}-\alpha \nabla f_{t}(\mathbf{x}_t)
$$

where $w$ defines the rate of the weight decay per step and
$\nabla f_{t}(\mathbf{x}_t)$ is the $t$-th batch gradient to be multiplied by a learning rate $\alpha$.
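
A minimal sketch of this rule (parameter names are illustrative):

```python
def weight_decay_step(x, grad, lr, wd):
    """Hanson & Pratt (1988) weight decay:
    x_{t+1} = (1 - w) * x_t - alpha * grad_f(x_t).
    Even with a zero gradient, x shrinks by a factor (1 - w) each step.
    """
    return (1.0 - wd) * x - lr * grad
```

With `wd = 0` this reduces to the plain gradient descent step above.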

L2 regularization vs. weight decay

Commonly, weight decay is implemented instead as L2 regularization: a penalty on the squared norm of the weights is added to the loss, here with coefficient $w'$ to distinguish it from the decay rate $w$:

$$
f_{t}^{reg}(\mathbf{x}_{t}) = f_{t}(\mathbf{x}_t)+\frac{w'}{2} \left \| \mathbf{x}_t \right \|_{2}^{2}
$$
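
For plain gradient descent the two views coincide. Differentiating the regularized loss gives

$$
\nabla f_{t}^{reg}(\mathbf{x}_t)=\nabla f_{t}(\mathbf{x}_t)+w' \mathbf{x}_t
$$

and substituting this into the gradient descent update yields

$$
\mathbf{x}_{t+1}=\mathbf{x}_t-\alpha\left(\nabla f_{t}(\mathbf{x}_t)+w' \mathbf{x}_t\right)=(1-\alpha w') \mathbf{x}_t-\alpha \nabla f_{t}(\mathbf{x}_t),
$$

which is exactly the weight decay update with $w=\alpha w'$. The equivalence holds for SGD but not for adaptive gradient methods such as Adam, where the $w' \mathbf{x}_t$ term inside the gradient gets rescaled by the per-parameter adaptive learning rates.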

Decoupling the Weight Decay from the gradient-based update
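
In Adam, the L2 gradient term $w' \mathbf{x}_t$ is divided by the same adaptive denominator as the batch gradient, so weights with large historical gradients are regularized less than true weight decay would prescribe. Loshchilov & Hutter (2019) therefore decouple the decay: the gradient-based update uses only $\nabla f_t(\mathbf{x}_t)$, and the decay is applied to the weights separately. A simplified AdamW-style step (a sketch under the notation above, omitting learning-rate schedules):

```python
import numpy as np

def adamw_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One decoupled-weight-decay (AdamW-style) step.
    The adaptive update sees only the batch gradient; the decay term
    lr * wd * x is applied directly to x and is never rescaled by
    the adaptive denominator sqrt(v_hat) + eps.
    """
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * x
    return x, m, v
```

By contrast, an L2-regularized Adam would fold `wd * x` into `grad` before the moment updates, coupling the decay to the adaptive scaling.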

References

1. S. J. Hanson and L. Y. Pratt. Comparing biases for minimal network construction with back-propagation. In Advances in Neural Information Processing Systems (NIPS), 1988.
2. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.