Yogi Optimizer New!
Training GANs is a balancing act. The discriminator and generator often produce wildly fluctuating gradient magnitudes. Practitioners have reported that Yogi reduces mode collapse and produces higher quality samples because it prevents the optimizer from "forgetting" rare gradient features.
of the effective learning rate, providing better convergence guarantees. Key Improvements Over Adam Adaptive Learning Rate Control
Or, in its practical implementation: $$v_t = v_t-1 + (1 - \beta_2) \cdot \textsign(g_t^2 - v_t-1) \cdot g_t^2$$ yogi optimizer
Yogi is frequently used in complex deep learning tasks that require high stability, such as: Biometrics
$$m_t = \beta_1 m_t-1 + (1 - \beta_1) g_t$$ $$v_t = v_t-1 - (1 - \beta_2) \cdot \textsign(v_t-1 - g_t^2) \cdot g_t^2$$ (Note: Some implementations use $v_t = v_t-1 + (1 - \beta_2) \cdot \textsign(g_t^2 - v_t-1) \cdot g_t^2$ for readability) $$\hatm_t = m_t / (1 - \beta_1^t)$$ $$\theta_t+1 = \theta_t - \eta \cdot \hatm_t / (\sqrtv_t + \epsilon)$$ Training GANs is a balancing act
Where $g_t$ is the gradient at time $t$ and $\beta_2$ is a decay rate. The problem arises when the gradients are large and sparse. Adam adds the new squared gradient to the running average. If the running average is small and a large gradient suddenly appears, Adam updates the average aggressively. In some cases, this prevents the algorithm from regulating the effective step size correctly, leading to sub-optimal convergence.
The crucial difference is in how Yogi handles the second moment estimator. Instead of simply adding the squared gradient, Yogi of the effective learning rate, providing better convergence
model = MyNeuralNet() optimizer = optim.Yogi( model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-3, initial_accumulator=1e-6 )
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
Without delving too deeply into the calculus, Adam’s update rule looks roughly like this for the second moment ($v_t$):
The Yogi Optimizer is not a hype-driven replacement for Adam; it is a mathematically rigorous evolution for adversarial and noisy environments. Keep Adam in your toolkit, but keep Yogi for when the training gets tough.
