Is clipnorm applied before or after momentum in Keras?

  keras, momentum, python, tensorflow, tf.keras

In Keras or TensorFlow, clipnorm rescales large "gradients" so that their norm does not exceed a given value, and clipvalue bounds every individual component of the "gradient".
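To make the two operations concrete, here is a minimal pure-Python sketch of what I understand them to do to a single gradient vector. The helper names `clip_by_norm` and `clip_by_value` and the plain-list representation are my own for illustration, not the Keras implementation.

```python
import math


def clip_by_norm(g, clipnorm):
    """Rescale g so that its L2 norm is at most clipnorm (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= clipnorm:
        return list(g)  # small gradients are left unchanged
    scale = clipnorm / norm
    return [x * scale for x in g]  # direction kept, length shrunk


def clip_by_value(g, clipvalue):
    """Clamp every component of g to [-clipvalue, clipvalue] (hypothetical helper)."""
    return [max(-clipvalue, min(clipvalue, x)) for x in g]


g = [3.0, -4.0]  # L2 norm is 5
by_norm = clip_by_norm(g, 1.0)    # roughly [0.6, -0.8]: same direction, norm 1
by_value = clip_by_value(g, 1.0)  # [1.0, -1.0]: direction changes
print(by_norm, by_value)
```

Note that clipping by norm preserves the direction of the vector while clipping by value generally does not.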

But what happens if you combine one of them with momentum or with an optimizer like Adam?

A) Is clipnorm applied on the actual pure mathematical gradient of the loss with respect to the parameters and then this clipped gradient is used to calculate the update step using the momentum of the old gradients and the learning rate?

velocity = momentum * velocity - learning_rate * clipnorm(g)
w = w + velocity


B) First the momentum of the old gradients is combined with the unmodified new gradient. Then the resulting vector (the "velocity") gets scaled by clipnorm.

velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + velocity

or B’)

velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + clipnorm(velocity)

A further possibility would be A’)

velocity = momentum * velocity - clipnorm(learning_rate * g)
w = w + velocity


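A (and A’) would suffer from the problem that, even though the norm of each gradient is bounded, the velocity can still grow well beyond that bound because momentum accumulates successive clipped gradients (for momentum < 1, up to roughly learning_rate * clipnorm / (1 - momentum)). Clipping would then also make it slower to break down the velocity or change its direction.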
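The difference between A and B can be checked with a small simulation. This is only a sketch under my own assumptions: a constant large gradient, plain Python lists, and hypothetical helpers `step_a` and `step_b` that follow the pseudocode above. It says nothing about which variant Keras actually uses.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def clip_by_norm(v, max_norm):
    """Rescale v so that its L2 norm is at most max_norm."""
    n = l2_norm(v)
    return list(v) if n <= max_norm else [x * max_norm / n for x in v]


def step_a(velocity, g, lr, momentum, max_norm):
    """Variant A: clip the gradient, then apply the momentum update."""
    cg = clip_by_norm(g, max_norm)
    return [momentum * v - lr * c for v, c in zip(velocity, cg)]


def step_b(velocity, g, lr, momentum, max_norm):
    """Variant B: apply the momentum update, then clip the velocity."""
    raw = [momentum * v - lr * x for v, x in zip(velocity, g)]
    return clip_by_norm(raw, max_norm)


lr, momentum, max_norm = 1.0, 0.9, 1.0
g = [30.0, 40.0]  # constant gradient with norm 50, far above max_norm
va, vb = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    va = step_a(va, g, lr, momentum, max_norm)
    vb = step_b(vb, g, lr, momentum, max_norm)

# Under A the velocity norm approaches lr * max_norm / (1 - momentum),
# which is 10 here; under B it stays at or below max_norm.
print(l2_norm(va), l2_norm(vb))
```

With these settings the velocity under A ends up about ten times larger than the clipping threshold, while under B it is capped at the threshold by construction.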

From my perspective B would be the most reasonable, but I don’t know how it is actually implemented.

The same question can be asked analogously for clipvalue, and for Adam and other momentum-based algorithms.

Source: Python Questions