In Keras or TensorFlow, `clipnorm` rescales large gradients to have a specific norm, and `clipvalue` bounds all the values of the gradient. But what happens if you combine one of them with `momentum` or something like `adam`?
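To make the two clipping modes concrete, here is a NumPy sketch of their documented behaviour (this mimics what the Keras arguments do, it is not the Keras source):

```python
import numpy as np

g = np.array([3.0, 4.0])  # gradient with L2 norm 5

# clipnorm=1.0: rescale the whole vector so its L2 norm is at most 1.
clipnorm = 1.0
norm = np.linalg.norm(g)
g_norm_clipped = g * (clipnorm / norm) if norm > clipnorm else g
print(g_norm_clipped)  # [0.6 0.8] -- direction preserved, norm exactly 1

# clipvalue=2.0: clamp each component independently to [-2, 2].
clipvalue = 2.0
g_value_clipped = np.clip(g, -clipvalue, clipvalue)
print(g_value_clipped)  # [2. 2.] -- components bounded, direction changed
```

Note that `clipnorm` preserves the gradient's direction while `clipvalue` can change it.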

A) Is `clipnorm` applied to the actual pure mathematical gradient of the loss with respect to the parameters, with this clipped gradient then used to calculate the update step via the momentum of the old gradients and the learning rate?

```
velocity = momentum * velocity - learning_rate * clipnorm(g)
w = w + velocity
```

or

B) First, the momentum of the old gradients is combined with the unmodified new gradient. Then the resulting vector (the "velocity") is scaled by `clipnorm`.

```
velocity = clipnorm(momentum * velocity - learning_rate * g)
w = w + velocity
```

or B’)

```
velocity = momentum * velocity - learning_rate * g
w = w + clipnorm(velocity)
```

or there would also be the possibility of A’)

```
velocity = momentum * velocity - clipnorm(learning_rate * g)
w = w + velocity
```

???

A (and A’) would suffer from the problem that, even though the norm of each incoming gradient is bounded, the velocity can still accumulate to many times that bound (up to a factor of 1/(1 − momentum)), and clipping the gradient would also make it slower to break down the velocity or change its direction.
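This accumulation under variant A can be shown numerically. In the sketch below, the constant raw gradient and the standalone `clipnorm` helper are made up for illustration; the point is only that the velocity norm settles well above the clip threshold's per-step contribution:

```python
import numpy as np

def clipnorm(g, max_norm=1.0):
    # Rescale g so its L2 norm is at most max_norm (clipnorm-style clipping).
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

momentum, lr = 0.9, 0.1
velocity = np.zeros(2)
g = np.array([30.0, 40.0])  # hypothetical constant raw gradient, norm 50

# Variant A: clip the gradient first, then feed it into the momentum update.
for _ in range(200):
    velocity = momentum * velocity - lr * clipnorm(g)

# Each clipped step contributes at most lr * 1.0 = 0.1 to the velocity,
# but the geometric series 1 + m + m^2 + ... inflates the steady-state
# norm to lr / (1 - momentum) = 1.0, ten times the per-step bound.
print(np.linalg.norm(velocity))  # ≈ 1.0
```

So under A the clip bounds each gradient, not the resulting step size.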

From my perspective, B would be the most reasonable, but I don’t know how it is actually implemented.

The same question can be asked analogously for `clipvalue`, and for `adam` and other momentum-based algorithms.
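To see that the ordering matters for `clipvalue` as well, here is a sketch of variants A and B for plain momentum in pure NumPy (the `clipvalue` helper and the numbers are made up; this is not the Keras implementation):

```python
import numpy as np

def clipvalue(g, v=0.5):
    # Bound every component of g to [-v, v] (clipvalue-style clipping).
    return np.clip(g, -v, v)

momentum, lr = 0.9, 1.0
g = np.array([2.0, -0.1])        # raw gradient with one large component
old_velocity = np.array([0.3, 0.3])

# Variant A: clip the raw gradient first, then apply momentum.
vel_a = momentum * old_velocity - lr * clipvalue(g)

# Variant B: apply momentum to the raw gradient, then clip the velocity.
vel_b = clipvalue(momentum * old_velocity - lr * g)

print(vel_a)  # [-0.23  0.37]
print(vel_b)  # [-0.5   0.37]
```

The two orderings produce different updates on the same data, which is why the question is not just academic.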

Source: Python Questions