This is an implementation of the SGDW optimizer described in "Decoupled Weight Decay Regularization" by Loshchilov & Hutter (https://arxiv.org/abs/1711.05101) ([pdf](https://arxiv.org/pdf/1711.05101.pdf)). It computes the update step of tf.keras.optimizers.SGD and additionally decays the variable. Note that this is different from adding L2 regularization on the variables to the loss. Decoupling the weight decay from other hyperparameters (in particular the learning rate) simplifies hyperparameter search. For further information see the documentation of the SGD Optimizer.

optimizer_decay_sgdw(
  weight_decay,
  learning_rate = 0.001,
  momentum = 0,
  nesterov = FALSE,
  name = "SGDW",
  clipnorm = NULL,
  clipvalue = NULL,
  decay = NULL,
  lr = NULL
)
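
For instance, a minimal construction call might look like the sketch below; the values are illustrative placeholders rather than recommended settings, and the example assumes the package providing optimizer_decay_sgdw is attached.

# Decoupled weight decay: the decay is applied in the update step itself,
# not added to the loss as an L2 penalty (e.g. via keras::regularizer_l2()).
opt = optimizer_decay_sgdw(
  weight_decay = 1e-4,
  learning_rate = 0.01,
  momentum = 0.9,
  nesterov = TRUE
)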

Arguments

weight_decay

Float hyperparameter >= 0. Weight decay rate.

learning_rate

float hyperparameter >= 0. Learning rate.

momentum

float hyperparameter >= 0 that accelerates SGD in the relevant direction and dampens oscillations.

nesterov

boolean. Whether to apply Nesterov momentum.

name

Optional name prefix for the operations created when applying gradients. Defaults to 'SGDW'.

clipnorm

Float. If set, gradients are clipped so that their norm does not exceed this value (see the sketch after this argument list).

clipvalue

Float. If set, gradients are clipped element-wise so that their values lie between -clipvalue and clipvalue.

decay

Included for backward compatibility to allow time-inverse decay of the learning rate.

lr

Included for backward compatibility; use learning_rate instead.
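
As an illustration of the clipping arguments, the sketch below constructs the optimizer with gradient clipping by norm; the cutoff of 1.0 and the other values are arbitrary placeholders, not recommendations.

# Hypothetical settings: each gradient's norm is clipped to at most 1.0
# before the SGDW update and decay are applied.
opt_clipped = optimizer_decay_sgdw(
  weight_decay = 1e-4,
  learning_rate = 0.01,
  clipnorm = 1.0
)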

Value

Optimizer for use with `keras::compile()`

Examples

if (FALSE) {
# Piecewise-constant schedule: multiply the base rates by 1, 0.1, and 0.01
# before step 10000, between steps 10000 and 15000, and afterwards.
step = tf$Variable(0L, trainable = FALSE)
schedule = tf$optimizers$schedules$PiecewiseConstantDecay(
  c(10000L, 15000L),
  c(1e-0, 1e-1, 1e-2)
)
lr = 1e-1 * schedule(step)
# Weight decay follows the same schedule; a function so it is re-evaluated each step.
wd = function() 1e-4 * schedule(step)
}
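
The schedule-driven rates above can then be handed to the optimizer and on to `keras::compile()`. The following is a minimal sketch, assuming `model` is a Keras model you have already defined, that the loss and metric are placeholders for your own choices, and that a tensor learning_rate and a callable weight_decay are accepted as in the example above.

if (FALSE) {
# lr and wd come from the schedule defined in the example above.
opt = optimizer_decay_sgdw(weight_decay = wd, learning_rate = lr, momentum = 0.9)
keras::compile(
  model,
  optimizer = opt,
  loss = "categorical_crossentropy",  # placeholder loss
  metrics = "accuracy"                # placeholder metric
)
}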