Stochastic Weight Averaging

optimizer_swa(
  optimizer,
  start_averaging = 0,
  average_period = 10,
  name = "SWA",
  sequential_update = TRUE,
  clipnorm = NULL,
  clipvalue = NULL,
  decay = NULL,
  lr = NULL
)

Arguments

optimizer

The original optimizer that will be used to compute and apply the gradients.

start_averaging

An integer. Threshold to start averaging using SWA. Averaging only begins once start_averaging iterations have been applied; must be >= 0. If start_averaging = m, the first snapshot will be taken after the m-th application of gradients (where the first iteration is iteration 0).

average_period

An integer. The synchronization period of SWA. Averaging occurs every average_period steps; the averaging period must be >= 1.

name

Optional name for the operations created when applying gradients. Defaults to 'SWA'.

sequential_update

Bool. If FALSE, the moving average is computed at the same time as the model is updated, potentially allowing benign data races. If TRUE, the moving average is updated after the gradient updates.

clipnorm

Gradients will be clipped when their L2 norm exceeds this value.

clipvalue

Gradients will be clipped when their absolute value exceeds this value.

decay

Included for backward compatibility to allow time-inverse decay of the learning rate.

lr

Included for backward compatibility; it is recommended to use learning_rate instead.

Value

Optimizer for use with `keras::compile()`

Details

The Stochastic Weight Averaging mechanism was proposed by Pavel Izmailov et al. in the paper [Averaging Weights Leads to Wider Optima and Better Generalization](https://arxiv.org/abs/1803.05407). The optimizer implements averaging of multiple points along the trajectory of SGD. It expects an inner optimizer, which it uses to apply the gradients to the variables, and itself computes a running average of the variables every k steps (which generally corresponds to the end of a cycle when a cyclic learning rate is employed).

The number of steps after which averaging should first begin can also be specified. Say we want averaging to happen every k steps after the first m steps: after step m a snapshot of the variables is taken, and the weights are then averaged at steps m + k, m + 2k, and so on. The assign_average_vars function can be called at the end of training to obtain the averaged weights from the optimizer.
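For example, with start_averaging = 10 and average_period = 5, the first snapshot is taken after step 10 and the weights are then averaged at steps 15, 20, 25, and so on. Below is a minimal sketch of wrapping an inner optimizer and compiling a model, assuming the tensorflow, keras and tfaddons packages are attached; `model` is assumed to be an already-defined Keras model, and the learning rate and loss are placeholder choices:

library(tensorflow)
library(keras)
library(tfaddons)

# inner optimizer that actually applies the gradients
opt <- tf$keras$optimizers$SGD(learning_rate = 0.01)

# take the first snapshot after step 10, then average the weights
# every 5 steps (i.e. at steps 15, 20, 25, ...)
opt <- optimizer_swa(opt, start_averaging = 10, average_period = 5)

# `model` is a placeholder for an already-defined keras model
model %>% compile(optimizer = opt, loss = "mse")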

Examples

if (FALSE) {
# wrap a plain SGD optimizer with SWA: take the first snapshot after
# step m, then average the weights every k steps
opt = tf$keras$optimizers$SGD(learning_rate)
opt = optimizer_swa(opt, start_averaging = m, average_period = k)
}
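After training, the averaged weights can be copied into the model using the assign_average_vars function mentioned in Details. The following is a hedged sketch: calling the method via $ on the optimizer object and passing model$variables mirrors the underlying TensorFlow Addons API, but the exact R-side call is an assumption; x_train and y_train are placeholder data.

if (FALSE) {
# train as usual with the SWA-wrapped optimizer, then assign the
# averaged weights back to the model's variables before evaluation
model %>% fit(x_train, y_train, epochs = 10)
opt$assign_average_vars(model$variables)  # assumed reticulate-style call
}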