Stochastic Weight Averaging
optimizer_swa(
  optimizer,
  start_averaging = 0,
  average_period = 10,
  name = "SWA",
  sequential_update = TRUE,
  clipnorm = NULL,
  clipvalue = NULL,
  decay = NULL,
  lr = NULL
)
| Argument | Description |
|---|---|
| optimizer | The original optimizer that will be used to compute and apply the gradients. |
| start_averaging | An integer. Threshold to start averaging using SWA. Averaging only starts after `start_averaging` iterations; must be >= 0. If `start_averaging = m`, the first snapshot will be taken after the m-th application of gradients (where the first iteration is iteration 0). |
| average_period | An integer. The synchronization period of SWA. The averaging occurs every `average_period` steps. The averaging period must be >= 1. |
| name | Optional name for the operations created when applying gradients. Defaults to "SWA". |
| sequential_update | Boolean. If FALSE, the moving average is computed at the same time as the model is updated, potentially allowing benign data races. If TRUE, the moving average is updated after the gradient updates. |
| clipnorm | Gradients will be clipped when their L2 norm exceeds this value. |
| clipvalue | Gradients will be clipped when their absolute value exceeds this value. |
| decay | Included for backward compatibility to allow time inverse decay of the learning rate. |
| lr | Included for backward compatibility; it is recommended to use `learning_rate` instead. |
Optimizer for use with `keras::compile()`
The Stochastic Weight Averaging mechanism was proposed by Pavel Izmailov et al. in the paper [Averaging Weights Leads to Wider Optima and Better Generalization](https://arxiv.org/abs/1803.05407). The optimizer implements averaging of multiple points along the trajectory of SGD. It expects an inner optimizer, which it uses to apply the gradients to the variables, and it computes a running average of the variables every k steps (which generally corresponds to the end of a cycle when a cyclic learning rate is employed). It is also possible to specify how many steps should pass before averaging begins. Say we want averaging to happen every k steps after the first m steps: after step m a snapshot of the variables is taken, and the weights are then averaged at steps m + k, m + 2k, and so on. The `assign_average_vars()` function can be called at the end of training to obtain the averaged weights from the optimizer.
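As a concrete illustration of this schedule (the values m = 100 and k = 10 are arbitrary), the iterations at which weight snapshots are taken can be listed directly:

m <- 100      # start_averaging: first snapshot after the 100th application of gradients
k <- 10       # average_period: average every 10 steps afterwards
m + k * 0:3   # snapshot iterations: 100 110 120 130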
if (FALSE) {
  library(tensorflow)
  learning_rate <- 0.01
  m <- 100  # start averaging after 100 gradient updates
  k <- 10   # averaging period
  opt <- tf$keras$optimizers$SGD(learning_rate)
  opt <- optimizer_swa(opt, start_averaging = m, average_period = k)
}
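Below is a slightly fuller sketch of how the pieces fit together with `keras`, including the final `assign_average_vars()` call mentioned above. The model, the training data (`x_train`, `y_train`), and all hyperparameter values are placeholders, and the sketch assumes the underlying SWA optimizer object's methods are reachable via `$`:

if (FALSE) {
  library(tensorflow)
  library(keras)

  # Placeholder model; layer sizes and input shape are illustrative only.
  model <- keras_model_sequential() %>%
    layer_dense(units = 16, activation = "relu", input_shape = c(8)) %>%
    layer_dense(units = 1)

  opt <- optimizer_swa(
    tf$keras$optimizers$SGD(learning_rate = 0.01),
    start_averaging = 100,
    average_period = 10
  )

  model %>% compile(optimizer = opt, loss = "mse")
  model %>% fit(x_train, y_train, epochs = 10)  # x_train / y_train are assumed to exist

  # Copy the running average of the weights back into the model after training.
  opt$assign_average_vars(model$variables)
}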