Monotonic attention mechanism with Bahdanau-style energy function.
attention_bahdanau_monotonic(
  object,
  units,
  memory = NULL,
  memory_sequence_length = NULL,
  normalize = FALSE,
  sigmoid_noise = 0,
  sigmoid_noise_seed = NULL,
  score_bias_init = 0,
  mode = "parallel",
  kernel_initializer = "glorot_uniform",
  dtype = NULL,
  name = "BahdanauMonotonicAttention",
  ...
)
Argument | Description
---|---
object | Model or layer object.
units | The depth of the query mechanism.
memory | The memory to query; usually the output of an RNN encoder. This tensor should be shaped [batch_size, max_time, ...].
memory_sequence_length | (optional) Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths.
normalize | Boolean. Whether to normalize the energy term.
sigmoid_noise | Standard deviation of pre-sigmoid noise. See the docstring for `_monotonic_probability_fn` for more information.
sigmoid_noise_seed | (optional) Random seed for pre-sigmoid noise.
score_bias_init | Initial value for the score bias scalar. It is recommended to initialize this to a negative value when the length of the memory is large.
mode | How to compute the attention distribution. Must be one of 'recursive', 'parallel', or 'hard'. See the docstring for tfa.seq2seq.monotonic_attention for more information.
kernel_initializer | (optional) The name of the initializer for the attention kernel.
dtype | The data type for the query and memory layers of the attention mechanism.
... | A list that contains other common arguments for layer creation.
name | Name to use when creating ops.
None
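A minimal usage sketch, not part of the original documentation: it assumes the tensorflow and tfaddons packages are attached, that the mechanism can be constructed standalone from named arguments (without supplying `object`), and that a random tensor stands in for the encoder output passed as `memory`.

```r
library(tensorflow)
library(tfaddons)

batch_size    <- 4L
max_time      <- 10L
encoder_units <- 32L

# Stand-in for an RNN encoder output, shaped [batch_size, max_time, ...].
memory         <- tf$random$normal(shape = list(batch_size, max_time, encoder_units))
memory_lengths <- tf$fill(list(batch_size), max_time)

attention_mechanism <- attention_bahdanau_monotonic(
  units = 16L,
  memory = memory,
  memory_sequence_length = memory_lengths,
  sigmoid_noise = 1.0,     # pre-sigmoid noise encourages near-binary attention during training
  score_bias_init = -4.0,  # a negative initial bias is recommended for long memories
  mode = "parallel"
)
```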
This type of attention enforces a monotonic constraint on the attention distributions; that is, once the model attends to a given point in the memory, it cannot attend to any prior points at subsequent output timesteps. It achieves this by using the _monotonic_probability_fn instead of softmax to construct its attention distributions. Since the attention scores are passed through a sigmoid, a learnable scalar bias parameter is applied after the score function and before the sigmoid. Otherwise, it is equivalent to BahdanauAttention. This approach is proposed in
Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, Douglas Eck, "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. https://arxiv.org/abs/1704.00784
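For illustration only, a rough sketch (under assumptions, not the library's internals) of the sigmoid-based step described above: Gaussian noise scaled by `sigmoid_noise` and the learnable score bias are added to the raw energies before the sigmoid, and the resulting per-timestep probabilities are then combined monotonically (by tfa.seq2seq.monotonic_attention in the actual implementation) rather than normalized with softmax. The helper name below is hypothetical.

```r
library(tensorflow)

# Hypothetical helper (not library code): per-timestep "attend" probabilities
# computed from raw attention energies, as described above.
monotonic_probabilities <- function(score, score_bias, sigmoid_noise = 0) {
  if (sigmoid_noise > 0) {
    # Pre-sigmoid Gaussian noise with standard deviation `sigmoid_noise`.
    score <- score + sigmoid_noise * tf$random$normal(tf$shape(score))
  }
  # The learnable scalar bias is added after the score function, before the sigmoid.
  tf$sigmoid(score + score_bias)
}
```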