Skip gram sample — skip_gram

Generates skip-gram token and label paired Tensors from the input tensor.

skip_gram_sample(
  input_tensor,
  min_skips = 1,
  max_skips = 5,
  start = 0,
  limit = -1,
  emit_self_as_target = FALSE,
  vocab_freq_table = NULL,
  vocab_min_count = NULL,
  vocab_subsampling = NULL,
  corpus_size = NULL,
  batch_size = NULL,
  batch_capacity = NULL,
  seed = NULL,
  name = NULL
)

Arguments

input_tensor	A rank-1 `Tensor` from which to generate skip-gram candidates.
min_skips	`int` or scalar `Tensor` specifying the minimum window size to randomly use for each token. Must be >= 0 and <= `max_skips`. If `min_skips` and `max_skips` are both 0, the only label emitted is the token itself when `emit_self_as_target = TRUE`; otherwise nothing is emitted.
max_skips	`int` or scalar `Tensor` specifying the maximum window size to randomly use for each token. Must be >= 0.
start	`int` or scalar `Tensor` specifying the position in `input_tensor` from which to start generating skip-gram candidates.
limit	`int` or scalar `Tensor` specifying the maximum number of elements in `input_tensor` to use in generating skip-gram candidates. -1 means to use the rest of the `Tensor` after `start`.
emit_self_as_target	`bool` or scalar `Tensor` specifying whether to emit each token as a label for itself.
vocab_freq_table	(Optional) A lookup table (subclass of `lookup.InitializableLookupTableBase`) that maps tokens to their raw frequency counts. If specified, any token in `input_tensor` that is not found in `vocab_freq_table` will be filtered out before generating skip-gram candidates. While this will typically map to integer raw frequency counts, it could also map to float frequency proportions. `vocab_min_count` and `corpus_size` should be in the same units as this.
vocab_min_count	(Optional) `int`, `float`, or scalar `Tensor` specifying minimum frequency threshold (from `vocab_freq_table`) for a token to be kept in `input_tensor`. If this is specified, `vocab_freq_table` must also be specified - and they should both be in the same units.
vocab_subsampling	(Optional) `float` specifying frequency proportion threshold for tokens from `input_tensor`. Tokens that occur more frequently (based on the ratio of the token's `vocab_freq_table` value to the `corpus_size`) will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. If this is specified, both `vocab_freq_table` and `corpus_size` must also be specified. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
corpus_size	(Optional) `int`, `float`, or scalar `Tensor` specifying the total number of tokens in the corpus (e.g., sum of all the frequency counts of `vocab_freq_table`). Used with `vocab_subsampling` for down-sampling frequently occurring tokens. If this is specified, `vocab_freq_table` and `vocab_subsampling` must also be specified.
batch_size	(Optional) `int` specifying batch size of returned `Tensors`.
batch_capacity	(Optional) `int` specifying batch capacity for the queue used for batching returned `Tensors`. Only has an effect if `batch_size` > 0. Defaults to 100 * `batch_size` if not specified.
seed	(Optional) `int` used to create a random seed for window size and subsampling. See `set_random_seed` docs for behavior.
name	(Optional) A `string` name or a name scope for the operations.
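
The vocabulary arguments above (`vocab_freq_table`, `vocab_min_count`, `vocab_subsampling`, `corpus_size`) can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `filter_by_vocab` and `keep_probability` are hypothetical, a plain `dict` stands in for the lookup table, and the keep probability follows Eq. 5 of the cited paper (discard probability `1 - sqrt(t / f)`), which may differ in detail from what the function actually computes.

```python
import math


def filter_by_vocab(tokens, freq_table, min_count=None):
    """Drop tokens missing from freq_table or rarer than min_count.

    freq_table is a plain dict standing in for the lookup table (assumption).
    """
    return [
        t for t in tokens
        if t in freq_table and (min_count is None or freq_table[t] >= min_count)
    ]


def keep_probability(count, corpus_size, threshold):
    """Probability of keeping a token under Eq. 5 of arXiv:1310.4546.

    The paper discards a token with probability 1 - sqrt(t / f), where f is
    the token's frequency proportion and t is the subsampling threshold.
    Tokens at or below the threshold are always kept.
    """
    freq = count / corpus_size
    if freq <= threshold:
        return 1.0
    return math.sqrt(threshold / freq)


# Hypothetical counts, in the same units as min_count and corpus_size.
counts = {"the": 100, "fox": 2}
kept = filter_by_vocab(["the", "zzz", "fox"], counts, min_count=3)
p_the = keep_probability(counts["the"], corpus_size=1000, threshold=1e-3)
```

Here `"zzz"` is dropped because it is not in the table, `"fox"` is dropped because its count is below `min_count`, and `"the"` (10% of the corpus) would be kept only about one time in ten.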

Value

A `list` containing (token, label) `Tensors`. Each output `Tensor` is of rank-1 and has the same type as `input_tensor`. The `Tensors` will be of length `batch_size`; if `batch_size` is not specified, they will be of random length, though they will be in sync with each other as long as they are evaluated together.

Details

Generates skip-gram `("token", "label")` pairs using each element in the rank-1 `input_tensor` as a token. The window size used for each token will be randomly selected from the range specified by `[min_skips, max_skips]`, inclusive. See https://arxiv.org/abs/1301.3781 for more details about skip-gram.

For example, given `input_tensor = ["the", "quick", "brown", "fox", "jumps"]`, `min_skips = 1`, `max_skips = 2`, `emit_self_as_target = FALSE`, the output `(tokens, labels)` pairs for the token "quick" will be randomly selected from either `(tokens=["quick", "quick"], labels=["the", "brown"])` for 1 skip, or `(tokens=["quick", "quick", "quick"], labels=["the", "brown", "fox"])` for 2 skips.

If `emit_self_as_target = TRUE`, each token will also be emitted as a label for itself. From the previous example, the output will be either `(tokens=["quick", "quick", "quick"], labels=["the", "quick", "brown"])` for 1 skip, or `(tokens=["quick", "quick", "quick", "quick"], labels=["the", "quick", "brown", "fox"])` for 2 skips.

The same process is repeated for each element of `input_tensor`, and the results are concatenated into the two output rank-1 `Tensors` (one for all the tokens, another for all the labels).

If `vocab_freq_table` is specified, tokens in `input_tensor` that are not present in the vocabulary are discarded. Tokens whose frequency counts are below `vocab_min_count` are also discarded. Tokens whose frequency proportions in the corpus exceed `vocab_subsampling` may be randomly down-sampled. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details about subsampling.

Due to the random window sizes used for each token, the lengths of the outputs are non-deterministic, unless `batch_size` is specified to batch the outputs to always return `Tensors` of length `batch_size`.
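
The windowing behaviour described in this section can be reproduced with a short pure-Python sketch. It only illustrates the pairing logic on plain lists; the helper names `skip_gram_pairs` and `skip_gram_sample_sketch` are hypothetical, and vocabulary filtering, `start`/`limit`, and batching are left out.

```python
import random


def skip_gram_pairs(tokens, index, window, emit_self_as_target=False):
    """Return (tokens, labels) lists for tokens[index] with a fixed window."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    labels = [
        tokens[j]
        for j in range(lo, hi)
        if emit_self_as_target or j != index
    ]
    return [tokens[index]] * len(labels), labels


def skip_gram_sample_sketch(tokens, min_skips=1, max_skips=5,
                            emit_self_as_target=False, seed=None):
    """Concatenate pairs for every token, drawing one window size per token."""
    rng = random.Random(seed)
    out_tokens, out_labels = [], []
    for i in range(len(tokens)):
        # Window size is drawn from [min_skips, max_skips], inclusive.
        window = rng.randint(min_skips, max_skips)
        t, l = skip_gram_pairs(tokens, i, window, emit_self_as_target)
        out_tokens.extend(t)
        out_labels.extend(l)
    return out_tokens, out_labels


sentence = ["the", "quick", "brown", "fox", "jumps"]
one_skip = skip_gram_pairs(sentence, 1, 1)
two_skips = skip_gram_pairs(sentence, 1, 2)
with_self = skip_gram_pairs(sentence, 1, 1, emit_self_as_target=True)
```

With the window fixed, the three calls give the same pairs as the "quick" example above. In `skip_gram_sample_sketch` the total output length depends on the windows drawn, which is why the real outputs have non-deterministic length unless `batch_size` is set.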

Raises

ValueError: If `vocab_freq_table` is not provided, but `vocab_min_count`, `vocab_subsampling`, or `corpus_size` is specified. Also raised if only one of `vocab_subsampling` and `corpus_size` is specified; they must be both present or both absent.
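
The two conditions above amount to a small argument check. The sketch below restates them in pure Python; `check_vocab_args` is a hypothetical helper written for illustration, not a function exported by the package, and the error messages are made up.

```python
def check_vocab_args(vocab_freq_table=None, vocab_min_count=None,
                     vocab_subsampling=None, corpus_size=None):
    """Raise ValueError for the argument combinations listed under Raises."""
    dependents = (vocab_min_count, vocab_subsampling, corpus_size)
    if vocab_freq_table is None and any(v is not None for v in dependents):
        raise ValueError(
            "vocab_freq_table is required when vocab_min_count, "
            "vocab_subsampling, or corpus_size is specified"
        )
    # The two subsampling arguments must be given together or not at all.
    if (vocab_subsampling is None) != (corpus_size is None):
        raise ValueError(
            "vocab_subsampling and corpus_size must both be present "
            "or both be absent"
        )


# Valid: a table (a dict stand-in here) with both subsampling arguments.
check_vocab_args({"the": 100}, vocab_subsampling=1e-3, corpus_size=1000)
```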