Skip-gram sampling with a text vocabulary file.

skip_gram_sample_with_text_vocab(
  input_tensor,
  vocab_freq_file,
  vocab_token_index = 0,
  vocab_token_dtype = tf$string,
  vocab_freq_index = 1,
  vocab_freq_dtype = tf$float64,
  vocab_delimiter = ",",
  vocab_min_count = NULL,
  vocab_subsampling = NULL,
  corpus_size = NULL,
  min_skips = 1,
  max_skips = 5,
  start = 0,
  limit = -1,
  emit_self_as_target = FALSE,
  batch_size = NULL,
  batch_capacity = NULL,
  seed = NULL,
  name = NULL
)

Arguments

input_tensor

A rank-1 `Tensor` from which to generate skip-gram candidates.

vocab_freq_file

`string` specifying full file path to the text vocab file.

vocab_token_index

`int` specifying which column in the text vocab file contains the tokens.

vocab_token_dtype

`DType` specifying the format of the tokens in the text vocab file.

vocab_freq_index

`int` specifying which column in the text vocab file contains the frequency counts of the tokens.

vocab_freq_dtype

`DType` specifying the format of the frequency counts in the text vocab file.

vocab_delimiter

`string` specifying the delimiter used in the text vocab file.

vocab_min_count

(Optional) `int`, `float`, or scalar `Tensor` specifying the minimum frequency (as listed in `vocab_freq_file`) that a token must have to be kept in `input_tensor`. Its type should match `vocab_freq_dtype`.

vocab_subsampling

(Optional) `float` specifying frequency proportion threshold for tokens from `input_tensor`. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details.
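As a rough illustration of what the threshold does, the sketch below evaluates the discard probability from Eq. 5 of the cited paper, `1 - sqrt(t / f(w))`, in base R. The helper name `subsample_discard_prob` and the counts are hypothetical, and the library may use a variant of this formula internally, so treat the numbers as indicative only.

```r
# Discard probability from Eq. 5 of Mikolov et al. (2013):
#   P(w) = 1 - sqrt(t / f(w))
# where f(w) is the token's share of the corpus and t is the threshold.
# `subsample_discard_prob` is a hypothetical helper, not a package function.
subsample_discard_prob <- function(counts, corpus_size, threshold) {
  freq <- counts / corpus_size
  # Tokens whose share is at or below the threshold are never discarded,
  # so negative values are clamped to 0.
  pmax(1 - sqrt(threshold / freq), 0)
}

# Made-up counts for three tokens in a corpus of one million tokens.
counts <- c(the = 50000, cat = 120, sat = 8)
corpus_size <- 1e6
p <- subsample_discard_prob(counts, corpus_size, threshold = 1e-3)
print(round(p, 3))
```

Only the very frequent token gets a non-zero discard probability here; lowering the threshold towards 1e-5 would start to affect mid-frequency tokens as well.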

corpus_size

(Optional) `int`, `float`, or scalar `Tensor` specifying the total number of tokens in the corpus (e.g., the sum of all the frequency counts in `vocab_freq_file`). Used with `vocab_subsampling` for down-sampling frequently occurring tokens. If this is specified, `vocab_freq_file` and `vocab_subsampling` must also be specified. If `corpus_size` is needed but not supplied, it is calculated from `vocab_freq_file`. Supplying your own value is useful when you have already removed infrequent tokens (frequency < `vocab_min_count`) from the vocabulary file to save memory in the internal token lookup table, because the counts left in the file then no longer add up to the true corpus size. A user-supplied `corpus_size` must be greater than or equal to the sum of all the frequency counts in `vocab_freq_file`.
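The relationship between a pruned vocabulary file and `corpus_size` can be sketched in base R as follows. The data frame and the threshold of 50 are made up for illustration; the point is only that the total should be computed before pruning.

```r
# Hypothetical full vocabulary with corpus frequency counts.
vocab <- data.frame(
  token = c("bonjour", "hello", "hola"),
  freq  = c(42, 777, 99),
  stringsAsFactors = FALSE
)

# Corpus size computed before any pruning: the sum of all counts.
corpus_size <- sum(vocab$freq)

# Drop infrequent tokens, as one might do before writing the vocab file.
vocab_min_count <- 50
pruned <- vocab[vocab$freq >= vocab_min_count, ]

# The pruned counts sum to less than the true corpus size, so the original
# total is the value to pass as `corpus_size`.
stopifnot(corpus_size >= sum(pruned$freq))
```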

min_skips

`int` or scalar `Tensor` specifying the minimum window size to randomly use for each token. Must be >= 0 and <= `max_skips`. If `min_skips` and `max_skips` are both 0, the only label emitted will be the token itself.

max_skips

`int` or scalar `Tensor` specifying the maximum window size to randomly use for each token. Must be >= 0.
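To make the roles of `min_skips` and `max_skips` concrete, here is a minimal base R sketch of window-based pair generation. It is an illustration of the general skip-gram idea under simple assumptions (uniform window sizes, no subsampling, no self-labels), not the library's implementation, and `skip_gram_pairs` is a hypothetical name.

```r
# Illustrative skip-gram pair generation for a character vector of tokens.
# For each position, draw a window size in [min_skips, max_skips] and pair
# the token with every neighbour inside that window.
skip_gram_pairs <- function(tokens, min_skips = 1, max_skips = 5, seed = NULL) {
  stopifnot(min_skips >= 0, min_skips <= max_skips)
  if (!is.null(seed)) set.seed(seed)
  n <- length(tokens)
  out_tokens <- character(0)
  out_labels <- character(0)
  for (i in seq_len(n)) {
    # Draw an integer uniformly from min_skips..max_skips.
    window <- min_skips + sample.int(max_skips - min_skips + 1, 1) - 1
    lo <- max(1, i - window)
    hi <- min(n, i + window)
    # Every position in the window except the token's own position.
    neighbours <- setdiff(seq(lo, hi), i)
    out_tokens <- c(out_tokens, rep(tokens[i], length(neighbours)))
    out_labels <- c(out_labels, tokens[neighbours])
  }
  list(tokens = out_tokens, labels = out_labels)
}

# With min_skips == max_skips the window is fixed, so the result is
# deterministic: each token is paired with its immediate neighbours.
result <- skip_gram_pairs(c("the", "quick", "brown", "fox"),
                          min_skips = 1, max_skips = 1)
print(data.frame(token = result$tokens, label = result$labels))
```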

start

`int` or scalar `Tensor` specifying the position in `input_tensor` from which to start generating skip-gram candidates.

limit

`int` or scalar `Tensor` specifying the maximum number of elements in `input_tensor` to use in generating skip-gram candidates. -1 means to use the rest of the `Tensor` after `start`.
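The interaction of `start` and `limit` can be pictured with the base R sketch below. It assumes that `start` is a 0-based offset (as its default of 0 suggests) and that a `limit` larger than the remaining input is simply truncated; both are assumptions made for illustration, and `candidate_slice` is a hypothetical helper.

```r
# Select the part of the input that would be considered for skip-grams.
# Assumptions: `start` is 0-based, `limit = -1` means "everything after
# `start`", and an oversized `limit` is truncated to what remains.
candidate_slice <- function(x, start = 0, limit = -1) {
  stopifnot(start >= 0, start <= length(x))
  remaining <- length(x) - start
  n <- if (limit < 0) remaining else min(limit, remaining)
  # R vectors are 1-based, so shift the 0-based offset by one.
  x[start + seq_len(n)]
}

tokens <- c("a", "b", "c", "d", "e")
print(candidate_slice(tokens))                        # whole input
print(candidate_slice(tokens, start = 2))             # from the third element on
print(candidate_slice(tokens, start = 1, limit = 2))  # two elements after the first
```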

emit_self_as_target

`bool` or scalar `Tensor` specifying whether to emit each token as a label for itself.

batch_size

(Optional) `int` specifying batch size of returned `Tensors`.

batch_capacity

(Optional) `int` specifying batch capacity for the queue used for batching returned `Tensors`. Only has an effect if `batch_size` > 0. Defaults to 100 * `batch_size` if not specified.

seed

(Optional) `int` used to create a random seed for window size and subsampling. See [`set_random_seed`](../../g3doc/python/constant_op.md#set_random_seed) for behavior.

name

(Optional) A `string` name or a name scope for the operations.

Value

A `list` containing (token, label) `Tensors`. Each output `Tensor` is rank-1 and has the same type as `input_tensor`. The `Tensors` will be of length `batch_size`; if `batch_size` is not specified, their length will vary randomly, but the two will stay in sync with each other as long as they are evaluated together.

Details

Wrapper around `skip_gram_sample()` for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of `vocab_delimiter`-separated columns. The `vocab_token_index` column should contain the vocabulary term, while the `vocab_freq_index` column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

```
bonjour,fr,42
hello,en,777
hola,es,99
```

You should set `vocab_delimiter=","`, `vocab_token_index=0`, and `vocab_freq_index=2`.

See `skip_gram_sample()` documentation for more details about the skip-gram sampling process.
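To check that the column indices point at the intended columns, the example file can be written out and read back with base R. This sketch does not call the sampling function; it only shows how the 0-based `vocab_token_index` and `vocab_freq_index` map onto the file, using a temporary path chosen for illustration.

```r
# Write the example vocabulary to a temporary file.
vocab_path <- tempfile(fileext = ".csv")
writeLines(c("bonjour,fr,42", "hello,en,777", "hola,es,99"), vocab_path)

# Column positions as they would be passed to the function (0-based).
vocab_token_index <- 0
vocab_freq_index <- 2

# Base R columns are 1-based, so shift the indices by one when reading.
cols <- read.table(vocab_path, sep = ",", header = FALSE,
                   stringsAsFactors = FALSE)
tokens <- cols[[vocab_token_index + 1]]
freqs <- cols[[vocab_freq_index + 1]]

print(data.frame(token = tokens, freq = freqs))
```

If the printed tokens or counts look wrong, the indices or the delimiter most likely do not match the file layout.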

Raises

ValueError: If `vocab_token_index` or `vocab_freq_index` is less than 0 or exceeds the number of columns in `vocab_freq_file`; if `vocab_token_index` and `vocab_freq_index` are both set to the same column; or if any token in `vocab_freq_file` has a negative frequency.
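The same conditions can be checked up front in R before the function is called. The sketch below is a hypothetical pre-flight check, not part of the package; it assumes 0-based column indices, so an index equal to the number of columns is treated as out of range.

```r
# Hypothetical pre-flight check mirroring the error conditions listed above.
# Indices are assumed to be 0-based, so valid values are 0..(n_columns - 1).
validate_vocab_columns <- function(token_index, freq_index, n_columns, freqs) {
  for (idx in c(token_index, freq_index)) {
    if (idx < 0 || idx >= n_columns) {
      stop("column index ", idx, " is outside 0..", n_columns - 1)
    }
  }
  if (token_index == freq_index) {
    stop("token and frequency columns must differ")
  }
  if (any(freqs < 0)) {
    stop("frequencies must be non-negative")
  }
  invisible(TRUE)
}

# Passes silently for the three-column example file.
validate_vocab_columns(token_index = 0, freq_index = 2, n_columns = 3,
                       freqs = c(42, 777, 99))
```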