Skip-gram sampling with a text vocabulary file.
```
skip_gram_sample_with_text_vocab(
  input_tensor,
  vocab_freq_file,
  vocab_token_index = 0,
  vocab_token_dtype = tf$string,
  vocab_freq_index = 1,
  vocab_freq_dtype = tf$float64,
  vocab_delimiter = ",",
  vocab_min_count = NULL,
  vocab_subsampling = NULL,
  corpus_size = NULL,
  min_skips = 1,
  max_skips = 5,
  start = 0,
  limit = -1,
  emit_self_as_target = FALSE,
  batch_size = NULL,
  batch_capacity = NULL,
  seed = NULL,
  name = NULL
)
```
Argument | Description |
---|---|
input_tensor | A rank-1 `Tensor` from which to generate skip-gram candidates. |
vocab_freq_file | `string` specifying full file path to the text vocab file. |
vocab_token_index | `int` specifying which column in the text vocab file contains the tokens. |
vocab_token_dtype | `DType` specifying the format of the tokens in the text vocab file. |
vocab_freq_index | `int` specifying which column in the text vocab file contains the frequency counts of the tokens. |
vocab_freq_dtype | `DType` specifying the format of the frequency counts in the text vocab file. |
vocab_delimiter | `string` specifying the delimiter used in the text vocab file. |
vocab_min_count | `int`, `float`, or scalar `Tensor` specifying minimum frequency threshold (from `vocab_freq_file`) for a token to be kept in `input_tensor`. This should correspond with `vocab_freq_dtype`. |
vocab_subsampling | (Optional) `float` specifying a frequency proportion threshold for tokens from `input_tensor`. Tokens that occur more frequently will be randomly down-sampled. Reasonable starting values may be around 1e-3 or 1e-5. See Eq. 5 in http://arxiv.org/abs/1310.4546 for more details; the formula is sketched after this table. |
corpus_size | (Optional) `int`, `float`, or scalar `Tensor` specifying the total number of tokens in the corpus (e.g., sum of all the frequency counts of `vocab_freq_file`). Used with `vocab_subsampling` for down-sampling frequently occurring tokens. If this is specified, `vocab_freq_file` and `vocab_subsampling` must also be specified. If `corpus_size` is needed but not supplied, then it will be calculated from `vocab_freq_file`. You might want to supply your own value if you have already eliminated infrequent tokens from your vocabulary files (where frequency < vocab_min_count) to save memory in the internal token lookup table. Otherwise, the unused tokens' variables will waste memory. The user-supplied `corpus_size` value must be greater than or equal to the sum of all the frequency counts of `vocab_freq_file`. |
min_skips | `int` or scalar `Tensor` specifying the minimum window size to randomly use for each token. Must be >= 0 and <= `max_skips`. If `min_skips` and `max_skips` are both 0, the only label emitted will be the token itself. |
max_skips | `int` or scalar `Tensor` specifying the maximum window size to randomly use for each token. Must be >= 0. |
start | `int` or scalar `Tensor` specifying the position in `input_tensor` from which to start generating skip-gram candidates. |
limit | `int` or scalar `Tensor` specifying the maximum number of elements in `input_tensor` to use in generating skip-gram candidates. -1 means to use the rest of the `Tensor` after `start`. |
emit_self_as_target | `bool` or scalar `Tensor` specifying whether to emit each token as a label for itself. |
batch_size | (Optional) `int` specifying batch size of returned `Tensors`. |
batch_capacity | (Optional) `int` specifying batch capacity for the queue used for batching returned `Tensors`. Only has an effect if `batch_size` > 0. Defaults to 100 * `batch_size` if not specified. |
seed | (Optional) `int` used to create a random seed for window size and subsampling. See `set_random_seed` for behavior. |
name | (Optional) A `string` name or a name scope for the operations. |
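The down-sampling rule referenced in the `vocab_subsampling` row above is Eq. 5 of Mikolov et al. (http://arxiv.org/abs/1310.4546). As a sketch (the internal implementation may use a smoothed variant of this), each occurrence of a token $w_i$ is discarded with probability:

```latex
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
```

where `f(w_i)` is the frequency proportion of `w_i` in the corpus and `t` is the `vocab_subsampling` threshold.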
A `list` containing (token, label) `Tensors`. Each output `Tensor` is of rank-1 and has the same type as `input_tensor`. The `Tensors` will be of length `batch_size`; if `batch_size` is not specified, they will be of random length, though they will be in sync with each other as long as they are evaluated together.
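A minimal sketch of the "evaluated together" caveat, assuming a TF1-style session from the `tensorflow` R package and a `tokens`/`labels` pair already returned by this function (the variable names are illustrative, not part of the API):

```
# tokens and labels come from skip_gram_sample_with_text_vocab()
sess <- tf$Session()

# In sync: both tensors are fetched in the same run() call,
# so they describe the same randomly sampled (token, label) pairs.
res <- sess$run(list(tokens, labels))

# Out of sync: each run() re-samples the random window sizes,
# so these two fetches do NOT correspond to the same pairs.
t_only <- sess$run(tokens)
l_only <- sess$run(labels)
```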
Wrapper around `skip_gram_sample()` for use with a text vocabulary file. The vocabulary file is expected to be a plain-text file, with lines of `vocab_delimiter`-separated columns. The `vocab_token_index` column should contain the vocabulary term, while the `vocab_freq_index` column should contain the number of times that term occurs in the corpus. For example, with a text vocabulary file of:

```
bonjour,fr,42
hello,en,777
hola,es,99
```

you should set `vocab_delimiter=","`, `vocab_token_index=0`, and `vocab_freq_index=2`. See the `skip_gram_sample()` documentation for more details about the skip-gram sampling process.
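Putting the pieces together, here is a hedged end-to-end sketch. The file name, token values, and thresholds are illustrative; only the function itself and `tf$string`-typed input follow from the signature above. It also shows how `corpus_size` can be derived as the sum of the frequency column when you want to supply it explicitly:

```
library(tensorflow)

# Write the example vocabulary from above to an illustrative file.
writeLines(c("bonjour,fr,42", "hello,en,777", "hola,es,99"), "vocab.csv")

# corpus_size may be supplied as the sum of the frequency column
# (zero-based column index 2 here); otherwise it is derived from the file.
freqs <- read.csv("vocab.csv", header = FALSE)
corpus_size <- sum(freqs[[3]])   # 42 + 777 + 99 = 918

input <- tf$constant(c("hello", "bonjour", "hola", "hello"))

pair <- skip_gram_sample_with_text_vocab(
  input_tensor = input,
  vocab_freq_file = "vocab.csv",
  vocab_token_index = 0L,   # tokens in the first column
  vocab_freq_index = 2L,    # counts in the third column
  vocab_delimiter = ",",
  vocab_min_count = 10,
  vocab_subsampling = 1e-3,
  corpus_size = corpus_size,
  min_skips = 1L,
  max_skips = 2L
)
```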
`ValueError`: if `vocab_token_index` or `vocab_freq_index` is less than 0 or exceeds the number of columns in `vocab_freq_file`; if `vocab_token_index` and `vocab_freq_index` are both set to the same column; or if any token in `vocab_freq_file` has a negative frequency.