text_tokenizer
Text tokenization utility
Description
Vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
Usage
```r
text_tokenizer(
  num_words = NULL,
  filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
  lower = TRUE,
  split = " ",
  char_level = FALSE,
  oov_token = NULL
)
```
Arguments
Argument | Description
---|---
num_words | the maximum number of words to keep, based on word frequency. Only the most common num_words words will be kept. |
filters | a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character. |
lower | boolean. Whether to convert the texts to lowercase. |
split | character or string to use for token splitting. |
char_level | if TRUE, every character will be treated as a token. |
oov_token | NULL or string. If given, it will be added to `word_index` and used to replace out-of-vocabulary words during `texts_to_sequences()` calls. |
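As an illustration, here is a minimal sketch (the corpus and the `"<unk>"` token name are made up for the example) that keeps only the most frequent words and maps everything else to an explicit out-of-vocabulary token:

```r
library(keras)

# Keep the 10 most frequent words; rarer words map to the OOV token.
# ("<unk>" is just an illustrative choice of token.)
tokenizer <- text_tokenizer(num_words = 10, oov_token = "<unk>")

# Hypothetical two-text corpus; fitting builds the word index and counts.
texts <- c("The cat sat on the mat.", "The dog sat on the log.")
tokenizer %>% fit_text_tokenizer(texts)

# "hippopotamus" was never seen during fit, so it is replaced by the
# index assigned to "<unk>".
texts_to_sequences(tokenizer, "The cat sat on the hippopotamus.")
```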
Details
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized. 0 is a reserved index that won't be assigned to any word.
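As a short illustration of these defaults, here is a minimal sketch with a hypothetical two-text corpus:

```r
library(keras)

tokenizer <- text_tokenizer()
tokenizer %>% fit_text_tokenizer(c("Hello, world!", "Hello again"))

# "Hello," and "Hello" collapse to the same token after filtering and
# lowercasing. Indices are assigned by frequency starting at 1, because
# 0 is reserved: hello = 1, world = 2, again = 3.
tokenizer$word_index
```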
Attributes
The tokenizer object has the following attributes:
- `word_counts`: named list mapping words to the number of times they appeared during fit. Only set after `fit_text_tokenizer()` is called on the tokenizer.
- `word_docs`: named list mapping words to the number of documents/texts they appeared in during fit. Only set after `fit_text_tokenizer()` is called on the tokenizer.
- `word_index`: named list mapping words to their rank/index (integer). Only set after `fit_text_tokenizer()` is called on the tokenizer.
- `document_count`: integer. The number of documents (texts/sequences) the tokenizer was trained on. Only set after `fit_text_tokenizer()` is called on the tokenizer.
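For instance, after fitting on a small hypothetical corpus, the attributes can be read directly off the tokenizer object:

```r
library(keras)

tokenizer <- text_tokenizer()
tokenizer %>% fit_text_tokenizer(c("one fish two fish", "red fish blue fish"))

tokenizer$document_count  # 2: number of texts seen during fit
tokenizer$word_counts     # "fish" was seen 4 times across the corpus
tokenizer$word_docs       # "fish" occurs in both documents
tokenizer$word_index      # "fish" is most frequent, so its index is 1
```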
See Also
Other text tokenization: fit_text_tokenizer(), save_text_tokenizer(), sequences_to_matrix(), texts_to_matrix(), texts_to_sequences_generator(), texts_to_sequences()