text_hashing_trick
Converts a text to a sequence of indexes in a fixed-size hashing space.
Description
Converts a text to a sequence of indexes in a fixed-size hashing space.
Usage
text_hashing_trick(
text,
n, hash_function = NULL,
filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
lower = TRUE,
split = " "
)
Arguments
Arguments | Description |
---|---|
text | Input text (string). |
n | Dimension of the hashing space. |
hash_function | if NULL uses the Python hash() function. Otherwise can be 'md5' or any function that takes in input a string and returns an int. Note that hash is not a stable hashing function, so it is not consistent across different runs, while 'md5' is a stable hashing function. |
filters | Sequence of characters to filter out such as punctuation. Default includes basic punctuation, tabs, and newlines. |
lower | Whether to convert the input to lowercase. |
split | Sentence split marker (string). |
Details
Two or more words may be assigned to the same index, due to possible collisions by the hashing function.
Value
A list of integer word indices (unicity non-guaranteed).
See Also
Other text preprocessing: make_sampling_table()
, pad_sequences()
, skipgrams()
, text_one_hot()
, text_to_word_sequence()