Decode UTF8 tokens into code points and return their bits.
text.utf8_binarize(
    tokens, word_length=16, bits_per_char=24, replacement_char=65533, name=None
)
See the RetVec paper for details.
Example:
code_points = utf8_binarize("hello", word_length=3, bits_per_char=4)
print(code_points.numpy())
[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
The code points are encoded bitwise in little-endian order.
The inner dimension of the output is always word_length * bits_per_char:
extra characters are truncated, missing characters are padded,
and only the bits_per_char lowest bits of each code point are stored.
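For intuition, the bit layout can be reproduced in plain Python. The sketch below is an illustrative stand-in, not the library implementation; it assumes missing characters are padded with zero bits.

```python
import numpy as np

def manual_binarize(token, word_length=16, bits_per_char=24):
    """Illustrative re-creation of the utf8_binarize bit layout for one token."""
    code_points = [ord(c) for c in token[:word_length]]    # truncate extra characters
    code_points += [0] * (word_length - len(code_points))  # pad missing characters (assumed zero)
    bits = []
    for cp in code_points:
        # Keep the bits_per_char lowest bits, least-significant bit first (little-endian).
        bits.extend(float((cp >> i) & 1) for i in range(bits_per_char))
    return np.array(bits, dtype=np.float32)

print(manual_binarize("hello", word_length=3, bits_per_char=4))
# [0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
```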
Decoding errors (which applications often replace with the character
U+FFFD "REPLACEMENT CHARACTER", code point 65533) are represented by the
bits_per_char lowest bits of replacement_char.
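As a quick check of that convention (a hedged illustration, not library code), the pattern stored for an undecodable character with the default replacement_char is just the lowest bits of 65533 in little-endian order:

```python
replacement_char = 65533  # U+FFFD REPLACEMENT CHARACTER (the default)
bits_per_char = 4         # small value for readability
replacement_bits = [(replacement_char >> i) & 1 for i in range(bits_per_char)]
print(replacement_bits)   # [1, 0, 1, 1] -- the lowest 4 bits of 0xFFFD, LSB first
```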
| Returns |
|---|
| A tensor of floating-point zero and one values corresponding to the bits of the token characters' Unicode code points. Shape: [<shape of tokens>, word_length * bits_per_char]. |
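A batched usage sketch, assuming tokens is a rank-1 string tensor and the conventional import tensorflow_text as text; the output shape follows the table above:

```python
import tensorflow as tf
import tensorflow_text as text

tokens = tf.constant(["hello", "world"])  # shape: [2]
bits = text.utf8_binarize(tokens, word_length=3, bits_per_char=4)
print(bits.shape)                         # (2, 12), i.e. [2, word_length * bits_per_char]
```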