Decode UTF8 tokens into code points and return their bits.
text.utf8_binarize(
    tokens, word_length=16, bits_per_char=24, replacement_char=65533, name=None
)
See the RetVec paper for details.
Example:
code_points = utf8_binarize("hello", word_length=3, bits_per_char=4)
print(code_points.numpy())
[0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
The code points are encoded bitwise in little-endian order.
The inner dimension of the output is always word_length * bits_per_char:
extra characters are truncated, missing characters are padded,
and only the bits_per_char lowest bits of each code point are stored.
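For intuition, the bit layout can be reproduced in plain Python. The sketch below is an illustrative stand-in, not the library implementation; it assumes missing characters are padded with zero bits.

```python
import numpy as np

def manual_binarize(token, word_length=16, bits_per_char=24):
    """Illustrative re-creation of the utf8_binarize bit layout for one token."""
    code_points = [ord(c) for c in token[:word_length]]    # truncate extra characters
    code_points += [0] * (word_length - len(code_points))  # pad missing characters (assumed zero)
    bits = []
    for cp in code_points:
        # Keep the bits_per_char lowest bits, least-significant bit first (little-endian).
        bits.extend(float((cp >> i) & 1) for i in range(bits_per_char))
    return np.array(bits, dtype=np.float32)

print(manual_binarize("hello", word_length=3, bits_per_char=4))
# [0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.]
```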
Decoding errors (which applications often replace with the character
U+FFFD "REPLACEMENT CHARACTER", code point 65533) are represented by the
bits_per_char lowest bits of replacement_char.
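As a quick check of that convention (a hedged illustration, not library code), the pattern stored for an undecodable character with the default replacement_char is just the lowest bits of 65533 in little-endian order:

```python
replacement_char = 65533  # U+FFFD REPLACEMENT CHARACTER (the default)
bits_per_char = 4         # small value for readability
replacement_bits = [(replacement_char >> i) & 1 for i in range(bits_per_char)]
print(replacement_bits)   # [1, 0, 1, 1] -- the lowest 4 bits of 0xFFFD, LSB first
```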
| Returns |
|---|
| A tensor of floating-point zero and one values corresponding to the bits of the token characters' Unicode code points. Shape: [<shape of tokens>, word_length * bits_per_char]. |
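A batched usage sketch, assuming tokens is a rank-1 string tensor and the conventional import tensorflow_text as text; the output shape follows the table above:

```python
import tensorflow as tf
import tensorflow_text as text

tokens = tf.constant(["hello", "world"])  # shape: [2]
bits = text.utf8_binarize(tokens, word_length=3, bits_per_char=4)
print(bits.shape)                         # (2, 12), i.e. [2, word_length * bits_per_char]
```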