Terms vector API should document the encoding which is used to compute the offsets #4363

@jpountz

The new term vectors API exposes offsets. However, these offsets are computed in UTF-16 code units, so they will look buggy if applied to a string that is indexed any other way. In particular, in a language such as Python 3, where strings are indexed by Unicode code point rather than by UTF-16 code unit (or when working directly with UTF-8 bytes), you shouldn't use these offsets directly to compute sub-strings: any character outside the Basic Multilingual Plane takes two code units in UTF-16 and shifts every later offset.
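To illustrate the mismatch, here is a minimal Python 3 sketch. It assumes the offsets are start/end positions counted in UTF-16 code units, as described above; the helper name `slice_by_utf16_offsets` and the sample text and offsets are hypothetical and not part of any API.

```python
def slice_by_utf16_offsets(text: str, start: int, end: int) -> str:
    """Return the substring addressed by offsets counted in UTF-16 code units.

    Hypothetical helper: encodes to UTF-16 (little-endian, no BOM), where every
    code unit is exactly 2 bytes, slices the bytes, and decodes the result.
    """
    encoded = text.encode("utf-16-le")
    return encoded[2 * start:2 * end].decode("utf-16-le")


# The emoji is outside the Basic Multilingual Plane, so it takes
# two UTF-16 code units but only one Python str index.
text = "😀 quick fox"

# Assumed UTF-16 offsets for the token "quick": start=3, end=8.
naive = text[3:8]                               # slices by code point
correct = slice_by_utf16_offsets(text, 3, 8)    # slices by UTF-16 code unit

print(repr(naive))    # 'uick '
print(repr(correct))  # 'quick'
```

For text that contains only Basic Multilingual Plane characters the two slices agree, which is why the problem can go unnoticed until an emoji or a supplementary-plane character appears.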