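The new term vectors API exposes offsets. However, these offsets are counted in UTF-16 code units, so they are going to look buggy if applied to a string that is indexed any other way. In particular, I'm thinking that if you are using a language such as Python 3, whose strings are indexed by Unicode code point rather than by UTF-16 code unit, you shouldn't use these offsets directly to compute substrings: they drift as soon as the text contains a character outside the Basic Multilingual Plane (an emoji, for example), since such a character takes two UTF-16 code units but only one position in a Python string.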
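
To make the mismatch concrete, here is a minimal Python 3 sketch of one way a client could translate the offsets before slicing. The helper names (`utf16_offset_to_index`, `slice_by_utf16_offsets`) and the sample text and offsets are made up for illustration; they are not part of the term vectors API. The only assumption is that the offsets count UTF-16 code units, as described above.

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Map an offset counted in UTF-16 code units to an index into a Python str.

    Hypothetical helper: characters above U+FFFF occupy two UTF-16 code
    units (a surrogate pair) but only one position in a Python 3 string.
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


def slice_by_utf16_offsets(text: str, start: int, end: int) -> str:
    """Return the substring covered by [start, end) UTF-16 code unit offsets."""
    return text[utf16_offset_to_index(text, start):utf16_offset_to_index(text, end)]


# Illustrative data only: "cat" starts at UTF-16 offset 5 because the emoji
# takes two code units, but at Python index 4.
text = "a\U0001F600b cat"
naive = text[5:8]                                # slices with the raw offsets
fixed = slice_by_utf16_offsets(text, 5, 8)       # translates the offsets first
```

With this sample, the naive slice drops the first letter of the term, while the translated slice returns the whole term. An equivalent shortcut is to encode the text as `utf-16-le`, slice the bytes at `2 * start` and `2 * end`, and decode the result, which avoids the loop at the cost of re-encoding the text.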