DOCS: add add_prefix_space to processors.ByteLevel #1878
DOCS: add add_prefix_space to processors.ByteLevel #1878ArthurZucker merged 10 commits intohuggingface:mainfrom
add_prefix_space to processors.ByteLevel #1878Conversation
add_prefix_space to processors.ByteLevel
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM but lets update to proper doc
bindings/python/src/processors.rs
Outdated
| /// | ||
| /// add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`True`): | ||
| /// Whether the add_prefix_space option was enabled during pre-tokenization. This |
There was a problem hiding this comment.
the doc is wrong in the sense that this does not remove the space in the post processor nor does it add it. It just accounts for it in the offset computation if trim_offsets.
// If we are processing the first pair of offsets, with `add_prefix_space`,
// then we shouldn't remove anything we added. If there are more than one
// leading spaces though, it means we didn't add them, and they should be
// removed.try to incorporate this please!
There was a problem hiding this comment.
Thanks for the review, I updated the docs. I somehow misunderstood the meaning of add_prefix_space for the postprocessor, but it simply refers to the preprocessor and has more of a technical meaning here. To actually understand the meaning I added some code to the tests and thought that might be helpful to keep. Let me know if you want further changes.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
just missing make style |
…okenizers into docs-add-prefix-space-1819
done |
😉 |
|
That was weird, nothing happened when I ran |
|
you need to cargo build or |
|
Ty! |
Ah right, thanks! |
closes #1819
We already support
add_prefix_spaceforprocessors.ByteLevel, it is just not documented since this class is defined via pyo3'spyclass, see here. See further down the signature for thenewfunction, which mirrors python's__new__method: https://github.com/huggingface/tokenizers/blob/main/bindings/python/src/processors.rs#L492.