Skip to content

Automatically stubbing the pyi files while keeping inspecting ability#509

Merged
n1t0 merged 14 commits intohuggingface:masterfrom
Narsil:auto-stub
Nov 17, 2020
Merged

Automatically stubbing the pyi files while keeping inspecting ability#509
n1t0 merged 14 commits intohuggingface:masterfrom
Narsil:auto-stub

Conversation

@Narsil
Copy link
Copy Markdown
Contributor

@Narsil Narsil commented Nov 6, 2020

Context:

  • Maintaining both python rust docs and pyi files was becoming tedious
    and would often end up being wrong or outdated.
  • Rust PyO3 is rather good at producing inspectable objects from python side
    However, code editors usually don't use real python code to inspect.
    That was the reason for the existence of pyi files.

Proposed change:

  • We will use inspect module to go through the rust bindings, and generate all python stub files
    accordingly.
  • Now all __init__.py and __init__.pyi will be written by stub.py file. Edit: With the generated header
  • That actually means we need to drop typing annotations (inspect.signature(__text_signature__)
    does not support them link as some environments do use inspect (most REPLs, iPython, notebooks..)
    another issue with annotations was importing them wherever they were defined which poses a challenge
    for custom defined ones.
  • We also drop custom ENUMs that only emulated string conversion, in favor of explicit strings
    with a comment detailing available options. We could have a regular python file implement them
    but then stub.py would need to import them, which it does not have a simple mechanism for.
  • python stub.py --check will now check that we correctly ran the command as code style to make sure
    we are in sync with the rust doc.
  • Both python stub.py and python stub.py --check need to run after the lib was compiled with
    python setup.py develop or something similar to inspect the actual lib.
    Because of that, github workflow was changed to run check-style after building the lib
  • Class attributes are implemented as @property to be able to document them. Currently
    no good system to detect presence of a @XXX.setter was discovered. It seems all attributes
    do have a setter, but yields and exception when run on #[getter] only. It should not matter
    that much to autocomplete facilities.
  • implementations is kept in place as stub.py will simply ignore unknown files keeping the ability
    to add new python code, just not in the ones written through rust.

Limits:

  • We now have 2 places for black conf, the Makefile and the stub.py but we should not get bitten
    by this, as the real black runs after stub.py.
  • No safeguard against overwriting real python files right now.

@Narsil Narsil requested a review from n1t0 November 6, 2020 11:26
Copy link
Copy Markdown
Contributor

@n1t0 n1t0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! That is a great addition and I'm really happy that we won't have to sync these .pyi files manually anymore! 🎉

I must say I'm strongly against generating the .py files though. This doesn't really add any value while on the contrary, it takes away a good part of our freedom while also degrading the user experience.
Also, we shouldn't be introducing breaking changes for the purpose of generating files automatically. Maintaining the .py files is a breeze, and getting rid of the .pyi is where 99% of the value lies.

So let's generate only the .pyi files.

///
#[pyclass(dict, module = "tokenizers", name=AddedToken)]
#[text_signature = "(content, single_word=False, lstrip=False, rstrip=False, normalized=True)"]
#[text_signature = "(self, content, single_word=False, lstrip=False, rstrip=False, normalized=True)"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably remove all the self everywhere and document all the pyclass using __new__ instead. (I guess its the reason you add them in the first place?). This would be more accurate, what do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well all code was previously documenting init. I feel it's more pythonic than new even though it's really a __new__ . self would become cls then.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, seems fair. I don't think it changes anything anyway.

Comment thread bindings/python/stub.py Outdated

def do_black(content, is_pyi):
mode = black.Mode(
target_versions={black.TargetVersion.PY36},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still using PY35 everywhere else, so we should stay consistent here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True I thought I had changed this. my bad.

""" WordPiece Decoder """
def __init__(self, suffix="</w>"):
pass
def decode(self, tokens):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the inherited methods are duplicated on the children classes. I think we should remove them as the various IDE should be able to understand this by themselves. (Just like they would with plain Python files)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot more checks (checking mro, uptree for existing signatures, checking signatures are the same and so one).I I really don't think it's worth it as it's generated code now. The possibility of a bug that shadows existing functions outweighs the benefit of writing less generated code in my mind.

@Narsil
Copy link
Copy Markdown
Contributor Author

Narsil commented Nov 9, 2020

Latest iterations keeps the root __init__.py that contains more code than the others.
The way to get non overwritable __init__.py is simply done by removing Generated comment at the top of the file.

I still fill that getting as much stuff generated as possible will make quirks and inaccuracies in the bindings or docs as low as possible and should be a desired goal.

@n1t0
Copy link
Copy Markdown
Contributor

n1t0 commented Nov 9, 2020

I'm not a fan of this hybrid generation, where sometimes it gets automatically generated, sometimes it doesn't, and you have to remember to import things manually just in some cases. I'd prefer to have either 100% of the exports automatically generated, or just no generation at all (speaking about the .py files only here of course).

Also, I'd love to have it done automatically when calling python setup.py develop if possible. Otherwise, maybe some doc could be helpful.

@Narsil
Copy link
Copy Markdown
Contributor Author

Narsil commented Nov 9, 2020

I'm not a fan of this hybrid generation, where sometimes it gets automatically generated, sometimes it doesn't, and you have to remember to import things manually just in some cases. I'd prefer to have either 100% of the exports automatically generated, or just no generation at all (speaking about the .py files only here of course).

We're only talking about normalizers.__init__.py right ? How about deprecating normalizers.unicode_normalizer_from_str and moving it to implementations.utils.unicode_normalizer_from_str so that when we actually deprecate it, then we won't have that problem as all __init__,py will be generated (except the top level one, where we could definitely apply the same trick)

Also, I'd love to have it done automatically when calling python setup.py develop if possible. Otherwise, maybe some doc could be helpful.

I wanted to do that, but in order to inspect the library it needs to be built, and somehow it does not seem to work in the hook I found. https://www.anomaly.net.au/blog/running-pre-and-post-install-jobs-for-your-python-packages/ (After run import tokenizers would fail).
Also, not sure we want that with pip install -e git:/// either.

@n1t0
Copy link
Copy Markdown
Contributor

n1t0 commented Nov 9, 2020

We're only talking about normalizers.__init__.py right ? How about deprecating normalizers.unicode_normalizer_from_str and moving it to implementations.utils.unicode_normalizer_from_str so that when we actually deprecate it, then we won't have that problem as all __init__,py will be generated (except the top level one, where we could definitely apply the same trick)

At the moment, this concerns py_src/tokenizers/__init__.py and normalizers.__init__.py, but in the future we might need to modify some others. We want to be able to add typings, type aliases, enums, and anything else that might improve the user experience, and the documentation. Namely, we want to be able to add anything we need in those .py files. we want to be able to write some Python even though the core of the library is written in Rust.

I wanted to do that, but in order to inspect the library it needs to be built, and somehow it does not seem to work in the hook I found. https://www.anomaly.net.au/blog/running-pre-and-post-install-jobs-for-your-python-packages/ (After run import tokenizers would fail).
Also, not sure we want that with pip install -e git:/// either.

Well, if we want to go with this, I think we'll have to find a way. python setup.py install, pip install ., python setup.py bdist-wheel and python setup.py sdist should all continue to work. That's also what we use for building our wheels automatically.

@n1t0
Copy link
Copy Markdown
Contributor

n1t0 commented Nov 16, 2020

@Narsil what's the status on this? I'd like to update the Python API reference pretty soon, and there is a huge overlap with this PR so it'd be better to wrap this up first.

@Narsil
Copy link
Copy Markdown
Contributor Author

Narsil commented Nov 17, 2020

Well this PR is ready. You still can comment the code responsible for generating .py file if you really feel against it.

@n1t0 n1t0 merged commit 352c92a into huggingface:master Nov 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants