Allow arbitrary namespaces for Symbols by zdevito · Pull Request #9018 · pytorch/pytorch

zdevito · 2018-06-29T04:37:42Z

Context: I am updating jit::FunctionSchema to use Symbol name; rather than std::string name. Sometimes the name refers to a builtin thing like prim::UnpackTuple, sometimes to an aten operator like aten::add, and sometimes just to a raw string, like my_method_foo that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit.

This changes how Symbol is implemented. Now the namespace of a symbol
is optional and the namespaces themselves are Symbols.
This allows Symbol to be used with arbitrary namespaces, and allows
you to use Symbol as an simple interned string using via fromQualString
and toQualString without :: in the string. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.

Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed.

zdevito · 2018-06-29T04:57:09Z

@ezyang can you take a look since I think you were the last to work on interned_strings.cpp

facebook-github-bot

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang · 2018-06-29T13:55:14Z

If it's just a raw string, do you really want it to be interned? I'm not convinced that there isn't secretly a namespace lurking in the background here.

zdevito · 2018-06-29T16:39:41Z

@ezyang This is for FunctionSchema, where sometimes I want a raw string, because it is referring to a function name and I want it reported to the user as they wrote it. Other times it is used as the schema for an operator and I need the Symbol for the operator to be fully qualified. More generally, I think having symbol act as a true interned string is useful because they are useful in various places.

facebook-github-bot

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang · 2018-06-29T20:08:56Z

OK, but for example, say a user manages to hand you a function named "onnx::Blah" (I'm sure, somewhere, Python lets you put awful names on functions). Aren't you going to interpret this as an actually ONNX namespaced thing?

zdevito · 2018-06-29T20:30:37Z

Those two uses are already in different namespaces based on how those values are used. For a function, the name field is used to look it up in the lexical environment. Builtin operators are looked up in the operator registry. But FunctionSchema is used to represent the valid arguments in either case. This way, the code doesn't have to convert from interned string to std::string to use this data structure. Similarly, we really don't need the attr:: namespace either. Only onnx:: and aten:: actually need disambiguation.

ezyang · 2018-06-29T21:04:54Z

Let me build this and see if I can come up with an example of what I'm nervous about.

ezyang · 2018-06-29T21:32:17Z

I think my primary objection to the changes in this patch is that it makes it easier to do the wrong thing.

In the previous world, namespaces were mandatory. If you called a function f(Symbol) from Python, and forgot to attach a namespace, the parser would shout at you. In this PR, if you pass in an un-namespaced symbol, we will happily accept it, and you won't learn about the error of your ways until much later in the pipeline, at which point the backtrace when the bad symbol was passed in is lost. The Python typechecker is not going to give you any help here. (I guess you can fix this by removing the implicit Python string to Symbol conversion from the interface.)

In the previous world, you could have whatever junk you wanted after the namespace. attr::blah::baz is OK; because the xxx:: prefix is mandatory, there is never any ambiguity: everything after the first :: is the actual string in question. So you don't have to worry about accepting strings from users, because they cannot introduce ambiguity by passing you something weird you don't expect. In the new world, if you accept an unqualified name, you had better check it doesn't contain any ::, because if it does, you will get a symbol that is in the wrong namespace (it "should" have been in the none namespace, but it's actually in some random namespace "xxx::").

In the previous world, the mandatory namespace was used as a way to say what strings mean. Yes, technically attr didn't need to be namespaced, but you were never confused about what an attr-namespace string was: it was a key for an attribute key map. Now there's a pile of non-namespaced identifiers. What do they mean? Who knows: you have to look at the call sites.

(I remember in our original discussion, you were not 100% pleased with the namespacing refactor, and suggested we refactor Symbol into an interned string concept, which is just plain interned strings with no semantics, and a symbol concept, where we have semantics. I still think this is a good idea; a definite improvement over the old design. Though, I'm not sure you need it for your use case.)

zdevito · 2018-06-29T22:00:39Z

Now there's a pile of non-namespaced identifiers. What do they mean? Who knows: you have to look at the call sites.

I think we conceptually disagree here. I view Symbols as interned strings, full stop. They are a version of the string that makes switch statements possible in C++. I don't want them to be any more or less than that. The problem with the namespaces before this patch is that they removed this functionality -- there exists strings that I cannot represent with symbols. The schema stuff that I am doing is now introducing the first use where I need that functionality back. We should like take this conversation online sometime next week, since the PR is part of a chain needed in refactoring the way schema and the interpreter work.

apaszke · 2018-07-01T17:21:50Z

there exists strings that I cannot represent with symbols

Why is that? What if we simply allowed using a C++-like convention for the global namespace (think ::name)? I actually quite like the fact that our symbols are neatly organized into namespaces right now, as it makes the code that works with attrs/ops cleaner.

zdevito · 2018-07-01T20:34:36Z

Ok, I updated this so that namespaces are always required, which I believe addresses @ezyang 's concerns. I still think this is worth landing because it simplifies the interned_strings implementation.

torch/csrc/jit/test_jit.cpp

torch/csrc/jit/interned_strings.cpp

This changes how Symbol is implemented. Now the namespace of a symbol is itself a Symbol in the 'namespaces' namespace. This allows Symbol to be used with arbitrary namespaces. This also simplifies the implementation. Like with string conversion, builtin primitives go through a fast path for namespace lookup while registered symbols require holding a lock and reading an array entry to lookup the namespace.

facebook-github-bot

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

…n with human-readable schema. (#8885) Summary: This is a series of two commits that should probably be read separately. They are stacked on top of #9018 since the second commit requires it for correctness. Commit 1 ======= This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process. * Switches schema over to parsed declarations, in the future this will allow something like: ``` registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) { ... }) ``` This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as long as meta-data for these ops (e.g. derivatives and size propagation routines). The declarations resemble those used by PythonArgParser but have been singificantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that. Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types. This removes the other way we encoded schema, and makes it easier to see what schema are registered. Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6 * Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself. * Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change. * Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier. Commit 2 ======= This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed. * Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info. * Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes. * Remove addInterpreterOpHandler in favor of global operator registry. * Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered. * Merge aten_schema into regsiter_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op. * Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors. In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first. * remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication * refactor stack manipulation functions into a separate header file. Pull Request resolved: #8885 Reviewed By: jamesr66a Differential Revision: D8751048 Pulled By: zdevito fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557

Summary: Context: I am updating jit::FunctionSchema to use `Symbol name;` rather than `std::string name`. Sometimes the name refers to a builtin thing like `prim::UnpackTuple`, sometimes to an aten operator like `aten::add`, and sometimes just to a raw string, like `my_method_foo` that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit. This changes how Symbol is implemented. Now the namespace of a symbol is optional and the namespaces themselves are Symbols. This allows Symbol to be used with arbitrary namespaces, and allows you to use Symbol as an simple interned string using via fromQualString and toQualString without :: in the string. This also simplifies the implementation. Like with string conversion, builtin primitives go through a fast path for namespace lookup while registered symbols require holding a lock and reading an array entry to lookup the namespace. Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed. Closes pytorch#9018 Reviewed By: SsnL Differential Revision: D8690449 Pulled By: zdevito fbshipit-source-id: b65ee57704641d7294fe115c5470cf55d406458f

…n with human-readable schema. (pytorch#8885) Summary: This is a series of two commits that should probably be read separately. They are stacked on top of pytorch#9018 since the second commit requires it for correctness. Commit 1 ======= This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process. * Switches schema over to parsed declarations, in the future this will allow something like: ``` registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) { ... }) ``` This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as long as meta-data for these ops (e.g. derivatives and size propagation routines). The declarations resemble those used by PythonArgParser but have been singificantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that. Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types. This removes the other way we encoded schema, and makes it easier to see what schema are registered. Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6 * Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself. * Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change. * Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier. Commit 2 ======= This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed. * Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info. * Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes. * Remove addInterpreterOpHandler in favor of global operator registry. * Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered. * Merge aten_schema into regsiter_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op. * Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors. In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first. * remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication * refactor stack manipulation functions into a separate header file. Pull Request resolved: pytorch#8885 Reviewed By: jamesr66a Differential Revision: D8751048 Pulled By: zdevito fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557

zdevito requested review from apaszke, colesbury, ezyang, gchanan and soumith as code owners June 29, 2018 04:37

zdevito force-pushed the pr/interned_strings_fix branch 3 times, most recently from 5d7b07d to d977863 Compare June 29, 2018 04:55

zdevito force-pushed the pr/interned_strings_fix branch from d977863 to 0e8b803 Compare June 29, 2018 05:50

facebook-github-bot reviewed Jun 29, 2018

View reviewed changes

zdevito mentioned this pull request Jun 29, 2018

Unify aten_dispatch and aten_schema into a single operator abstraction with human-readable schema. #8885

Closed

zdevito force-pushed the pr/interned_strings_fix branch from 0e8b803 to 9403e35 Compare July 1, 2018 20:30

ezyang reviewed Jul 5, 2018

View reviewed changes

torch/csrc/jit/test_jit.cpp Outdated

This comment was marked as off-topic.

Sign in to view

ezyang reviewed Jul 5, 2018

View reviewed changes

torch/csrc/jit/interned_strings.cpp Outdated

This comment was marked as off-topic.

Sign in to view

This comment was marked as off-topic.

Sign in to view

ezyang approved these changes Jul 5, 2018

View reviewed changes

zdevito force-pushed the pr/interned_strings_fix branch from 9403e35 to 27d7867 Compare July 6, 2018 00:45

facebook-github-bot reviewed Jul 6, 2018

View reviewed changes

facebook-github-bot closed this in 9aded43 Jul 6, 2018

Conversation

zdevito commented Jun 29, 2018

Uh oh!

zdevito commented Jun 29, 2018

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

ezyang commented Jun 29, 2018

Uh oh!

zdevito commented Jun 29, 2018

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

ezyang commented Jun 29, 2018

Uh oh!

zdevito commented Jun 29, 2018

Uh oh!

ezyang commented Jun 29, 2018

Uh oh!

ezyang commented Jun 29, 2018

Uh oh!

zdevito commented Jun 29, 2018

Uh oh!

apaszke commented Jul 1, 2018

Uh oh!

zdevito commented Jul 1, 2018

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants