Skip to content

Allow arbitrary namespaces for Symbols#9018

Closed
zdevito wants to merge 1 commit intopytorch:masterfrom
zdevito:pr/interned_strings_fix
Closed

Allow arbitrary namespaces for Symbols#9018
zdevito wants to merge 1 commit intopytorch:masterfrom
zdevito:pr/interned_strings_fix

Conversation

@zdevito
Copy link
Contributor

@zdevito zdevito commented Jun 29, 2018

Context: I am updating jit::FunctionSchema to use Symbol name; rather than std::string name. Sometimes the name refers to a builtin thing like prim::UnpackTuple, sometimes to an aten operator like aten::add, and sometimes just to a raw string, like my_method_foo that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit.

This changes how Symbol is implemented. Now the namespace of a symbol
is optional and the namespaces themselves are Symbols.
This allows Symbol to be used with arbitrary namespaces, and allows
you to use Symbol as an simple interned string using via fromQualString
and toQualString without :: in the string. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.

Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed.

@zdevito zdevito force-pushed the pr/interned_strings_fix branch 3 times, most recently from 5d7b07d to d977863 Compare June 29, 2018 04:55
@zdevito
Copy link
Contributor Author

zdevito commented Jun 29, 2018

@ezyang can you take a look since I think you were the last to work on interned_strings.cpp

@zdevito zdevito force-pushed the pr/interned_strings_fix branch from d977863 to 0e8b803 Compare June 29, 2018 05:50
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Copy link
Contributor

ezyang commented Jun 29, 2018

If it's just a raw string, do you really want it to be interned? I'm not convinced that there isn't secretly a namespace lurking in the background here.

@zdevito
Copy link
Contributor Author

zdevito commented Jun 29, 2018

@ezyang This is for FunctionSchema, where sometimes I want a raw string, because it is referring to a function name and I want it reported to the user as they wrote it. Other times it is used as the schema for an operator and I need the Symbol for the operator to be fully qualified. More generally, I think having symbol act as a true interned string is useful because they are useful in various places.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Copy link
Contributor

ezyang commented Jun 29, 2018

OK, but for example, say a user manages to hand you a function named "onnx::Blah" (I'm sure, somewhere, Python lets you put awful names on functions). Aren't you going to interpret this as an actually ONNX namespaced thing?

@zdevito
Copy link
Contributor Author

zdevito commented Jun 29, 2018

Those two uses are already in different namespaces based on how those values are used. For a function, the name field is used to look it up in the lexical environment. Builtin operators are looked up in the operator registry. But FunctionSchema is used to represent the valid arguments in either case. This way, the code doesn't have to convert from interned string to std::string to use this data structure. Similarly, we really don't need the attr:: namespace either. Only onnx:: and aten:: actually need disambiguation.

@ezyang
Copy link
Contributor

ezyang commented Jun 29, 2018

Let me build this and see if I can come up with an example of what I'm nervous about.

@ezyang
Copy link
Contributor

ezyang commented Jun 29, 2018

I think my primary objection to the changes in this patch is that it makes it easier to do the wrong thing.

In the previous world, namespaces were mandatory. If you called a function f(Symbol) from Python, and forgot to attach a namespace, the parser would shout at you. In this PR, if you pass in an un-namespaced symbol, we will happily accept it, and you won't learn about the error of your ways until much later in the pipeline, at which point the backtrace when the bad symbol was passed in is lost. The Python typechecker is not going to give you any help here. (I guess you can fix this by removing the implicit Python string to Symbol conversion from the interface.)

In the previous world, you could have whatever junk you wanted after the namespace. attr::blah::baz is OK; because the xxx:: prefix is mandatory, there is never any ambiguity: everything after the first :: is the actual string in question. So you don't have to worry about accepting strings from users, because they cannot introduce ambiguity by passing you something weird you don't expect. In the new world, if you accept an unqualified name, you had better check it doesn't contain any ::, because if it does, you will get a symbol that is in the wrong namespace (it "should" have been in the none namespace, but it's actually in some random namespace "xxx::").

In the previous world, the mandatory namespace was used as a way to say what strings mean. Yes, technically attr didn't need to be namespaced, but you were never confused about what an attr-namespace string was: it was a key for an attribute key map. Now there's a pile of non-namespaced identifiers. What do they mean? Who knows: you have to look at the call sites.

(I remember in our original discussion, you were not 100% pleased with the namespacing refactor, and suggested we refactor Symbol into an interned string concept, which is just plain interned strings with no semantics, and a symbol concept, where we have semantics. I still think this is a good idea; a definite improvement over the old design. Though, I'm not sure you need it for your use case.)

@zdevito
Copy link
Contributor Author

zdevito commented Jun 29, 2018

Now there's a pile of non-namespaced identifiers. What do they mean? Who knows: you have to look at the call sites.

I think we conceptually disagree here. I view Symbols as interned strings, full stop. They are a version of the string that makes switch statements possible in C++. I don't want them to be any more or less than that. The problem with the namespaces before this patch is that they removed this functionality -- there exists strings that I cannot represent with symbols. The schema stuff that I am doing is now introducing the first use where I need that functionality back. We should like take this conversation online sometime next week, since the PR is part of a chain needed in refactoring the way schema and the interpreter work.

@apaszke
Copy link
Contributor

apaszke commented Jul 1, 2018

there exists strings that I cannot represent with symbols

Why is that? What if we simply allowed using a C++-like convention for the global namespace (think ::name)? I actually quite like the fact that our symbols are neatly organized into namespaces right now, as it makes the code that works with attrs/ops cleaner.

@zdevito zdevito force-pushed the pr/interned_strings_fix branch from 0e8b803 to 9403e35 Compare July 1, 2018 20:30
@zdevito
Copy link
Contributor Author

zdevito commented Jul 1, 2018

Ok, I updated this so that namespaces are always required, which I believe addresses @ezyang 's concerns. I still think this is worth landing because it simplifies the interned_strings implementation.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This changes how Symbol is implemented. Now the namespace of a symbol
is itself a Symbol in the 'namespaces' namespace. This allows Symbol to be used
with arbitrary namespaces. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.
@zdevito zdevito force-pushed the pr/interned_strings_fix branch from 9403e35 to 27d7867 Compare July 6, 2018 00:45
Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zdevito has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Jul 10, 2018
…n with human-readable schema. (#8885)

Summary:
This is a series of two commits that should probably be read separately. They are stacked on top of #9018 since the second commit requires it for correctness.

Commit 1
=======

This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process.

* Switches schema over to parsed declarations, in the future this will allow something like:

```
  registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) {
    ...
  })
```

This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as long as meta-data for these ops (e.g. derivatives and size propagation routines).

The declarations resemble those used by PythonArgParser but have been singificantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that.

Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types.

This removes the other way we encoded schema, and makes it easier to see what schema are registered.

Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6

* Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself.

* Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change.

* Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier.

Commit 2
=======

This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed.

* Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info.

* Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes.

* Remove addInterpreterOpHandler in favor of global operator registry.

* Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered.

* Merge aten_schema into regsiter_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op.

* Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors.  In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first.

* remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication

* refactor stack manipulation functions into a separate header file.
Pull Request resolved: #8885

Reviewed By: jamesr66a

Differential Revision: D8751048

Pulled By: zdevito

fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
Context: I am updating jit::FunctionSchema to use `Symbol name;` rather than `std::string name`. Sometimes the name refers to a builtin  thing like `prim::UnpackTuple`, sometimes to an aten operator like `aten::add`, and sometimes just to a raw string, like `my_method_foo` that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit.

This changes how Symbol is implemented. Now the namespace of a symbol
is optional and the namespaces themselves are Symbols.
This allows Symbol to be used with arbitrary namespaces, and allows
you to use Symbol as an simple interned string using via fromQualString
and toQualString without :: in the string. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.

Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed.
Closes pytorch#9018

Reviewed By: SsnL

Differential Revision: D8690449

Pulled By: zdevito

fbshipit-source-id: b65ee57704641d7294fe115c5470cf55d406458f
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
…n with human-readable schema. (pytorch#8885)

Summary:
This is a series of two commits that should probably be read separately. They are stacked on top of pytorch#9018 since the second commit requires it for correctness.

Commit 1
=======

This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process.

* Switches schema over to parsed declarations, in the future this will allow something like:

```
  registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) {
    ...
  })
```

This will allow the scalable registration of intrinsics for lists, tuples, and other ops, as long as meta-data for these ops (e.g. derivatives and size propagation routines).

The declarations resemble those used by PythonArgParser but have been singificantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that.

Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types.

This removes the other way we encoded schema, and makes it easier to see what schema are registered.

Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6

* Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself.

* Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change.

* Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier.

Commit 2
=======

This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed.

* Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info.

* Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes.

* Remove addInterpreterOpHandler in favor of global operator registry.

* Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered.

* Merge aten_schema into regsiter_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op.

* Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors.  In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first.

* remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication

* refactor stack manipulation functions into a separate header file.
Pull Request resolved: pytorch#8885

Reviewed By: jamesr66a

Differential Revision: D8751048

Pulled By: zdevito

fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants