Enable eager execution #3306

JackCaoG merged 10 commits into pytorch:master from aws-rhsoln:eager_execution
Conversation
Thanks for contributing. I didn't look at the change in detail yet, but what you are trying to achieve is very similar to the existing OpByOp mode.
Yeah, it is similar, except that this would help users execute eagerly, which they are used to doing in frameworks like PT. The issue with OpByOp mode is that it would break the lazily collected graph into ops only when the user wants to print. This would result in re-executions in cases like the one sketched below: it would cut 2 graphs, one having 8 nodes (each linear layer creates 2 nodes) and the other having 6 nodes, and each graph would then be executed op by op. This would result in duplicate executions.
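The original snippet appears to have been lost in extraction; here is a minimal sketch of the kind of case being described, assuming seven linear layers with two prints (the 8/6 node counts follow from each linear layer contributing roughly 2 nodes, matmul plus bias add):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
layers = nn.Sequential(*[nn.Linear(8, 8) for _ in range(7)]).to(device)

x = torch.randn(1, 8, device=device)
for i, layer in enumerate(layers):
    x = layer(x)
    if i == 3:
        print(x)  # first cut: 4 linear layers x 2 nodes = an 8-node graph
print(x)          # second cut: the remaining 3 layers = a 6-node graph
```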
I feel like what you want is essentially the least amount of wait time when the user wants to inspect a tensor value. Can we just spawn a subprocess that calls …? In LTC, @wconstab has a plan to make …
I think this approach would work fine for lazy-tensor-core too. We could discuss in the lazy_tensor design doc what kind of user API (env var or otherwise) we want to shoot for. An env var seems simplest for starters for torch-xla.
@aws-rhsoln There still seems to be some kind of merge conflict.
Will try to resolve it and send in a revision |
Ran the BERT-large model for 10 iterations. Here are the metrics: …

For 2-layer BERT: …
- formatting changes
- bug fix
- renamed to eager_debug and used device data check
- replaced the api with the env variable
JackCaoG left a comment
Mostly lgtm, minor nits
```cpp
  return DeviceContextArena::Get()->GetRunningSeed(device);
}

bool XLATensor::UseEagerDebugMode() {
```
Hmm, this doesn't need to be a class method. I don't want to block you because of this, but maybe I will fix it later.
ML scientists, during their initial model development phase, tend to use tools like PDB to debug their models. They step through their models and print arbitrary tensors to check their values and see if they look correct. Currently, if the user wants to debug issues in their model by printing the output of intermediate layers, or by using a debugger like pdb to step through and investigate tensors, they incur an expensive intermediate graph compilation and execution. Adding the ability to run operations eagerly allows users to debug operations before running their expensive training jobs. Consider the example below:
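The example itself did not survive extraction; this is a minimal sketch reconstructed from the description that follows (two prints cutting two graphs of size 2 each):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)

y = x + 1    # node 1 of graph 1
z = y * 2    # node 2 of graph 1
print(z)     # forces compile + execution of graph 1 (graph size 2)

w = z - 1    # node 1 of graph 2 (builds on z's materialized device data)
v = w * 3    # node 2 of graph 2
print(v)     # forces compile + execution of graph 2 (graph size 2)
```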
In the above example, there are two graphs, both having a graph size of 2. In larger models these intermediate graph sizes can be large and incur high compile and execution times. Also, in the next iteration, if the user puts the mark_step at a different location, it would result in a different graph and hence a cache miss. This would force the user to keep breakpoints at fixed locations, which is a constraint. If each op is compiled and executed independently, the position of the breakpoint doesn't matter anymore, and there would be a fixed number of compilations and executions. Moreover, with per-op execution the chances of hitting the cache increase, as the layers in large models repeat, thereby avoiding a graph compile. With the xrt_server preserving the compilation cache across training runs, compiling per op can result in a reusable cache, thereby further reducing compilation time during debugging.

Pitch
We would like users to be able to enable eager execution by calling an API. This API needs to be called before any XLA tensor is created. The API is similar to what TensorFlow had in 1.15 for enabling eager execution.
The number of graph compilations and executions remains the same and does not depend on where the mark_step is inserted.

Note: Per-op execution results in a higher end-to-end execution time compared to lazy mode; however, for initial development, when the number of tensor prints by the user is going to be high, avoiding the intermediate graph compiles should offset this cost.
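A hedged sketch of what enabling the mode could look like in practice. Per the commit list above, the API call was replaced with an environment variable; the exact variable name used here (`XLA_USE_EAGER_DEBUG_MODE`) is an assumption based on the `eager_debug` rename and is not confirmed by this thread.

```python
import os

# Assumed env var name; it must be set before any XLA tensor is created,
# so set it before importing torch_xla (or in the shell).
os.environ["XLA_USE_EAGER_DEBUG_MODE"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.ones(2, 2, device=device)
u = t * 3     # compiled and executed as its own op
print(u)      # no large intermediate graph to compile at print time
```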