Add the flops profiler which measures the time, number of estimated flops and parameters of each module in a model #544
Conversation
jeffra left a comment
I see DeepSpeedExamples (DSE) has lots of changes here. Is this because we want to point to an updated version of DSE that uses the profiler, or just to a different commit than what the current master branch points to?
```
@@ -0,0 +1,43 @@
from functools import partial
```
I wonder if we should move these examples to DSE? Or maybe keep them in the DeepSpeed repo but not in the deepspeed package? I am not sure we want them to be installed with the library itself (e.g., hidden away like /usr/local/lib/python3.6/dist-packages/deepspeed), where they are hard to find and run directly.
I moved the examples to the README and tests/unit/test_flops_profiler.py.
```
@@ -0,0 +1,43 @@
from functools import partial
```
I love that we have these examples. We might want to move them to the DSE repo? And we could also include them in a page on readthedocs if you think that would be good.
Removed the examples. I will add them to DSE and readthedocs.
tests/small_model_debugging/run.sh
Outdated
```
@@ -0,0 +1,12 @@
#! /bin/bash
```
Can we move this into our unit tests?
Done
tests/small_model_debugging/run.sh
Outdated
```
export WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# python test_model.py
python test_model.py --flops_profiler true --profile_start_step=5 --profile_end_step=8
```
It would be great to have some unit tests with models that we know the expected flop counts for. Then we could run the flops profiler to ensure our counts stay consistent?
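A sketch of what such a test could look like, assuming a hypothetical FlopsProfiler-style API (the import path and method names here are assumptions, not necessarily this PR's final interface):

```python
import torch
import torch.nn as nn
from deepspeed.profiling.flops_profiler import FlopsProfiler  # assumed path


def test_linear_flops():
    # A 10 -> 20 Linear on a batch of 4 performs exactly
    # 4 * 10 * 20 = 800 multiply-accumulates, so the profiler's
    # count can be asserted precisely.
    model = nn.Linear(10, 20)
    prof = FlopsProfiler(model)
    prof.start_profile()
    model(torch.randn(4, 10))
    flops = prof.get_total_flops()  # hypothetical accessor
    prof.end_profile()
    assert flops == 800
```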
```
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument('--zero', type=int, default=0)
parser.add_argument('--flops_profiler', type=bool, default=False)
```
Same comment, it would be awesome to move this into our unit tests (tests/unit/).
```
@@ -0,0 +1 @@
from .profiler import get_model_profile, add_profile_methods, print_model_profile, print_model_aggregated_profile, flops_to_string, duration_to_string, params_to_string
```
With this many exports, we could also just from .profiler import * and then in profiler be sure to follow public/private naming conventions. Methods starting with _ won't be included.
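For illustration, the convention described above might look like this (a sketch; _patch_functionals is a hypothetical private helper):

```python
# flops_profiler/profiler.py
def get_model_profile(model):
    """Public: picked up by `from .profiler import *`."""


def _patch_functionals():
    """Private: the leading underscore keeps it out of the star-import."""


# flops_profiler/__init__.py then shrinks to a single line:
# from .profiler import *
```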
I put all the functions in a FlopsProfiler class, which will be the only thing to import.
```
# FC
F.linear = wrapFunc(F.linear, linear_flops_compute)
```
Will the below wraps occur each time the module is imported? Can things get double-wrapped?
I moved wrapping the functionals into profiler.start() and added unwrapping in profiler.end().
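A minimal, self-contained sketch of that pattern (the bookkeeping dict and helper bodies here are simplified stand-ins for the diff's wrapFunc and linear_flops_compute, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

_originals = {}
total_macs = 0


def linear_flops_compute(input, weight, bias=None):
    # MAC count as in the diff: numel(input) * out_features.
    return torch.numel(input) * weight.shape[0]


def wrapFunc(func, flops_compute):
    # Simplified form of the diff's wrapper: accumulate the count,
    # then call through to the real functional.
    def wrapped(*args, **kwargs):
        global total_macs
        total_macs += flops_compute(*args, **kwargs)
        return func(*args, **kwargs)
    return wrapped


def _wrap_functionals():
    # Save the unwrapped functional first; the membership test also
    # guards against double-wrapping if start() is called twice.
    if 'linear' not in _originals:
        _originals['linear'] = F.linear
        F.linear = wrapFunc(F.linear, linear_flops_compute)


def _unwrap_functionals():
    # Restore the saved original so the patch is active only while
    # profiling runs.
    if 'linear' in _originals:
        F.linear = _originals.pop('linear')
```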
```
F.linear = wrapFunc(F.linear, linear_flops_compute)

# convolutions
F.conv1d = wrapFunc(F.conv1d, conv_flops_compute)
```
Do we also unwrap these functionals after profiling is complete? The polite thing might be to only wrap while profiling is active, if that's possible. I see we do that with modules in end_profile?
```
F.avg_pool2d = wrapFunc(F.avg_pool2d, pool_flops_compute)
F.avg_pool3d = wrapFunc(F.avg_pool3d, pool_flops_compute)
F.max_pool1d = wrapFunc(F.max_pool1d, pool_flops_compute)
F.max_pool2d = wrapFunc(F.max_pool2d, pool_flops_compute)
```
It would be cool to have a data structure that maps a list of functionals to the compute wrapper, and then at wrap time the engine code just traverses that data structure to do this work.
Monkey patching does not work if the target function is referenced through a loop variable: reassigning the item rebinds the loop variable rather than replacing the attribute on the module.
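To illustrate the pitfall (a sketch; make_counting_wrapper is a stand-in for the diff's wrapFunc):

```python
import torch.nn.functional as F


def make_counting_wrapper(func):
    # Stand-in for the diff's wrapFunc.
    def wrapped(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapped


# BROKEN: reassigning fn only rebinds the loop variable, not the
# attribute on torch.nn.functional, so nothing is actually patched.
for fn in [F.linear, F.conv1d]:
    fn = make_counting_wrapper(fn)

# Patching by attribute *name* is one possible workaround, at the
# cost of stringly-typed names:
for name in ['linear', 'conv1d']:
    setattr(F, name, make_counting_wrapper(getattr(F, name)))
```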
```
return logits, probs


if __name__ == "__main__":
```
This looks left over from development.
Removed
deepspeed/runtime/engine.py
Outdated
```
from ..ops.adam import DeepSpeedCPUAdam
from ..ops.adam import FusedAdam

from deepspeed.profiling.flops_profiler.profiler import add_profile_methods, print_model_profile, print_model_aggregated_profile, flops_to_string, params_to_string
```
Since we manage exports within each subpackage, we can just do from deepspeed.profiling import * to expose the high-level API, I think. That's a bit nicer than having a long list like this.
Since we do have this pattern of exporting a lot of methods, what about wrapping them up in a FlopsProfiler class?
Done
samyam left a comment
I want to just leave a high-level review here based on some of our earlier discussion. I think it would be very cool to do the profiling using profiler.start() and profiler.stop() instead of get_model_profile(...). This is aligned more closely with general training workflows, and additionally it would not require the user to create an input constructor. Overall, it would be a more natural and seamless way to use the profiler. What do you think @cli99?
This would also go towards addressing the import side effect we chatted about, @cli99. It might be a good opportunity to use contextlib?
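One way the contextlib idea could look (a sketch; start_profile/end_profile are the method names assumed elsewhere in this thread):

```python
from contextlib import contextmanager


@contextmanager
def flops_profiling(profiler):
    # Patching happens only inside the with-block, which also avoids
    # the import-time side effect mentioned above.
    profiler.start_profile()
    try:
        yield profiler
    finally:
        profiler.end_profile()


# usage:
#   with flops_profiling(prof):
#       model(inputs)
```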
```
return p


def linear_flops_compute(input, weight, bias=None):
```
Do we wanna account for the bias flops when it is not None?
What would be the flops for computing the bias? I thought it was zero.
You can count the dimension of the bias (i.e., out_features), which is the number of additions. But this will probably be very small, so it's okay not to count it.
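Concretely, counting those bias additions might look like this (a sketch extending the diff's linear_flops_compute; folding additions into a MAC count is a judgment call, per the discussion below):

```python
import torch


def linear_flops_compute(input, weight, bias=None):
    # MACs for the matmul: each input element is multiplied against
    # out_features rows of the weight.
    out_features = weight.shape[0]
    macs = torch.numel(input) * out_features
    # The bias adds one addition per output element -- tiny relative
    # to the matmul, which is why it is often ignored.
    bias_adds = 0
    if bias is not None:
        bias_adds = (torch.numel(input) // input.shape[-1]) * out_features
    return macs + bias_adds
```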
```
def linear_flops_compute(input, weight, bias=None):
    out_features = weight.shape[0]
    return torch.numel(input) * out_features
```
Shouldn't we multiply this by 2 to account for multiply-add?
The flops profiler outputs MACs, and the user can multiply that by 2 or by whatever factor they prefer. The throughput of a module is computed in TFLOPS, where I do MACs * 2 / time.
I see, but I was thinking that this is not always a MAC. I am not sure if we want the profiler to take into account point-wise operations like bias-add or normalize.
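For reference, the MACs-to-TFLOPS conversion described above works out to:

```python
def throughput_tflops(macs, duration_s):
    # FLOPs ~= 2 * MACs (one multiply plus one add), so
    # TFLOPS = MACs * 2 / time / 1e12.
    return macs * 2 / duration_s / 1e12
```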
```
for step, batch in enumerate(data_loader):
    # start profiling at training step "profile_step"
    if step == profile_start_step:
        model.start_profile()
```
I'm wondering if it's more transparent to not overwrite the model into a profiler, and to keep the model intact? Something like:

```
model_profiler = prof.add_profile_method(model)
model_profiler.start()
y = model(x)
model_profiler.stop()
```

Also, for getting the info we can then do:

```
model_profiler.print_all()
flops = model_profiler.get_flops()
```

I actually don't know which one is better. Maybe we should have a discussion on this?
The README shown here is out of date; the latest code does this. Happy to chat more about it.
Closing this one after rebasing/squashing, moved to #664.