-
Notifications
You must be signed in to change notification settings - Fork 4.7k
Fix for AMD unit tests #2047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for AMD unit tests #2047
Conversation
mrwyattii
commented
Jun 23, 2022
- Add asserts to prevent FP16 with CPUAdam on AMD CPUs (not supported)
- AMD runners now use a venv, removed sudo from pip install
- Add skips to unit tests for CPUAdam
| for param_id, p in enumerate(group['params']): | ||
| if p.dtype == torch.half: | ||
| logger.warning( | ||
| "FP16 params for CPUAdam may not work on AMD CPUs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrwyattii @RezaYazdaniAminabadi Can you please elaborate a bit more on what error/issue you observed when you added this warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rraminen, Hi @jithunnair-amd
I think CPU-Adam should be supported to some extent but some overlapping features for copying data from CPU to GPU may not be working properly. Also, I remember that I could not find the corresponding vector instructions for handling the fp16 data-type. So, we may be able to remove this and let it run safely only for restricted cases. Or, if you can help verify the functionality that we need for CPU-Adam is supported, we can have another PR to add this?
Thanks,
Reza
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RezaYazdaniAminabadi You mentioned: "I could not find the corresponding vector instructions for handling the fp16 data-type." Would that be the ones used in SIMD_STORE2 and SIMD_LOAD2 here: https://github.com/microsoft/DeepSpeed/blob/master/csrc/includes/simd.h#L55?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RezaYazdaniAminabadi,
Below are the only references to FP16 instructions in simd.h:
https://github.com/microsoft/DeepSpeed/blob/a8b821535aa0b254efa681d51b4951734ca021cc/csrc/includes/simd.h#L36
https://github.com/microsoft/DeepSpeed/blob/a8b821535aa0b254efa681d51b4951734ca021cc/csrc/includes/simd.h#L58
Those conversion instructions are part of AVX512F which is supported in Zen4 as it is very basic AVX512 feature. Do you want to qualify that this code works fine on newer than AMD Zen 4 CPUs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the delay.
@jithunnair-amd, I think except the loading part there were some casting operations used which is kind of hacky and I had to do that to go around loading the data right. Honestly, I have not tested this part of the code recently but I would be happy to change this part based on your suggestion (and remove the warning or give warning for older architectures as @rraminen mentioned). Unfortunately, I don't have AMD CPUs to try this so I would need your help to verify this.
Thanks,
Reza