🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124257
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit aa84d50 with merge base 0f6ce45.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot merge -f "Lint + MacOS builds are green"

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team

Hi @malfet, how can I test this PR on aarch64 Linux?

@snadampal here is the fix: #124511. But we really need some sort of CI to spot these earlier than the nightly builds. Right now it's tested on M1, which has the same CPU architecture but a different default compiler that is less stringent about type conversions.

My top priority is to get my CI PR merged ASAP. |
By unrolling the middle loop by 16 elements and using NEON to decode packed int4 to float32.
Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.

Before/after performance running stories110M on M2Pro

| eager (before) | eager (after) | compile (before) | compile (after) |
| ---- | --- | -- | -- |
| 28 | 57 | 31 | 104 |

Pull Request resolved: pytorch#124257
Approved by: https://github.com/mikekgfb
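
For readers unfamiliar with the technique, here is a minimal portable sketch of what "decode packed int4 to float32, 16 elements at a time" can look like. It is scalar C++ rather than NEON, and it is not the code from this PR. The function names (`decode16_int4_to_f32`, `dot_int4`), the nibble order (low nibble first), the offset of 8, and the single `scale`/`zero` pair are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout (an assumption, not taken from the PR): each byte holds
// two 4-bit values, low nibble first, stored with an offset of 8 so that the
// nibble range 0..15 maps to the signed range -8..7.
inline void decode16_int4_to_f32(const std::uint8_t* packed, float scale,
                                 float zero, float* out) {
  // 8 bytes -> 16 floats. A NEON version would do this with vector
  // instructions; the fixed trip count lets a compiler unroll this loop.
  for (int i = 0; i < 8; ++i) {
    const int lo = static_cast<int>(packed[i] & 0x0F) - 8;
    const int hi = static_cast<int>(packed[i] >> 4) - 8;
    out[2 * i] = static_cast<float>(lo) * scale + zero;
    out[2 * i + 1] = static_cast<float>(hi) * scale + zero;
  }
}

// Dot product of k activations with k packed int4 weights, processed in
// blocks of 16 elements. k must be a multiple of 16 in this sketch.
inline float dot_int4(const float* x, const std::uint8_t* w_packed,
                      std::size_t k, float scale, float zero) {
  assert(k % 16 == 0);
  float acc = 0.0f;
  float w[16];  // decoded weights for the current block
  for (std::size_t j = 0; j < k; j += 16) {
    decode16_int4_to_f32(w_packed + j / 2, scale, zero, w);
    for (int t = 0; t < 16; ++t) {
      acc += x[j + t] * w[t];
    }
  }
  return acc;
}
```

Processing a fixed block of 16 keeps the decoded weights in a small local buffer, which is roughly the role that vector registers play in a NEON implementation.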
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10