[Compile] Add NEON implementation for bf16->fp32 cast#134297
Conversation
Let's trigger a dashboard run for this.

Sure, https://github.com/pytorch/pytorch/actions/runs/10529131469

[Edit] Realized I did this change before the split, so alas it's not really usable. Let's test in trunk.
```cpp
int32x4_t shift = vdupq_n_s32(16);
auto u16_low1 = vget_low_u16(u16_8);
auto u16_high1 = vget_high_u16(u16_8);
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_low1), shift));
```
Seems reasonable, but if the input is interleaved then for the upper half you can just do a vectorized (input & 0xFF00) and the reinterpret, saving the get_high and movl instructions. For the lower half you would still need those.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "This is weird: workflow dispatch jobs do not show up in the signal box, but still delay the merge"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This changes the assembly generated for the following routine
from
to
And as a result speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16` from 33 to 90 tokens/sec.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10