CI/BUG: add native jobs for s390x, fix bug in pack_inner#30819
Conversation
|
@sandeepgupta12 ptal |
|
So, currently clang tests are failing on ppc64le and s390x, and clang and gcc tests with forced vectorization fail on s390x. Can we proceed with these workflows now or should tests be fixed first? I'm already looking into s390x failures. |
|
This change fixes s390x tests with forced zvector: I could also split it into a separate PR if you'd like. Now there are only some failures when clang is used. |
rgommers
left a comment
There was a problem hiding this comment.
Thanks @AlekseiNikiforovIBM. This is a bit tricky to review (let's blame the GitHub UI), so it'd be great if you could revert the name change and leave it at linux-ppc64le.yml for now. That will make the diff much smaller, and it'll be visible what you tweaked in the ppc64le job.
| fail-fast: false | ||
| matrix: | ||
| config: | ||
| - name: "GCC (ppc64le)" |
There was a problem hiding this comment.
This whole job matrix is ppc64le, so this name change seem a bit unnecessary
There was a problem hiding this comment.
It looks like it takes title of workflow file, and then job name. Without this change there will be two jobs named "Native ppc64le and s390x Linux Test / GCC", one for ppc64le and one for s390x.
There was a problem hiding this comment.
| - name: "GCC (ppc64le)" | |
| - name: "ppc64le/gcc - baseline(default)" |
I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.
| - name: "GCC with zvector (s390x)" | ||
| args: "-Dallow-noblas=false -Dcpu-baseline=vx" | ||
| - name: "clang with zvector (s390x)" | ||
| args: "-Dallow-noblas=false -Dcpu-baseline=vx" |
There was a problem hiding this comment.
We had two jobs under QEMU; four seems a little excessive - can you do GCC and Clang with zvector here (or vice versa?
There was a problem hiding this comment.
They produce different results, so I think it's better to have all 4 of them. Considering issues with clang, maybe clang jobs should be dropped until all issues when compiling with clang are resolved?
There was a problem hiding this comment.
With clang 18, ppc64le tests are passing. See here- https://github.com/numpy/numpy/actions/runs/21948675141/job/63393053952
There was a problem hiding this comment.
I don't see clang-18 being used there. I see gcc. From linked build log:
C compiler for the host machine: cc (gcc 13.3.0 "cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0")
C linker for the host machine: cc ld.bfd 2.42
C++ compiler for the host machine: c++ (gcc 13.3.0 "c++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0")
C++ linker for the host machine: c++ ld.bfd 2.42
Clang-20 logs for comparison:
C compiler for the host machine: clang-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)")
C linker for the host machine: clang-20 ld.bfd 2.42
C++ compiler for the host machine: clang++-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)")
C++ linker for the host machine: clang++-20 ld.bfd 2.42
A single PR seems fine to me, thanks. |
pack_inner
7f05a2d to
d3779d0
Compare
|
I've disabled clang builds and tests for s390x while there are still too many issues to be fixed. I think enabling them later would be a good choice. |
bae121c to
c9693ef
Compare
|
I've updated PR. I've used docker and fedora:43 image to build and test numpy with clang. Fedora has newer clang, and maybe clang is patched differently there, so there are no miscompilations detected by testsuite. Only 1 test failed and was needed to be fixed:
Luckily, on s390x a copy of output is available in Unfortunately, I don't think similar output is available for ppc64le. Hex output of I've also bumped zvector build to enable everything up to VXE2. There still were 65 failing tests for ppc64le when building with clang: I disabled ppc64le clang build for now. It'd be better to have those tests fixed or disabled first, or some other build environment used. @rgommers please take one more look. |
seiko2plus
left a comment
There was a problem hiding this comment.
Thank you for the fix and extending the CI. The work is good to me. I left some comments below — not necessary to be accomplished, and they can be done in another PR. Regarding the SIMD optimization comment I left, it's for reference — it can be done within this PR or within another PR using the current C SIMD interface or during the transition of moving to Google Highway.
| fail-fast: false | ||
| matrix: | ||
| config: | ||
| - name: "GCC (ppc64le)" |
There was a problem hiding this comment.
| - name: "GCC (ppc64le)" | |
| - name: "ppc64le/gcc - baseline(default)" |
I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.
| @@ -1632,6 +1652,11 @@ pack_inner(const char *inptr, | |||
| va = npyv_rev64_u8(va); | |||
There was a problem hiding this comment.
I think we can do better than just a fix. We can avoid using npyv_rev64_u8 on bigendian if the required order is PACK_ORDER_BIG. So we reverse if the required order is PACK_ORDER_BIG and byte order is NPY_LITTLE_ENDIAN, and we swap if the byte order is NPY_BIG_ENDIAN.
There was a problem hiding this comment.
How would that be done? I don't see how this place can be improved. Maybe such change could be added later in a separate PR?
There was a problem hiding this comment.
What I mean is simply no need to swap what's already mistakenly swapped! So this fix is like performing "!!". The core of the problem is in:
numpy/numpy/_core/src/multiarray/compiled_base.c
Lines 1595 to 1599 in 245e20b
A fix should be as simple as:
if ((order == PACK_ORDER_BIG) != (NPY_BYTE_ORDER == NPY_BIG_ENDIAN)) {
v0 = npyv_rev64_u8(v0);
v1 = npyv_rev64_u8(v1);
v2 = npyv_rev64_u8(v2);
v3 = npyv_rev64_u8(v3);
}Am I missing something in here?
There was a problem hiding this comment.
Yes, it doesn't work as you'd expect it due to npyv_tobits_b8 calls.
It looks like PACK_ORDER_BIG is bit-order parameter, not byte-order parameter.
Let's take values from testcase I'm getting:
v[0]: 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01
v[1]: 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01
v[2]: 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01
v[3]: 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01
bb[0]: 0x0000000000009010
bb[1]: 0x000000000000d050
bb[2]: 0x000000000000b030
bb[3]: 0x000000000000f070
arr[0]: 0xf070b030d0509010
If excessive byteswapping is done where suggested and removed from other places, result would be:
v[0]: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01 0x00 0x00 0x00 0x00
v[1]: 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01 0x00 0x00 0x00 0x00
v[2]: 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01 0x00 0x00 0x00 0x00
v[3]: 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01 0x00 0x00 0x00 0x00
bb[0]: 0x0000000000000908
bb[1]: 0x0000000000000b0a
bb[2]: 0x0000000000000d0c
bb[3]: 0x0000000000000f0e
I don't see how from those bb values I can easily get correct value of arr[0].
It's actually better to rely on
It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug! |
fde7603 to
2e215d9
Compare
|
Thanks! I've incorporated new names for workflow and jobs. I've also rebased this PR due to |
343ed1b to
d7f772c
Compare
Yes, it could be a clang bug. I've seen similar failures on s390x on Ubuntu 24.04 with clang 20. But on Fedora 43 with clang 20 it didn't reproduce for me. That's why I'm using docker and Fedora 43 for clang builds on s390x. I don't have a ppc64le machine at the moment to debug this issue. I'll have to leave it to others. |
I will give it a try for ppc64le. |
Add parameters for runner and compiler
Only 16-byte vectorization path is tested for big endian.
Too many tests still fail and need investigation.
Newer clang from Fedora 43 builds numpy and passes tests better than clang from Ubuntu 24.04
AT_HWCAP is now a hex number and needs to be decoded. https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=e5893e6349541d871e8a25120bca014551d13ff5 Use /prc/cpuinfo to obtain same information.
Too many failing tests which need to be looked at.
d7f772c to
9a03276
Compare
|
I've rebased this PR on new |
ee361ad to
a93e9ec
Compare
a93e9ec to
ff5d22a
Compare
|
@sandeepgupta12 has found that adding The need for |
It is already set for numpy itself, use it for compiled modules as well. With this change no -fno-strict-aliasing is needed for clang on s390x and ppc64le
|
It looks like numpy is compiled with I've also fixed one clang warning on s390x about undefined behaviour in negative number shifting. |
|
Is anything blocking this PR? |
My apologies. It was the bottleneck on my part. I merged this PR to speed things up and land the fix, and since the changes wouldn't affect performance on non-bigendian archs, I think it's fine to apply the suggestion I made in here #30819 (comment) in another PR.
The fix is good to me. Yes, the flag should be passed to the C/C++ source already defined in Lines 48 to 51 in 245e20b Thank you, @AlekseiNikiforovIBM. |
It no longer contains only ppc64le workflows, use more common name. actionspz could work as a project name that provides runners. Follow up to numpy#30819.
It no longer contains only ppc64le workflows, use more common name. Follow up to numpy#30819.
CI/BUG: add native jobs for s390x, fix bug in `pack_inner` (#30819)
Rename workflow linux-ppc64le.yml into linux-actionspz.yml.
Copy and update ppc64le workflow for s390x.
Disable qemu-based workflows for s390x.
Fix big endian issue in
pack_inner.Closes: #30817
@mkumatag @rgommers