Skip to content

CI/BUG: add native jobs for s390x, fix bug in pack_inner#30819

Merged
seiko2plus merged 15 commits into
numpy:mainfrom
AlekseiNikiforovIBM:s390x_ci
Apr 2, 2026
Merged

CI/BUG: add native jobs for s390x, fix bug in pack_inner#30819
seiko2plus merged 15 commits into
numpy:mainfrom
AlekseiNikiforovIBM:s390x_ci

Conversation

@AlekseiNikiforovIBM

@AlekseiNikiforovIBM AlekseiNikiforovIBM commented Feb 11, 2026

Copy link
Copy Markdown
Contributor

Rename workflow linux-ppc64le.yml into linux-actionspz.yml.
Copy and update ppc64le workflow for s390x.
Disable qemu-based workflows for s390x.

Fix big endian issue in pack_inner.

Closes: #30817

@mkumatag @rgommers

@AlekseiNikiforovIBM AlekseiNikiforovIBM marked this pull request as draft February 11, 2026 15:05
@mkumatag

Copy link
Copy Markdown

@sandeepgupta12 ptal

@AlekseiNikiforovIBM

AlekseiNikiforovIBM commented Feb 11, 2026

Copy link
Copy Markdown
Contributor Author

So, currently clang tests are failing on ppc64le and s390x, and clang and gcc tests with forced vectorization fail on s390x.

Can we proceed with these workflows now or should tests be fixed first? I'm already looking into s390x failures.

@AlekseiNikiforovIBM AlekseiNikiforovIBM marked this pull request as ready for review February 11, 2026 16:39
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

This change fixes s390x tests with forced zvector:

663107d

I could also split it into a separate PR if you'd like.

Now there are only some failures when clang is used.

@rgommers rgommers left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AlekseiNikiforovIBM. This is a bit tricky to review (let's blame the GitHub UI), so it'd be great if you could revert the name change and leave it at linux-ppc64le.yml for now. That will make the diff much smaller, and it'll be visible what you tweaked in the ppc64le job.

Comment thread .github/workflows/linux-ppc64le.yml Outdated
fail-fast: false
matrix:
config:
- name: "GCC (ppc64le)"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole job matrix is ppc64le, so this name change seem a bit unnecessary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it takes title of workflow file, and then job name. Without this change there will be two jobs named "Native ppc64le and s390x Linux Test / GCC", one for ppc64le and one for s390x.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: "GCC (ppc64le)"
- name: "ppc64le/gcc - baseline(default)"

I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.

Comment thread .github/workflows/linux-actionspz.yml Outdated
Comment thread .github/workflows/linux-ppc64le.yml Outdated
- name: "GCC with zvector (s390x)"
args: "-Dallow-noblas=false -Dcpu-baseline=vx"
- name: "clang with zvector (s390x)"
args: "-Dallow-noblas=false -Dcpu-baseline=vx"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had two jobs under QEMU; four seems a little excessive - can you do GCC and Clang with zvector here (or vice versa?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They produce different results, so I think it's better to have all 4 of them. Considering issues with clang, maybe clang jobs should be dropped until all issues when compiling with clang are resolved?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With clang 18, ppc64le tests are passing. See here- https://github.com/numpy/numpy/actions/runs/21948675141/job/63393053952

@AlekseiNikiforovIBM AlekseiNikiforovIBM Feb 12, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see clang-18 being used there. I see gcc. From linked build log:

C compiler for the host machine: cc (gcc 13.3.0 "cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0")
C linker for the host machine: cc ld.bfd 2.42
C++ compiler for the host machine: c++ (gcc 13.3.0 "c++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0")
C++ linker for the host machine: c++ ld.bfd 2.42

Clang-20 logs for comparison:

C compiler for the host machine: clang-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)")
C linker for the host machine: clang-20 ld.bfd 2.42
C++ compiler for the host machine: clang++-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)")
C++ linker for the host machine: clang++-20 ld.bfd 2.42

@rgommers rgommers added 03 - Maintenance component: CI component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Feb 11, 2026
@rgommers rgommers requested a review from seiko2plus February 11, 2026 20:25
@rgommers

Copy link
Copy Markdown
Member

I could also split it into a separate PR if you'd like.

A single PR seems fine to me, thanks.

@rgommers rgommers changed the title Implement workflows for s390x native CI CI/BUG: Implement workflows for s390x native CI Feb 12, 2026
@rgommers rgommers changed the title CI/BUG: Implement workflows for s390x native CI CI/BUG: add native jobs for s390x, fix bug in pack_inner Feb 12, 2026
@AlekseiNikiforovIBM AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from 7f05a2d to d3779d0 Compare February 12, 2026 12:52
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

I've disabled clang builds and tests for s390x while there are still too many issues to be fixed. I think enabling them later would be a good choice.

@AlekseiNikiforovIBM AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from bae121c to c9693ef Compare February 13, 2026 14:52
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

I've updated PR.

I've used docker and fedora:43 image to build and test numpy with clang. Fedora has newer clang, and maybe clang is patched differently there, so there are no miscompilations detected by testsuite.

Only 1 test failed and was needed to be fixed: test_cpu_features.py::Test_ZARCH_Features::test_features.

AT_HWCAP is no longer decoded in glibc output:
https://sourceware.org/git/?p=glibc.git;a=commit;h=e5893e6349541d871e8a25120bca014551d13ff5
Now it's provided as a hex number:

$ LD_SHOW_AUXV=1 /bin/true | grep AT_HWCAP
AT_HWCAP:             0x67ffff

Luckily, on s390x a copy of output is available in /proc/cpuinfo and I've switched to using it:

$ grep feature /proc/cpuinfo 
features        : esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt pcimio sie 

Unfortunately, I don't think similar output is available for ppc64le. Hex output of AT_HWCAP would be needed to be decoded for ppc64le when it'd be detected, or some other approach would have to be used.

I've also bumped zvector build to enable everything up to VXE2.

There still were 65 failing tests for ppc64le when building with clang:
https://github.com/numpy/numpy/actions/runs/21990958320/job/63537634918?pr=30819

2026-02-13T14:52:11.1751562Z = 65 failed, 47875 passed, 344 skipped, 2843 deselected, 34 xfailed, 3 xpassed in 256.65s (0:04:16) =

I disabled ppc64le clang build for now. It'd be better to have those tests fixed or disabled first, or some other build environment used.

@rgommers please take one more look.

@seiko2plus seiko2plus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix and extending the CI. The work is good to me. I left some comments below — not necessary to be accomplished, and they can be done in another PR. Regarding the SIMD optimization comment I left, it's for reference — it can be done within this PR or within another PR using the current C SIMD interface or during the transition of moving to Google Highway.

Comment thread .github/workflows/linux-ppc64le.yml Outdated
Comment thread .github/workflows/linux-ppc64le.yml Outdated
Comment thread .github/workflows/linux-ppc64le.yml Outdated
Comment thread .github/workflows/linux-ppc64le.yml Outdated
Comment thread .github/workflows/linux-ppc64le.yml Outdated
fail-fast: false
matrix:
config:
- name: "GCC (ppc64le)"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: "GCC (ppc64le)"
- name: "ppc64le/gcc - baseline(default)"

I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.

@@ -1632,6 +1652,11 @@ pack_inner(const char *inptr,
va = npyv_rev64_u8(va);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do better than just a fix. We can avoid using npyv_rev64_u8 on bigendian if the required order is PACK_ORDER_BIG. So we reverse if the required order is PACK_ORDER_BIG and byte order is NPY_LITTLE_ENDIAN, and we swap if the byte order is NPY_BIG_ENDIAN.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would that be done? I don't see how this place can be improved. Maybe such change could be added later in a separate PR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is simply no need to swap what's already mistakenly swapped! So this fix is like performing "!!". The core of the problem is in:

if (order == PACK_ORDER_BIG) {
v0 = npyv_rev64_u8(v0);
v1 = npyv_rev64_u8(v1);
v2 = npyv_rev64_u8(v2);
v3 = npyv_rev64_u8(v3);

A fix should be as simple as:

            if ((order == PACK_ORDER_BIG) != (NPY_BYTE_ORDER == NPY_BIG_ENDIAN)) {
                v0 = npyv_rev64_u8(v0);
                v1 = npyv_rev64_u8(v1);
                v2 = npyv_rev64_u8(v2);
                v3 = npyv_rev64_u8(v3);
            }

Am I missing something in here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it doesn't work as you'd expect it due to npyv_tobits_b8 calls.

It looks like PACK_ORDER_BIG is bit-order parameter, not byte-order parameter.

Let's take values from testcase I'm getting:

v[0]: 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01
v[1]: 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01
v[2]: 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01
v[3]: 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01
bb[0]: 0x0000000000009010
bb[1]: 0x000000000000d050
bb[2]: 0x000000000000b030
bb[3]: 0x000000000000f070
arr[0]: 0xf070b030d0509010

If excessive byteswapping is done where suggested and removed from other places, result would be:

v[0]: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01 0x00 0x00 0x00 0x00
v[1]: 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01 0x00 0x00 0x00 0x00
v[2]: 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01 0x00 0x00 0x00 0x00
v[3]: 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01 0x00 0x00 0x00 0x00
bb[0]: 0x0000000000000908
bb[1]: 0x0000000000000b0a
bb[2]: 0x0000000000000d0c
bb[3]: 0x0000000000000f0e

I don't see how from those bb values I can easily get correct value of arr[0].

@seiko2plus

Copy link
Copy Markdown
Member

Luckily, on s390x a copy of output is available in /proc/cpuinfo and I've switched to using it:

It's actually better to rely on cpuinfo in the testing unit since CPU feature detection depends on auxv. So validating auxv using the kernel's final report is a good idea, but I'm not sure if this would be friendly with chroots over QEMU BFMTs. However, it looks good to me.

I disabled ppc64le clang build for now. It'd be better to have those tests fixed or disabled first, or some other build environment used.

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

Thanks! I've incorporated new names for workflow and jobs.

I've also rebased this PR due to pyrefly in test requirements: it fails to build on s390x.

@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

Yes, it could be a clang bug. I've seen similar failures on s390x on Ubuntu 24.04 with clang 20. But on Fedora 43 with clang 20 it didn't reproduce for me. That's why I'm using docker and Fedora 43 for clang builds on s390x.

I don't have a ppc64le machine at the moment to debug this issue. I'll have to leave it to others.

@sandeepgupta12

Copy link
Copy Markdown
Contributor

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

Yes, it could be a clang bug. I've seen similar failures on s390x on Ubuntu 24.04 with clang 20. But on Fedora 43 with clang 20 it didn't reproduce for me. That's why I'm using docker and Fedora 43 for clang builds on s390x.

I don't have a ppc64le machine at the moment to debug this issue. I'll have to leave it to others.

I will give it a try for ppc64le.

Too many tests still fail and need investigation.
Newer clang from Fedora 43 builds numpy and passes tests better than clang from Ubuntu 24.04
AT_HWCAP is now a hex number and needs to be decoded.
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=e5893e6349541d871e8a25120bca014551d13ff5

Use /prc/cpuinfo to obtain same information.
Too many failing tests which need to be looked at.
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

I've rebased this PR on new main due to previous conflict in requirements/test_requirements.txt

@AlekseiNikiforovIBM AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from ee361ad to a93e9ec Compare March 11, 2026 09:38
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

@sandeepgupta12 has found that adding -fno-strict-aliasing resolves issues on ppc64le, and using clang-20 works on s390x.

The need for -fno-strict-aliasing can be investigated separately, I think.

It is already set for numpy itself, use it for compiled modules as well.
With this change no -fno-strict-aliasing is needed for clang on s390x and ppc64le
@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

It looks like numpy is compiled with -fno-strict-aliasing, but on-demand compiled additional modules were compiled without -fno-strict-aliasing. I've added this option there and now it's no longer needed to be specified in CFLAGS for ppc64le and s390x.

I've also fixed one clang warning on s390x about undefined behaviour in negative number shifting.

@AlekseiNikiforovIBM

Copy link
Copy Markdown
Contributor Author

Is anything blocking this PR?

@seiko2plus seiko2plus merged commit 8a0557f into numpy:main Apr 2, 2026
80 of 81 checks passed
@seiko2plus

Copy link
Copy Markdown
Member

Is anything blocking this PR?

My apologies. It was the bottleneck on my part. I merged this PR to speed things up and land the fix, and since the changes wouldn't affect performance on non-bigendian archs, I think it's fine to apply the suggestion I made in here #30819 (comment) in another PR.

It looks like numpy is compiled with -fno-strict-aliasing, but on-demand compiled additional modules were compiled without -fno-strict-aliasing. I've added this option there and now it's no longer needed to be specified in CFLAGS for ppc64le and s390x.

The fix is good to me. Yes, the flag should be passed to the C/C++ source already defined in

numpy/meson.build

Lines 48 to 51 in 245e20b

add_project_arguments(
cc.get_supported_arguments( '-fno-strict-aliasing'), language : 'c'
)
#
.

Thank you, @AlekseiNikiforovIBM.

@seiko2plus seiko2plus added the 09 - Backport-Candidate PRs tagged should be backported label Apr 2, 2026
AlekseiNikiforovIBM added a commit to AlekseiNikiforovIBM/numpy that referenced this pull request Apr 2, 2026
It no longer contains only ppc64le workflows, use more common name.
actionspz could work as a project name that provides runners.

Follow up to numpy#30819.
AlekseiNikiforovIBM added a commit to AlekseiNikiforovIBM/numpy that referenced this pull request Apr 7, 2026
It no longer contains only ppc64le workflows, use more common name.

Follow up to numpy#30819.
@rgommers rgommers added this to the 2.5.0 Release milestone Apr 7, 2026
charris pushed a commit to charris/numpy that referenced this pull request Apr 10, 2026
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 10, 2026
charris added a commit that referenced this pull request Apr 10, 2026
CI/BUG: add native jobs for s390x, fix bug in `pack_inner` (#30819)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

03 - Maintenance component: CI component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: use IBM self-hosted runner for native s390x tests

6 participants