CI/BUG: add native jobs for s390x, fix bug in `pack_inner` by AlekseiNikiforovIBM · Pull Request #30819 · numpy/numpy

AlekseiNikiforovIBM · 2026-02-11T14:55:39Z

Rename workflow linux-ppc64le.yml into linux-actionspz.yml.
Copy and update ppc64le workflow for s390x.
Disable qemu-based workflows for s390x.

Fix big endian issue in pack_inner.

Closes: #30817

@mkumatag @rgommers

mkumatag · 2026-02-11T15:14:16Z

@sandeepgupta12 ptal

AlekseiNikiforovIBM · 2026-02-11T16:39:22Z

So, currently clang tests are failing on ppc64le and s390x, and clang and gcc tests with forced vectorization fail on s390x.

Can we proceed with these workflows now or should tests be fixed first? I'm already looking into s390x failures.

AlekseiNikiforovIBM · 2026-02-11T17:38:26Z

This change fixes s390x tests with forced zvector:

663107d

I could also split it into a separate PR if you'd like.

Now there are only some failures when clang is used.

rgommers

Thanks @AlekseiNikiforovIBM. This is a bit tricky to review (let's blame the GitHub UI), so it'd be great if you could revert the name change and leave it at linux-ppc64le.yml for now. That will make the diff much smaller, and it'll be visible what you tweaked in the ppc64le job.

rgommers · 2026-02-11T20:19:16Z

+      fail-fast: false
+      matrix:
+        config:
+          - name: "GCC (ppc64le)"


This whole job matrix is ppc64le, so this name change seem a bit unnecessary

It looks like it takes title of workflow file, and then job name. Without this change there will be two jobs named "Native ppc64le and s390x Linux Test / GCC", one for ppc64le and one for s390x.

Suggested change

- name: "GCC (ppc64le)"

- name: "ppc64le/gcc - baseline(default)"

I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.

rgommers · 2026-02-11T20:23:39Z

+          - name: "GCC with zvector (s390x)"
+            args: "-Dallow-noblas=false -Dcpu-baseline=vx"
+          - name: "clang with zvector (s390x)"
+            args: "-Dallow-noblas=false -Dcpu-baseline=vx"


We had two jobs under QEMU; four seems a little excessive - can you do GCC and Clang with zvector here (or vice versa?

They produce different results, so I think it's better to have all 4 of them. Considering issues with clang, maybe clang jobs should be dropped until all issues when compiling with clang are resolved?

With clang 18, ppc64le tests are passing. See here- https://github.com/numpy/numpy/actions/runs/21948675141/job/63393053952

I don't see clang-18 being used there. I see gcc. From linked build log:

C compiler for the host machine: cc (gcc 13.3.0 "cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0") C linker for the host machine: cc ld.bfd 2.42 C++ compiler for the host machine: c++ (gcc 13.3.0 "c++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0") C++ linker for the host machine: c++ ld.bfd 2.42

Clang-20 logs for comparison:

C compiler for the host machine: clang-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)") C linker for the host machine: clang-20 ld.bfd 2.42 C++ compiler for the host machine: clang++-20 (clang 20.1.2 "Ubuntu clang version 20.1.2 (0ubuntu1~24.04.2)") C++ linker for the host machine: clang++-20 ld.bfd 2.42

rgommers · 2026-02-11T20:26:09Z

I could also split it into a separate PR if you'd like.

A single PR seems fine to me, thanks.

AlekseiNikiforovIBM · 2026-02-12T15:44:00Z

I've disabled clang builds and tests for s390x while there are still too many issues to be fixed. I think enabling them later would be a good choice.

AlekseiNikiforovIBM · 2026-02-13T15:12:01Z

I've updated PR.

I've used docker and fedora:43 image to build and test numpy with clang. Fedora has newer clang, and maybe clang is patched differently there, so there are no miscompilations detected by testsuite.

Only 1 test failed and was needed to be fixed: test_cpu_features.py::Test_ZARCH_Features::test_features.

AT_HWCAP is no longer decoded in glibc output:
https://sourceware.org/git/?p=glibc.git;a=commit;h=e5893e6349541d871e8a25120bca014551d13ff5
Now it's provided as a hex number:

$ LD_SHOW_AUXV=1 /bin/true | grep AT_HWCAP
AT_HWCAP:             0x67ffff

Luckily, on s390x a copy of output is available in /proc/cpuinfo and I've switched to using it:

$ grep feature /proc/cpuinfo 
features        : esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt pcimio sie

Unfortunately, I don't think similar output is available for ppc64le. Hex output of AT_HWCAP would be needed to be decoded for ppc64le when it'd be detected, or some other approach would have to be used.

I've also bumped zvector build to enable everything up to VXE2.

There still were 65 failing tests for ppc64le when building with clang:
https://github.com/numpy/numpy/actions/runs/21990958320/job/63537634918?pr=30819

2026-02-13T14:52:11.1751562Z = 65 failed, 47875 passed, 344 skipped, 2843 deselected, 34 xfailed, 3 xpassed in 256.65s (0:04:16) =

I disabled ppc64le clang build for now. It'd be better to have those tests fixed or disabled first, or some other build environment used.

@rgommers please take one more look.

seiko2plus

Thank you for the fix and extending the CI. The work is good to me. I left some comments below — not necessary to be accomplished, and they can be done in another PR. Regarding the SIMD optimization comment I left, it's for reference — it can be done within this PR or within another PR using the current C SIMD interface or during the transition of moving to Google Highway.

seiko2plus · 2026-02-20T20:05:32Z

+      fail-fast: false
+      matrix:
+        config:
+          - name: "GCC (ppc64le)"


Suggested change

- name: "GCC (ppc64le)"

- name: "ppc64le/gcc - baseline(default)"

I agree with @AlekseiNikiforovIBM. We need to be specific about the job name since this workflow supports both architectures. I would like to be more specific about the baseline since we enabled multiple baseline configurations.

seiko2plus · 2026-02-20T20:35:22Z

@@ -1632,6 +1652,11 @@ pack_inner(const char *inptr,
                va = npyv_rev64_u8(va);


I think we can do better than just a fix. We can avoid using npyv_rev64_u8 on bigendian if the required order is PACK_ORDER_BIG. So we reverse if the required order is PACK_ORDER_BIG and byte order is NPY_LITTLE_ENDIAN, and we swap if the byte order is NPY_BIG_ENDIAN.

How would that be done? I don't see how this place can be improved. Maybe such change could be added later in a separate PR?

What I mean is simply no need to swap what's already mistakenly swapped! So this fix is like performing "!!". The core of the problem is in:

numpy/numpy/_core/src/multiarray/compiled_base.c

Lines 1595 to 1599 in 245e20b

if (order == PACK_ORDER_BIG) {

v0 = npyv_rev64_u8(v0);

v1 = npyv_rev64_u8(v1);

v2 = npyv_rev64_u8(v2);

v3 = npyv_rev64_u8(v3);

A fix should be as simple as:

if ((order == PACK_ORDER_BIG) != (NPY_BYTE_ORDER == NPY_BIG_ENDIAN)) { v0 = npyv_rev64_u8(v0); v1 = npyv_rev64_u8(v1); v2 = npyv_rev64_u8(v2); v3 = npyv_rev64_u8(v3); }

Am I missing something in here?

Yes, it doesn't work as you'd expect it due to npyv_tobits_b8 calls.

It looks like PACK_ORDER_BIG is bit-order parameter, not byte-order parameter.

Let's take values from testcase I'm getting:

v[0]: 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01 v[1]: 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01 v[2]: 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01 v[3]: 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01 bb[0]: 0x0000000000009010 bb[1]: 0x000000000000d050 bb[2]: 0x000000000000b030 bb[3]: 0x000000000000f070 arr[0]: 0xf070b030d0509010

If excessive byteswapping is done where suggested and removed from other places, result would be:

v[0]: 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x01 0x00 0x00 0x00 0x00 v[1]: 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x00 0x01 0x00 0x00 0x00 0x00 v[2]: 0x00 0x00 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x00 0x01 0x01 0x00 0x00 0x00 0x00 v[3]: 0x00 0x01 0x01 0x01 0x00 0x00 0x00 0x00 0x01 0x01 0x01 0x01 0x00 0x00 0x00 0x00 bb[0]: 0x0000000000000908 bb[1]: 0x0000000000000b0a bb[2]: 0x0000000000000d0c bb[3]: 0x0000000000000f0e

I don't see how from those bb values I can easily get correct value of arr[0].

seiko2plus · 2026-02-20T20:54:38Z

Luckily, on s390x a copy of output is available in /proc/cpuinfo and I've switched to using it:

It's actually better to rely on cpuinfo in the testing unit since CPU feature detection depends on auxv. So validating auxv using the kernel's final report is a good idea, but I'm not sure if this would be friendly with chroots over QEMU BFMTs. However, it looks good to me.

I disabled ppc64le clang build for now. It'd be better to have those tests fixed or disabled first, or some other build environment used.

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

AlekseiNikiforovIBM · 2026-02-23T12:56:19Z

Thanks! I've incorporated new names for workflow and jobs.

I've also rebased this PR due to pyrefly in test requirements: it fails to build on s390x.

AlekseiNikiforovIBM · 2026-02-23T15:24:35Z

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

Yes, it could be a clang bug. I've seen similar failures on s390x on Ubuntu 24.04 with clang 20. But on Fedora 43 with clang 20 it didn't reproduce for me. That's why I'm using docker and Fedora 43 for clang builds on s390x.

I don't have a ppc64le machine at the moment to debug this issue. I'll have to leave it to others.

sandeepgupta12 · 2026-02-24T12:38:31Z

It would be great to have a minimal reproducer, and you can open a separate issue for it. It could be a clang bug!

Yes, it could be a clang bug. I've seen similar failures on s390x on Ubuntu 24.04 with clang 20. But on Fedora 43 with clang 20 it didn't reproduce for me. That's why I'm using docker and Fedora 43 for clang builds on s390x.

I don't have a ppc64le machine at the moment to debug this issue. I'll have to leave it to others.

I will give it a try for ppc64le.

Add parameters for runner and compiler

Only 16-byte vectorization path is tested for big endian.

Too many tests still fail and need investigation.

Newer clang from Fedora 43 builds numpy and passes tests better than clang from Ubuntu 24.04

AT_HWCAP is now a hex number and needs to be decoded. https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=e5893e6349541d871e8a25120bca014551d13ff5 Use /prc/cpuinfo to obtain same information.

Too many failing tests which need to be looked at.

…ithub view

AlekseiNikiforovIBM · 2026-02-25T09:44:32Z

I've rebased this PR on new main due to previous conflict in requirements/test_requirements.txt

AlekseiNikiforovIBM · 2026-03-11T10:04:11Z

@sandeepgupta12 has found that adding -fno-strict-aliasing resolves issues on ppc64le, and using clang-20 works on s390x.

The need for -fno-strict-aliasing can be investigated separately, I think.

…tive value

It is already set for numpy itself, use it for compiled modules as well. With this change no -fno-strict-aliasing is needed for clang on s390x and ppc64le

AlekseiNikiforovIBM · 2026-03-13T15:04:23Z

It looks like numpy is compiled with -fno-strict-aliasing, but on-demand compiled additional modules were compiled without -fno-strict-aliasing. I've added this option there and now it's no longer needed to be specified in CFLAGS for ppc64le and s390x.

I've also fixed one clang warning on s390x about undefined behaviour in negative number shifting.

AlekseiNikiforovIBM · 2026-04-01T11:11:24Z

Is anything blocking this PR?

seiko2plus · 2026-04-02T13:49:35Z

Is anything blocking this PR?

My apologies. It was the bottleneck on my part. I merged this PR to speed things up and land the fix, and since the changes wouldn't affect performance on non-bigendian archs, I think it's fine to apply the suggestion I made in here #30819 (comment) in another PR.

It looks like numpy is compiled with -fno-strict-aliasing, but on-demand compiled additional modules were compiled without -fno-strict-aliasing. I've added this option there and now it's no longer needed to be specified in CFLAGS for ppc64le and s390x.

The fix is good to me. Yes, the flag should be passed to the C/C++ source already defined in

numpy/meson.build

Lines 48 to 51 in 245e20b

    
           add_project_arguments( 
        
             cc.get_supported_arguments( '-fno-strict-aliasing'), language : 'c' 
        
           ) 
        
           #

.

Thank you, @AlekseiNikiforovIBM.

It no longer contains only ppc64le workflows, use more common name. actionspz could work as a project name that provides runners. Follow up to numpy#30819.

It no longer contains only ppc64le workflows, use more common name. Follow up to numpy#30819.

CI/BUG: add native jobs for s390x, fix bug in `pack_inner` (#30819)

AlekseiNikiforovIBM marked this pull request as draft February 11, 2026 15:05

AlekseiNikiforovIBM marked this pull request as ready for review February 11, 2026 16:39

rgommers reviewed Feb 11, 2026

View reviewed changes

rgommers added 03 - Maintenance component: CI component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Feb 11, 2026

rgommers requested a review from seiko2plus February 11, 2026 20:25

rgommers changed the title ~~Implement workflows for s390x native CI~~ CI/BUG: Implement workflows for s390x native CI Feb 12, 2026

rgommers changed the title ~~CI/BUG: Implement workflows for s390x native CI~~ CI/BUG: add native jobs for s390x, fix bug in pack_inner Feb 12, 2026

AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from 7f05a2d to d3779d0 Compare February 12, 2026 12:52

AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from bae121c to c9693ef Compare February 13, 2026 14:52

seiko2plus approved these changes Feb 20, 2026

View reviewed changes

AlekseiNikiforovIBM force-pushed the s390x_ci branch from fde7603 to 2e215d9 Compare February 23, 2026 12:55

AlekseiNikiforovIBM force-pushed the s390x_ci branch from 343ed1b to d7f772c Compare February 23, 2026 13:31

AlekseiNikiforovIBM added 4 commits February 25, 2026 10:43

Implement workflow for s390x

e29cb19

Add parameters for runner and compiler

Disable qemu-based workflows for s390x

ef50867

Fix test_packbits.py::test_packbits_large on s390x

3cf7672

Only 16-byte vectorization path is tested for big endian.

Fix setting CC and CXX for clang jobs

4bc75ce

AlekseiNikiforovIBM added 7 commits February 25, 2026 10:43

Disable clang builds for s390x

93c0430

Too many tests still fail and need investigation.

Implement clang CI for ppc64le and s390x using docker and fedora:43

84c2456

Newer clang from Fedora 43 builds numpy and passes tests better than clang from Ubuntu 24.04

Use /proc/cpuinfo to get CPU features on s390x

7aebfbe

AT_HWCAP is now a hex number and needs to be decoded. https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=e5893e6349541d871e8a25120bca014551d13ff5 Use /prc/cpuinfo to obtain same information.

Disable clang build for ppc64le

c08c905

Too many failing tests which need to be looked at.

Bump zvector builds to VXE2

d37eeac

Change workflow title to move it closer to other linux workflows in g…

a028651

…ithub view

Rename workflow and jobs for ppc64le and s390x

9a03276

AlekseiNikiforovIBM force-pushed the s390x_ci branch from d7f772c to 9a03276 Compare February 25, 2026 09:43

AlekseiNikiforovIBM force-pushed the s390x_ci branch 2 times, most recently from ee361ad to a93e9ec Compare March 11, 2026 09:38

Try clang-20 with -fno-strict-aliasing for ppc64le and s390x

ff5d22a

AlekseiNikiforovIBM force-pushed the s390x_ci branch from a93e9ec to ff5d22a Compare March 11, 2026 09:46

AlekseiNikiforovIBM added 2 commits March 13, 2026 15:50

s390x: fix clang warning about undefined behaviour when shifting nega…

4d429e5

…tive value

Add -fno-strict-aliasing to numpy modules project file

d51ea67

It is already set for numpy itself, use it for compiled modules as well. With this change no -fno-strict-aliasing is needed for clang on s390x and ppc64le

seiko2plus merged commit 8a0557f into numpy:main Apr 2, 2026
80 of 81 checks passed

seiko2plus added the 09 - Backport-Candidate PRs tagged should be backported label Apr 2, 2026

AlekseiNikiforovIBM mentioned this pull request Apr 2, 2026

MAINT: Rename ppc64le and s390x workflow #31121

Merged

AlekseiNikiforovIBM added a commit to AlekseiNikiforovIBM/numpy that referenced this pull request Apr 7, 2026

Rename ppc64le and s390x workflow

54a3f81

It no longer contains only ppc64le workflows, use more common name. Follow up to numpy#30819.

rgommers added this to the 2.5.0 Release milestone Apr 7, 2026

AlekseiNikiforovIBM mentioned this pull request Apr 7, 2026

MAINT: allow reading auxv output from recent glibc versions on s390x #31175

Open

charris pushed a commit to charris/numpy that referenced this pull request Apr 10, 2026

CI/BUG: add native jobs for s390x, fix bug in pack_inner (numpy#30819)

feb4052

charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 10, 2026

charris added a commit that referenced this pull request Apr 10, 2026

Merge pull request #31198 from charris/backport-30819

1388e93

CI/BUG: add native jobs for s390x, fix bug in `pack_inner` (#30819)

charris mentioned this pull request Apr 10, 2026

MAINT: Rename ppc64le and s390x workflow (#31121) #31207

Merged

	- name: "GCC (ppc64le)"
	- name: "ppc64le/gcc - baseline(default)"

		@@ -1632,6 +1652,11 @@ pack_inner(const char *inptr,
		va = npyv_rev64_u8(va);

	if (order == PACK_ORDER_BIG) {
	v0 = npyv_rev64_u8(v0);
	v1 = npyv_rev64_u8(v1);
	v2 = npyv_rev64_u8(v2);
	v3 = npyv_rev64_u8(v3);

Uh oh!

Uh oh!

Conversation

AlekseiNikiforovIBM commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkumatag commented Feb 11, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlekseiNikiforovIBM commented Feb 11, 2026

Uh oh!

rgommers left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlekseiNikiforovIBM Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rgommers commented Feb 11, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 12, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 13, 2026

Uh oh!

seiko2plus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seiko2plus commented Feb 20, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 23, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 23, 2026

Uh oh!

sandeepgupta12 commented Feb 24, 2026

Uh oh!

AlekseiNikiforovIBM commented Feb 25, 2026

Uh oh!

AlekseiNikiforovIBM commented Mar 11, 2026

Uh oh!

AlekseiNikiforovIBM commented Mar 13, 2026

Uh oh!

AlekseiNikiforovIBM commented Apr 1, 2026

Uh oh!

Uh oh!

seiko2plus commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

AlekseiNikiforovIBM commented Feb 11, 2026 •

edited

Loading

AlekseiNikiforovIBM commented Feb 11, 2026 •

edited

Loading

AlekseiNikiforovIBM Feb 12, 2026 •

edited

Loading