Skip to content

fix: do not force LE data-order in form.py#3629

Merged
ianna merged 3 commits intoscikit-hep:mainfrom
nileshpatra:s390-fix
Sep 5, 2025
Merged

fix: do not force LE data-order in form.py#3629
ianna merged 3 commits intoscikit-hep:mainfrom
nileshpatra:s390-fix

Conversation

@nileshpatra
Copy link
Copy Markdown
Contributor

Hi, on big endian (s390x) a simple array create fails as follows. This seems to stem from the fact that byte order is hard-coded. The same is also seen in debian CI, see here.

It should instead use the (native) byte order of the build machine.

% python
Python 3.13.7 (main, Aug 20 2025, 22:17:40) [GCC 14.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward
>>> array = awkward.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}], [], [{"x": 4.4, "y": [1, 2, 3, \
4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]])
Traceback (most recent call last):
   File "<python-input-1>", line 1, in <module>
     array = awkward.Array([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}], [], [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]])
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/highlevel.py", line 325, in __init__
     layout = ak.operations.to_layout(
         data, allow_record=False, regulararray=False, primitive_policy="error"
     )
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/_dispatch.py", line 41, in dispatch
     with OperationErrorContext(name, args, kwargs):
          ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/_errors.py", line 80, in __exit__
     raise self.decorate_exception(exception_type, exception_value)
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/_dispatch.py", line 67, in dispatch
     next(gen_or_result)
     ~~~~^^^^^^^^^^^^^^^
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_to_layout.py", line 80, in to_layout
     return _impl(
         array,
     ...<6 lines>...
         string_policy,
     )
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_to_layout.py", line 280, in _impl
     return ak.operations.from_iter(
            ~~~~~~~~~~~~~~~~~~~~~~~^
         obj, highlevel=False, allow_record=allow_record
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     )
     ^
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/_dispatch.py", line 42, in dispatch
     gen_or_result = func(*args, **kwargs)
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_iter.py", line 70, in from_iter
     return _impl(iterable, highlevel, behavior, allow_record, initial, resize, attrs)
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_iter.py", line 105, in _impl
     return ak.operations.ak_from_buffers._impl(
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
         form,
         ^^^^^
     ...<8 lines>...
         attrs=attrs,
         ^^^^^^^^^^^^
     )[0]
     ^
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py", line 157, in _impl
     out = _reconstitute(
         form,
     ...<7 lines>...
         shape_generator=lambda: (length,),
     )
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py", line 571, in _reconstitute
     content = _reconstitute(
         form.content,
     ...<7 lines>...
         _shape_generator,
     )
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py", line 549, in _reconstitute
     offsets = _from_buffer(
         backend.nplike,
     ...<5 lines>...
         shape_generator=_shape_generator,
     )
   File "/home/nilesh/venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py", line 241, in _from_buffer
     raise TypeError(
         f"size of array ({array.size}) is less than size of form ({count})"
     )
TypeError: size of array (4) is less than size of form (216172782113783809)

This error occurred while calling

     ak.to_layout(
         [[{'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': [1, 2]}, {'x': 3.3, 'y': [1, ...
         allow_record = False
         regulararray = False
         primitive_policy = 'error'
     )

@nileshpatra
Copy link
Copy Markdown
Contributor Author

/cc: @jpivarski @tcawlfield

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.65%. Comparing base (b749e49) to head (867623c).
⚠️ Report is 414 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/forms/form.py 81.33% <ø> (-0.94%) ⬇️
src/awkward/operations/ak_to_buffers.py 93.75% <ø> (ø)

... and 196 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Moelf
Copy link
Copy Markdown
Contributor

Moelf commented Aug 21, 2025

It should instead use the (native) byte order of the build machine.

in that case we would need more infra to track what's the native byte order because this package interpolate with use cases (cuda) and formats (arrow/parquet/root) that care about byte order when you send chunks of data to them or store onto disks.

@ikrommyd
Copy link
Copy Markdown
Collaborator

We do track what's the native byteorder. ak._util.native_byteorder and there is also ak._util.native_to_byteorder. It's a bit weird because ak.from_buffers let's us specify a byteorder and when you are defining an array like that from an iterator, ak.from_iter will be used which passes the native byteorder to from_buffers here:

byteorder=ak._util.native_byteorder,

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 21, 2025

@nileshpatra - Thanks for reporting it! I think, there are two possible fixes inside Awkward:

  1. Honor native byte order when generating forms

Instead of hardcoding <f8, the form should use the array’s actual dtype.str (which encodes endianness, e.g. >f8 on s390x). That way, the form and the buffer stay consistent.

  1. Normalize to little-endian at buffer creation

Always cast dtypes to little-endian (array.astype(array.dtype.newbyteorder('<'))) before serializing. This is what formats like Arrow do.
→ The downside is that every big-endian machine will be forced to copy/convert.

@Moelf
Copy link
Copy Markdown
Contributor

Moelf commented Aug 21, 2025

→ The downside is that every big-endian machine will be forced to copy/convert.

we know this is not too bad because we have been doing this for reading TTree for decades on LE machines I guess

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 21, 2025

→ The downside is that every big-endian machine will be forced to copy/convert.

we know this is not too bad because we have been doing this for reading TTree for decades on LE machines I guess

I tend to agree that we want to standardize on little-endian (for cross-platform interchange), then the mapping should instead explicitly convert all dtypes to < (and cast on big-endian platforms).

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 21, 2025

My understanding by emulating a s390x machine is that the problem is here:

array = nplike.reshape(buffer.view(dtype), shape=(-1,), copy=copy)

We view the buffers according to dtype. But the dtype is defined from the _reconstitute recursion and is passed through this index_to_dtype. See here for example:

dtype=index_to_dtype[form.offsets],

In this case the array builder is used to build the array (if you manually provide the right buffers in the form of numpy arrays, it's not a problem like ak.Array(np.array([1,2,3])) for example). The array builder here reports the buffers in the native byte order:
builder = _ext.ArrayBuilder(initial=initial, resize=resize)
builder.fromiter(iterable)
formstr, length, buffers = builder.to_buffers()
form = ak.forms.from_json(formstr)

Yet even though from_buffers allows both byte orders as options, index_to_dtype enforces a view as little endian when the buffer may not be little endian.

If we wanna keep this behavior of the array builder reporting back buffers in native byteorder and from_buffers allowing both byte orders, we must not force little endian dtype.

If we wanna allow little-endian byteorder only, we need to A) ensure that the array builder gives back little endian buffers only and B) disallow big-endian byteorders everywhere and force a litttle endian byteorder here:

array = nplike.reshape(buffer.view(dtype), shape=(-1,), copy=copy)
.

@ianna suggested both of these solutions. I'm a bit skeptical about the second one because I think numpy creates new arrays in the native byte order. We create many intermediate arrays in the awkward code base and also for the kernels and if you force little endian in from_buffers, your intermediate arrays will still be in the native byteorder (which will be different for big endian systems). I don't know what kind of effects math between these arrays will have and also what kind of cpp kernel effects.
You also can't enforce this everywhere. For example ak.Array(some numpy array) doesn't use from_buffers. It straight up converts the numpy array into a NumpyArray layout.
I tend to believe that allowing any byte order is safer and nicer to more people since numpy doesn't enforce little endian upon array creation.

Please let me know if my analysis is wrong and where.

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 21, 2025

For a concrete example, on a big endian machine when calling ak.Array([1,2,3]), for the buffer reported from the array builder

(Pdb) buffers["node1-data"]
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 3], dtype=uint8)
(Pdb) buffers["node1-data"].view(">i8")
array([1, 2, 3])
(Pdb) buffers["node1-data"].view("<i8")
array([ 72057594037927936, 144115188075855872, 216172782113783808],
      dtype='<i8')

You can see how viewing it as >i8 is the correct way to view it. For the same example on a little endian machine

(Pdb) buffers["node1-data"]
array([1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 0], dtype=uint8)
(Pdb) buffers["node1-data"].view(">i8")
array([ 72057594037927936, 144115188075855872, 216172782113783808],
      dtype='>i8')
(Pdb) buffers["node1-data"].view("<i8")
array([1, 2, 3])

Viewing it as <i8 is the correct thing to do.

For clarification, buffers is this buffers here:

formstr, length, buffers = builder.to_buffers()

The error being raised is

TypeError: size of array (3) is less than size of form (216172782113783808)

because the length is calculated from the last element of the offsets with the wrong view due to index_to_dtype

(Pdb) buffers
{'node0-offsets': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3], dtype=uint8), 'node1-data': array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 3], dtype=uint8)}
(Pdb) buffers["node0-offsets"]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3], dtype=uint8)
(Pdb) buffers["node0-offsets"].view("<i8")
array([                 0, 216172782113783808], dtype='<i8')

And while the array has length 3 ([1,2,3]), the length is assumed to be 216172782113783808, hence the error message.

@ikrommyd
Copy link
Copy Markdown
Collaborator

This PR still doesn't solve from_buffers(to_buffers) round-trips though because to_buffers spits back little endian buffers which from_buffers will try to see as native with this PR now

>>> ak.to_buffers(ak.Array([1, 2, 3, 4, 5]))
(NumpyForm('int64', form_key='node0'), 5, {'node0-data': array([ 72057594037927936, 144115188075855872, 216172782113783808,
       288230376151711744, 360287970189639680])})

Maybe I'm just stupid and the only problem is that the array builder spits back native byteorder buffers when using builder.to_buffers. Since ak.to_buffers by default returns little endian buffers and from_buffers by default accepts little endian buffers (although we should probably let the user choose instead of the hard-coded index_to_dtype), maybe the right solution is just making the array builder spit back little endian buffers

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 21, 2025

I'm cc'ing @jpivarski and @agoose77 for any comments here.

Basically awkward to/from buffers (+round trips) and from iter do not work on big endian systems and we just didn't know.

I find it weird that the array builder to_buffers spits back native byteorder buffers while ak.to_buffers spits back little-endian.
I also find it weird that the user can choose a byte order as argument in from_buffers, yet this is not used everywhere due to the hard-coded index_to_dtype which converts dtypes to little endian even though the user may request big-endian byteorder manually.

@nileshpatra
Copy link
Copy Markdown
Contributor Author

This PR still doesn't solve from_buffers(to_buffers) round-trips though because to_buffers spits back little endian buffers which from_buffers will try to see as native with this PR now

I've just pushed a fix for this.

I read all the comments on this PR, and am at the end, quite confused as to what is the best path forward here :-)

@nileshpatra
Copy link
Copy Markdown
Contributor Author

maybe the right solution is just making the array builder spit back little endian buffers

I hopefully have the right fix for this too locally, here

Let me know if I should it this to the PR instead of current commits.

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 22, 2025

@ianna suggested both of these solutions.

Correction:
@ianna suggested to follow the default little-endian format of the Apache Arrow columnar data format, which is a standard for efficient data serialization.

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 22, 2025

@ianna suggested both of these solutions.

Correction: @ianna suggested to follow the default little-endian format of the Apache Arrow columnar data format, which is a standard for efficient data serialization.

Yeah not very accurate language here on my side. I just meant that you mentioned both ways in your message of A) mostly leaving it as is and fixing it and B) enforcing little endian, that’s all.

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 22, 2025

@ianna suggested both of these solutions.

Correction: @ianna suggested to follow the default little-endian format of the Apache Arrow columnar data format, which is a standard for efficient data serialization.

Yeah not very accurate language here on my side. I just meant that you mentioned both ways in your message of A) mostly leaving it as is and fixing it and B) enforcing little endian, that’s all.

Please see my previous post- we are going with plan B 😀

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 22, 2025

Yeah I noticed. I have one question about this plan. Do you wanna enforce it literally everywhere? i.e disable completely the byteorder argument of from_buffers here as well?


I am asking because if someone uses the ak.from_buffers interface directly from the outside and provides their own buffers, you can't control their endianess. I guess you can say that awkward array expects only little-endian buffers. What's the suggestion regarding this one? Do you want this option to stay and only enforce little-endian inside the awkward code base but still let users provide big-endian buffers or go away completely?

Awkward also assumes in multiple places that the byteorder is native. See

def dtype_to_primitive(dtype):
if dtype.kind.upper() == "M" and dtype == dtype.newbyteorder("="):
return str(dtype)
else:
out = _dtype_to_primitive_dict.get(dtype)
if out is None:
raise TypeError(
"unsupported dtype: {}. Must be one of\n\n {}\n\nor a "
"datetime64/timedelta64 with units (e.g. 'datetime64[15us]')".format(
repr(dtype), ", ".join(_primitive_to_dtype_dict)
)
)
return out
for example

@agoose77
Copy link
Copy Markdown
Collaborator

I'm being really rude and skimming this thread -- I've timeboxed 10m to think about this!

The context of this, which I think @ianna and others have shared / intuited, is that we chose to serialise to little-endian in order to yield a predominantly zero-copy reading experience: #2067 (comment)

The intention was that whilst we serialise to little endian, we should operate on arrays in the host endianness. As such, we only need to enforce endianness at the IO boundaries.

The awkward ak.to_buffers and ak.from_buffers should support specifying endianness, so users can still zero-copy read/write arrays on big endian -- they just need to opt-in.

It sounds like the compiled ArrayBuilder isn't plumbed in to Awkward with consideration to endianness, so that sounds like the main issue.

TL;DR

  • Round-tripping via to_buffers should work without any user input, but it should copy on big endian
  • Power users can specify the big endian byteorder to prevent copying
  • We might have a bug in the hand-off between ArrayBuilder and awkward.

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 22, 2025

Thanks @agoose77!

Should big endian round trips also work on little endian systems like this? Because this is not a round-trip atm

In [1]: import awkward as ak

In [2]: ak._util.native_byteorder
Out[2]: '<'

In [3]: array = ak.Array([1,2,3])

In [4]: ak.from_buffers(*ak.to_buffers(array, byteorder=">"), byteorder=">")
Out[4]: <Array [72057594037927936, ..., 216172782113783808] type='3 * int64'>

@agoose77
Copy link
Copy Markdown
Collaborator

agoose77 commented Aug 22, 2025

@ikrommyd yes, that should work (although it would be handy to add an assert that the native byteorder is > for the reproducer).

We might want to add a qemu test that runs Awkward on big endian, but I don't want to unduly increase the maintenance burden.

Crucially, forms describe in-memory arrays, and therefore can't have an endianness -- how would we handle LE on a BE system? We need to convert at the boundaries instead. So, forms should just be int64 instead of <i8.

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Aug 22, 2025

Yeah I fixed the repro 😃. This is on my macbook so it's little endian.

Endianess should probably be checked here in this block too and flipped accordingly right?

elif isinstance(buffer, PlaceholderArray) or nplike.is_own_array(buffer):
# Require 1D buffers
copy = None if isinstance(nplike, Jax) else False # Jax can not avoid this
array = nplike.reshape(buffer.view(dtype), shape=(-1,), copy=copy)
# we can't compare with count or slice when we're working with tracers
if not (isinstance(nplike, Jax) and nplike.is_currently_tracing()):
if array.size < count:
raise TypeError(
f"size of array ({array.size}) is less than size of form ({count})"
)
return array[:count]
else:
return array

It's only checked in the else block below which is when is_own_array is False.

@agoose77
Copy link
Copy Markdown
Collaborator

I reckon that's probably it!

@nileshpatra
Copy link
Copy Markdown
Contributor Author

Hi @ikrommyd would you consider to list the changes that are needed here? I'll be happy to adjust my PR accordingly (given that I did spend some cycles to write a working patch).

I don't understand the codebase in great detail -- it was meant to be a drive by contribution since I saw the package failing in Debian.

Alternatively, if you want to work out the details, I could as well close the PR. Let me know.

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 22, 2025

Hi @ikrommyd would you consider to list the changes that are needed here? I'll be happy to adjust my PR accordingly (given that I did spend some cycles to write a working patch).

I don't understand the codebase in great detail -- it was meant to be a drive by contribution since I saw the package failing in Debian.

Alternatively, if you want to work out the details, I could as well close the PR. Let me know.

This PR thread is now officially a choose-your-own-adventure novel. Everyone’s commenting like it’s a race, skipping chapters, and somehow we’ve ended up debating something that was resolved three scrolls ago. I vote we pause the chaos, summon the fellowship (aka a meeting), and bring some actual coherence to this saga before GitHub files for emotional distress.

@ianna
Copy link
Copy Markdown
Member

ianna commented Sep 5, 2025

@all-contributors please add nileshpatra for code

@allcontributors
Copy link
Copy Markdown
Contributor

@ianna

I've put up a pull request to add @nileshpatra! 🎉

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nileshpatra - Thanks for you contribution! The changes proposed do not affect LE workflows - all the tests pass. And as we currently do not have an IB on BE and your PR fixes it for you I will merge it. Thanks!

@ianna ianna added the pr-next-release Required for the next release label Sep 5, 2025
@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 5, 2025

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3629

@ikrommyd
Copy link
Copy Markdown
Collaborator

ikrommyd commented Sep 5, 2025

Thank you @ianna! As I said previously and on the relevant issue, this fixes like 95% of the tests on a s390x VM so I think it's a good decision!

@ianna ianna merged commit d8e2dcf into scikit-hep:main Sep 5, 2025
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-next-release Required for the next release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants