Skip to content

fix: VirtualArray's deep copy + accidental .data access#3599

Merged
ianna merged 5 commits intoscikit-hep:mainfrom
ikrommyd:remove-accidental-data-getattr
Aug 26, 2025
Merged

fix: VirtualArray's deep copy + accidental .data access#3599
ianna merged 5 commits intoscikit-hep:mainfrom
ikrommyd:remove-accidental-data-getattr

Conversation

@ikrommyd
Copy link
Copy Markdown
Collaborator

@ikrommyd ikrommyd commented Aug 3, 2025

.copy() of an ndarray is a deep copy. It should be like that for VirtualArray too.
Also ak.merge_union_of_records asks for .data in two lines when the objects it asks for the .data attribute are already buffers (not ak.contents.Content or ak.index.Index). This accidentally returns a memoryview object which numpy and cupy can handle in asarray, but It doesn't look intended. Also the buffers of other backends don't implement a .data property which means this won't work for virtual and typetracer arrays.

  • VirtualArray's .copy() should be a deep copy
  • handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records

@ikrommyd ikrommyd changed the title fix: remove accidentally correct .data access in ak.merge_union_of_records fix: remove accidentally correct .data attribute access in ak.merge_union_of_records Aug 3, 2025
@ianna ianna changed the title fix: remove accidentally correct .data attribute access in ak.merge_union_of_records fix: handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records Aug 4, 2025
@ianna ianna changed the title fix: handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records fix: handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records Aug 4, 2025
@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

Added tests but ci will fail until #3600 is merged

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 4, 2025

Added tests but ci will fail until #3600 is merged

Thanks! Please add the changes from #3600 to this PR - it's all VirualArray-related bug fixes.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

Added tests but ci will fail until #3600 is merged

Thanks! Please add the changes from #3600 to this PR - it's all VirualArray-related bug fixes.

Uhhmm I think this makes it less clear. This PR just removes a .data which shouldn't be there in the first place. It's not a virtual array related PR, It just happens that it doesn't work for virtual arrays because they don't implement .data. But the typetracer doesn't either, nor any other array likes apart from numpy/cupy ndarray.

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 4, 2025

Added tests but ci will fail until #3600 is merged

Thanks! Please add the changes from #3600 to this PR - it's all VirualArray-related bug fixes.

Uhhmm I think this makes it less clear. This PR just removes a .data which shouldn't be there in the first place. It's not a virtual array related PR, It just happens that it doesn't work for virtual arrays because they don't implement .data. But the typetracer doesn't either, nor any other array likes apart from numpy/cupy ndarray.

I see your point, but I actually think keeping these changes together is clearer in this case. The .data fix is small, but it directly revealed the VirtualArray bug, and the tests won’t pass without the related fix from #3600 anyway. Splitting them just fragments the context and creates unnecessary back-and-forth.

Let’s keep this as a single PR that cleanly fixes the root issue and demonstrates it with tests.

@ikrommyd ikrommyd changed the title fix: handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records fix: VirtualArray's .copy() should be a deep copy + handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records` Aug 4, 2025
@ikrommyd ikrommyd changed the title fix: VirtualArray's .copy() should be a deep copy + handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records` fix: VirtualArray's .copy() should be a deep copy + handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records Aug 4, 2025
@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

I put everything here and closed #3600

@codecov
Copy link
Copy Markdown

codecov bot commented Aug 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.52%. Comparing base (b749e49) to head (554fcdd).
⚠️ Report is 400 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/_nplikes/virtual.py 90.60% <100.00%> (ø)
...rc/awkward/operations/ak_merge_union_of_records.py 84.48% <100.00%> (-1.11%) ⬇️

... and 194 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

This is what the copy change fixes btw:

In [1]: import awkward as ak
   ...: import numpy as np
   ...:
   ...: from awkward._nplikes.virtual import VirtualArray, materialize_if_virtual
   ...: from awkward._nplikes.numpy import Numpy
   ...: from awkward._nplikes.shape import unknown_length
   ...:
   ...: numpy = Numpy.instance()
   ...:
   ...: def shape_generator():
   ...:     print("CALLING SHAPE GENERATOR!")
   ...:     return (10,)
   ...:
   ...: def data_generator():
   ...:     print("CALLING DATA GENERATOR!")
   ...:     return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int32)
   ...:
   ...: v = VirtualArray(numpy, (unknown_length,), np.int32, data_generator, shape_generator)

In [2]: v.materialize()
CALLING DATA GENERATOR!
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [3]: v2 = v.copy()

In [4]: v2[0] = 111

In [5]: v
Out[5]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

In [6]: v2
Out[6]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 5, 2025

This is what the copy change fixes btw:

In [1]: import awkward as ak
   ...: import numpy as np
   ...:
   ...: from awkward._nplikes.virtual import VirtualArray, materialize_if_virtual
   ...: from awkward._nplikes.numpy import Numpy
   ...: from awkward._nplikes.shape import unknown_length
   ...:
   ...: numpy = Numpy.instance()
   ...:
   ...: def shape_generator():
   ...:     print("CALLING SHAPE GENERATOR!")
   ...:     return (10,)
   ...:
   ...: def data_generator():
   ...:     print("CALLING DATA GENERATOR!")
   ...:     return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int32)
   ...:
   ...: v = VirtualArray(numpy, (unknown_length,), np.int32, data_generator, shape_generator)

In [2]: v.materialize()
CALLING DATA GENERATOR!
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [3]: v2 = v.copy()

In [4]: v2[0] = 111

In [5]: v
Out[5]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

In [6]: v2
Out[6]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

There is already a deep copy solution for that:

>>> v
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v2 = ak.copy(v)
>>> v2
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v[0]=10
>>> v2
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v
VirtualArray(array=[10  1  2  3  4  5  6  7  8  9], dtype('int32'), shape=(10,))

The bigger issue is the following. Since Awkward Arrays are immutable by design, explicit copying is only necessary when you want to ensure no shared underlying memory (e.g. copying from a NumPy-backed array that may be mutated externally). I'm not convinced that we should allow users to mutate awkward arrays (even if they are virtual).

Also this is kind of bad behavior:

>>> v
VirtualArray(array=UNMATERIALIZED, dtype('int32'), shape=(awkward._nplikes.shape.unknown_length,))
>>> v.materialize()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
>>> v[0]
np.int32(0)
>>> v[0] = 1.1
>>> v
VirtualArray(array=[1 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 5, 2025

VirtualArray is not an awkward array. It's a "buffer". It should behave identically to a np.ndarray or a cupy.ndarray. The awkward codebase wants for good reason in multiple places to copy such buffers and uses .copy() on them. This has nothing to do with highlevel ak.copy. Therefore .copy() should behave similarly for VirtualArray as it does for ndarray. The user is never going to manually copy a VirtualArray. They don't ever see them. It's about making them behave like ndarrays because that's what they should behave like.

ak.copy deepcopies indeed because all it does it call copy.deepcopy on the input. So it can copy anything that implements __deepcopy__ and that doesn't have to be an awkward array. This .copy() change is only about VirtualArray.copy() which should be a deepcopy because it's a deep copy for numpy/cupy.ndarray.

@pfackeldey
Copy link
Copy Markdown
Collaborator

This is what the copy change fixes btw:

In [1]: import awkward as ak
   ...: import numpy as np
   ...:
   ...: from awkward._nplikes.virtual import VirtualArray, materialize_if_virtual
   ...: from awkward._nplikes.numpy import Numpy
   ...: from awkward._nplikes.shape import unknown_length
   ...:
   ...: numpy = Numpy.instance()
   ...:
   ...: def shape_generator():
   ...:     print("CALLING SHAPE GENERATOR!")
   ...:     return (10,)
   ...:
   ...: def data_generator():
   ...:     print("CALLING DATA GENERATOR!")
   ...:     return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int32)
   ...:
   ...: v = VirtualArray(numpy, (unknown_length,), np.int32, data_generator, shape_generator)

In [2]: v.materialize()
CALLING DATA GENERATOR!
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [3]: v2 = v.copy()

In [4]: v2[0] = 111

In [5]: v
Out[5]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

In [6]: v2
Out[6]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

There is already a deep copy solution for that:

>>> v
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v2 = ak.copy(v)
>>> v2
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v[0]=10
>>> v2
VirtualArray(array=[0 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))
>>> v
VirtualArray(array=[10  1  2  3  4  5  6  7  8  9], dtype('int32'), shape=(10,))

The bigger issue is the following. Since Awkward Arrays are immutable by design, explicit copying is only necessary when you want to ensure no shared underlying memory (e.g. copying from a NumPy-backed array that may be mutated externally). I'm not convinced that we should allow users to mutate awkward arrays (even if they are virtual).

Also this is kind of bad behavior:

>>> v
VirtualArray(array=UNMATERIALIZED, dtype('int32'), shape=(awkward._nplikes.shape.unknown_length,))
>>> v.materialize()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
>>> v[0]
np.int32(0)
>>> v[0] = 1.1
>>> v
VirtualArray(array=[1 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))

I think this not about a highlevel interface for users to copy things. This fix is needed because awkward-array itself does inplace mutation and copies. We have to make sure that VirtualArrays behave in these operations the same as numpy arrays, and this PR goes in this direction / does that.
I haven't looked at it thoroughly yet, but I wanted to clarify that this is not about ak.copy or a highlevel functionality that users will ever see/work with.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 5, 2025

Also this is kind of bad behavior:

>>> v
VirtualArray(array=UNMATERIALIZED, dtype('int32'), shape=(awkward._nplikes.shape.unknown_length,))
>>> v.materialize()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
>>> v[0]
np.int32(0)
>>> v[0] = 1.1
>>> v
VirtualArray(array=[1 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))

This is not our choice. This is standard numpy behavior

In [3]: x = np.arange(10, dtype=np.int32)

In [4]: x
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [5]: x[0] = 1.1

In [6]: x
Out[6]: array([1, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 5, 2025

Also this is kind of bad behavior:

>>> v
VirtualArray(array=UNMATERIALIZED, dtype('int32'), shape=(awkward._nplikes.shape.unknown_length,))
>>> v.materialize()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
>>> v[0]
np.int32(0)
>>> v[0] = 1.1
>>> v
VirtualArray(array=[1 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))

This is not our choice. This is standard numpy behavior

In [3]: x = np.arange(10, dtype=np.int32)

In [4]: x
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [5]: x[0] = 1.1

In [6]: x
Out[6]: array([1, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

This is our choice:

>>> array = ak.Array([1,2,3,4,5])
>>> array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
>>> array[0]=0
Traceback (most recent call last):
  File "<python-input-30>", line 1, in <module>
    array[0]=0
    ~~~~~^^^
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/highlevel.py", line 1193, in __setitem__
    with ak._errors.OperationErrorContext(
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        "ak.Array.__setitem__",
        ^^^^^^^^^^^^^^^^^^^^^^^
        (self,),
        ^^^^^^^^
        {"where": where, "what": what},
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/highlevel.py", line 1202, in __setitem__
    raise TypeError("only fields may be assigned in-place (by field name)")
TypeError: only fields may be assigned in-place (by field name)

This error occurred while calling

    ak.Array.__setitem__(
        <Array [1, 2, 3, 4, 5] type='5 * int64'>
        where = 0
        what = 0
    )

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 5, 2025

I think there is some confusion here in the whole discussion, a VirtualArray is not a high-level array. It's a "buffer". It behaves and should behave like a numpy ndarray. An awkward Array is a high-level array with different rules. They are not to be confused with each other. VirtualArrays are buffers like the ones you get when you call ak.to_buffers on an awkward array. They take the place of array([0, 3, 3, 5]) and array([1, 2, 3, 4, 5]) below

In [1]: import awkward as ak
ak
In [2]: array = ak.Array([[1,2,3], [], [4,5]])

In [3]: ak.to_buffers(array)
Out[3]:
(ListOffsetForm('i64', NumpyForm('int64', form_key='node1'), form_key='node0'),
 3,
 {'node0-offsets': array([0, 3, 3, 5]), 'node1-data': array([1, 2, 3, 4, 5])})

@pfackeldey
Copy link
Copy Markdown
Collaborator

Also this is kind of bad behavior:

>>> v
VirtualArray(array=UNMATERIALIZED, dtype('int32'), shape=(awkward._nplikes.shape.unknown_length,))
>>> v.materialize()
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
>>> v[0]
np.int32(0)
>>> v[0] = 1.1
>>> v
VirtualArray(array=[1 1 2 3 4 5 6 7 8 9], dtype('int32'), shape=(10,))

This is not our choice. This is standard numpy behavior

In [3]: x = np.arange(10, dtype=np.int32)

In [4]: x
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [5]: x[0] = 1.1

In [6]: x
Out[6]: array([1, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

This is our choice:

>>> array = ak.Array([1,2,3,4,5])
>>> array
<Array [1, 2, 3, 4, 5] type='5 * int64'>
>>> array[0]=0
Traceback (most recent call last):
  File "<python-input-30>", line 1, in <module>
    array[0]=0
    ~~~~~^^^
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/highlevel.py", line 1193, in __setitem__
    with ak._errors.OperationErrorContext(
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        "ak.Array.__setitem__",
        ^^^^^^^^^^^^^^^^^^^^^^^
        (self,),
        ^^^^^^^^
        {"where": where, "what": what},
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/Users/yana/Projects/refactoring/original/awkward/src/awkward/highlevel.py", line 1202, in __setitem__
    raise TypeError("only fields may be assigned in-place (by field name)")
TypeError: only fields may be assigned in-place (by field name)

This error occurred while calling

    ak.Array.__setitem__(
        <Array [1, 2, 3, 4, 5] type='5 * int64'>
        where = 0
        what = 0
    )

No, a VirtualArray is not a highlevel ak.Array, and can't be used like that. A VirtualArray has the same purpose as a np.array, a cp.array, a TypeTracerArray(), or a PlaceHolderArray(). It's what we refer to as a "buffer" that lives on the leaves of a layout. Thus, it has to behave like those as well, e.g. like a np.array for the cpu backend. If np.array allows inplace mutation and copying, we have to allow it in the same way here for VirtualArrays, otherwise awkward-array will silently do things wrong or error where it shouldn't as soon as a layout interacts with this buffer type.

@ianna ianna added the pr-next-release Required for the next release label Aug 14, 2025
@ianna ianna changed the title fix: VirtualArray's .copy() should be a deep copy + handle VirtualArray correctly by removing accidental .data access in ak.merge_union_of_records fix: VirtualArray's deep copy + accidental .data access Aug 26, 2025
@ianna ianna merged commit 9a585a3 into scikit-hep:main Aug 26, 2025
76 of 80 checks passed
@ariostas
Copy link
Copy Markdown
Member

@ikrommyd It seems like this PR broke something. See here

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

@ariostas Ah it must have been a combination of this + Peter setting the generator to assert_never in another PR upon materialization. This is why it's recommended to rebase PRs to main before merging. Will fix as soon as @ianna is done with merging PRs

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

Fix in #3631

@ikrommyd ikrommyd deleted the remove-accidental-data-getattr branch August 26, 2025 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-next-release Required for the next release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants