Skip to content

fix: VirtualArray's .copy() method should be a deep copy#3600

Closed
ikrommyd wants to merge 1 commit intoscikit-hep:mainfrom
ikrommyd:fix-virtual-array-copy
Closed

fix: VirtualArray's .copy() method should be a deep copy#3600
ikrommyd wants to merge 1 commit intoscikit-hep:mainfrom
ikrommyd:fix-virtual-array-copy

Conversation

@ikrommyd
Copy link
Copy Markdown
Collaborator

@ikrommyd ikrommyd commented Aug 4, 2025

While adding tests for #3599, I found this bug/typo. .copy() of an ndarray is a deep copy. It should be like that for VirtualArray too. This is semi-critical too. It's probably not being hit often but I just hit it while writing tests and was getting wrong results.

@ikrommyd ikrommyd changed the title virtual array .copy() should be a deepcopy fix: VirtualArray's .copy() should be a deep copy Aug 4, 2025
@ikrommyd ikrommyd requested review from ianna and pfackeldey August 4, 2025 13:24
@ikrommyd ikrommyd changed the title fix: VirtualArray's .copy() should be a deep copy fix: VirtualArray's .copy() method should be a deep copy Aug 4, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Aug 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.48%. Comparing base (b749e49) to head (d220770).
⚠️ Report is 392 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/_nplikes/virtual.py 90.60% <100.00%> (ø)

... and 195 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - Thanks for catching the bug! I'm a bit concerned that the tests are being adjusted after the fact. Let's aim to batch small VirtualArray-related changes into fewer, more complete PRs—unless it's something blocking or critical. Thanks!

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

@ikrommyd - Thanks for catching the bug! I'm a bit concerned that the tests are being adjusted after the fact. Let's aim to batch small VirtualArray-related changes into fewer, more complete PRs—unless it's something blocking or critical. Thanks!

Yes the original tests are wrong. When I wrote them I probably thought that .copy() should return a shallow copy which was a mistake and it wasn't caught in the review I guess. .copy() on an ndarray deep copies all the elements unless the array has an object dtype (which we don't care abiut in virtual arrays).

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

Closing to add everything in #3599

@ikrommyd ikrommyd closed this Aug 4, 2025
@ianna
Copy link
Copy Markdown
Member

ianna commented Aug 4, 2025

@ikrommyd - Thanks for catching the bug! I'm a bit concerned that the tests are being adjusted after the fact. Let's aim to batch small VirtualArray-related changes into fewer, more complete PRs—unless it's something blocking or critical. Thanks!

Yes the original tests are wrong. When I wrote them I probably thought that .copy() should return a shallow copy which was a mistake and it wasn't caught in the review I guess. .copy() on an ndarray deep copies all the elements unless the array has an object dtype (which we don't care abiut in virtual arrays).

For NumPy ndarrays, .copy() creates a deep copy of all elements, which can be costly for large arrays because it duplicates the entire data buffer. In contrast, a shallow copy would only copy references, which is much faster but risks unintended side effects if the data is mutated.

Since VirtualArrays don’t support .data and don’t rely on object dtype, forcing a deep copy could potentially slow things down if the arrays are large. However, if the copy is necessary to ensure data integrity, the trade-off might be worthwhile.

It’s worth profiling specific use cases to see if this impacts performance noticeably in your context.

Another point: moving to a deep copy could invalidate Numba-typed references since those rely on pointers to the original array’s memory buffer. When .copy() makes a deep copy, it creates a new data buffer, so existing Numba pointers to the old data would become stale or invalid.

This could lead to subtle bugs or performance issues in JIT-compiled code relying on those pointers. It’s important to consider this when changing from shallow to deep copies, especially in performance-critical sections using Numba.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 4, 2025

Well we could potentially implement .copy() using np.copy which is slightly faster than copy.deepcopy. In any case, .copy() should be a deep copy. Without these changes, we have this problem which is kinda bad:

In [1]: import awkward as ak
   ...: import numpy as np
   ...:
   ...: from awkward._nplikes.virtual import VirtualArray, materialize_if_virtual
   ...: from awkward._nplikes.numpy import Numpy
   ...: from awkward._nplikes.shape import unknown_length
   ...:
   ...: numpy = Numpy.instance()
   ...:
   ...: def shape_generator():
   ...:     print("CALLING SHAPE GENERATOR!")
   ...:     return (10,)
   ...:
   ...: def data_generator():
   ...:     print("CALLING DATA GENERATOR!")
   ...:     return np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int32)
   ...:
   ...: v = VirtualArray(numpy, (unknown_length,), np.int32, data_generator, shape_generator)

In [2]: v.materialize()
CALLING DATA GENERATOR!
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)

In [3]: v2 = v.copy()

In [4]: v2[0] = 111

In [5]: v
Out[5]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

In [6]: v2
Out[6]: VirtualArray(array=[111   1   2   3   4   5   6   7   8   9], dtype('int32'), shape=(10,))

Regarding numba, we do not let virtual arrays exists in numba compiled function and ask the user to materialize the array before passing it in.

@ikrommyd ikrommyd deleted the fix-virtual-array-copy branch August 26, 2025 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants