Skip to content

feat: ensure ability to pickle ak.Array and ak.Record with VirtualArray buffers#3612

Merged
ianna merged 15 commits intoscikit-hep:mainfrom
ikrommyd:ensure-virtual-array-pickling
Aug 27, 2025
Merged

feat: ensure ability to pickle ak.Array and ak.Record with VirtualArray buffers#3612
ianna merged 15 commits intoscikit-hep:mainfrom
ikrommyd:ensure-virtual-array-pickling

Conversation

@ikrommyd
Copy link
Copy Markdown
Collaborator

@ikrommyd ikrommyd commented Aug 9, 2025

Fixes scikit-hep/coffea#1380

This PR ensures that highlevel ak.Array and ak.Record are picklable when they contain VirtualArray buffers.
The solution to this problem is in 3 parts (+1 typo fix):

  1. We implement __reduce__ for VirtualArray in order to make the VirtualArray picklable. TypeTracerArray implements this too. This operation materializes the VirtualArray. We don't want to serialize unmaterialized virtual arrays and we can't because they hold lambda functions on them
  2. We ensure that to_packed(recursive=True) materializes. ak.to_packed should do that anyways like it did in awkward1 and it's used before IO and pickling which means the array should be materialized at that point. It was already materializing every layout apart from bit and byte masked arrays, we just now ensure it with an explicit call. recursive=False which is not exposed to the user via ak.to_packed doesn't recursively materialize because it leads to overmaterialization in ak.unflatten. This is the only usage of recursive=False I've seen overall. ak.to_packed is also used in the __reduce_ex__ implementation of the highlevel objects to implement pickling so the whole layout should be fully materialized before being pickled. The simplest way to implement that was to implement _to_packed and to_packed for all layouts where to_packed just calls materializes if necessary and then calls _to_packed. This pattern of implementation is there for ALL other layout methods (see to_backend and others for example).
  3. The __reduce_ex__ implementations of the high-level objects try to find a foreign pickler if another library implements one for awkward arrays and records. Therefore we ensure materialization before we pass the array to the foreign pickler. If dask-awkward is installed in the environment, it will find that one for example because it implements one. It searches globally the whole environment for implementations.
  4. We fix a tiny typo in ak.from_buffers where this line should pass down shape_generator not _shape_generator. It's added in this PR because it's a one-line fix and it was found via testing this PR.

A new test file has been added that tests all that. A single test will fail because it needs the .copy() fix for VirtualArrays implemented in #3599. The PR is in draft mode because of that and because it needs some testing with full analyses to ensure we haven't broken anything with the materializing to_packed(True) in other closely related libraries like coffea and awkward-zipper.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 9, 2025

cc @pfackeldey I think this is in line with everything we found out discussing about this issue.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Aug 9, 2025

Hi @kratsg, could you try this PR for scikit-hep/coffea#1380 please :). The array will be naturally materialized during pickling and also it must not contain behaviors from vector because those have lambda functions and are unpicklable. You can remove all behaviors with array.behavior = {}.

Copy link
Copy Markdown
Collaborator

@pfackeldey pfackeldey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just minor comments. I think it would be good to get the confirmation from @kratsg that this PR solves the pickling issue he reported before we merge this 👍

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

Thanks for the comments @pfackeldey. Indeed waiting for Giordon's check is good. I have independently tested it with the following since Giordon reported that the problem was found if a coffea processor returned an unmaterialized array. The following errors on main.

from coffea.nanoevents import NanoEventsFactory
from coffea import processor
from distributed import Client
import awkward as ak


fileset = {"dy": {"files": {"../coffea/tests/samples/nano_dy.root": "Events"}}}


class P(processor.ProcessorABC):
    def process(self, events):
        events.behavior = {}
        return {"dy": events.run}

    def postprocess(self, accumulator):
        pass

if __name__ == "__main__":
    run = processor.Runner(executor=processor.FuturesExecutor(compression=5), chunksize=40, savemetrics=True)
    print(run(fileset=fileset, processor_instance=P()))

@kratsg
Copy link
Copy Markdown

kratsg commented Aug 20, 2025

Quick test confirms this solves it. I don't have the old reproducer available, but I re-ran with the script and just pickled the outputs from the processor and it looks good to me (no errors, no crashes).

@ikrommyd ikrommyd marked this pull request as ready for review August 26, 2025 16:21
@github-actions
Copy link
Copy Markdown

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3612

@ikrommyd ikrommyd requested a review from ianna August 26, 2025 16:39
@codecov
Copy link
Copy Markdown

codecov bot commented Aug 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.65%. Comparing base (b749e49) to head (565c8c2).
⚠️ Report is 409 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/_nplikes/virtual.py 90.04% <100.00%> (ø)
src/awkward/contents/bitmaskedarray.py 69.85% <100.00%> (-0.59%) ⬇️
src/awkward/contents/bytemaskedarray.py 87.64% <100.00%> (-2.02%) ⬇️
src/awkward/contents/content.py 77.04% <100.00%> (+2.23%) ⬆️
src/awkward/contents/emptyarray.py 76.36% <100.00%> (+0.74%) ⬆️
src/awkward/contents/indexedarray.py 85.00% <100.00%> (+3.96%) ⬆️
src/awkward/contents/indexedoptionarray.py 89.10% <100.00%> (+0.89%) ⬆️
src/awkward/contents/listarray.py 90.55% <100.00%> (+1.12%) ⬆️
src/awkward/contents/listoffsetarray.py 81.09% <100.00%> (-1.78%) ⬇️
src/awkward/contents/numpyarray.py 91.35% <100.00%> (-0.16%) ⬇️
... and 7 more

... and 180 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - thanks, looks great. Please update the docstring. Thanks

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - thanks for addressing the comments. Looks great! Thanks.

@ianna ianna merged commit 34d6233 into scikit-hep:main Aug 27, 2025
46 checks passed
@ikrommyd ikrommyd deleted the ensure-virtual-array-pickling branch August 27, 2025 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to pickle unmaterialized virtual arrays returned from a processor

4 participants