feat: ensure ability to pickle ak.Array and ak.Record with VirtualArray buffers#3612
Conversation
|
cc @pfackeldey I think this is in line with everything we found out discussing about this issue. |
|
Hi @kratsg, could you try this PR for scikit-hep/coffea#1380 please :). The array will be naturally materialized during pickling and also it must not contain behaviors from |
pfackeldey
left a comment
There was a problem hiding this comment.
I have just minor comments. I think it would be good to get the confirmation from @kratsg that this PR solves the pickling issue he reported before we merge this 👍
|
Thanks for the comments @pfackeldey. Indeed waiting for Giordon's check is good. I have independently tested it with the following since Giordon reported that the problem was found if a coffea processor returned an unmaterialized array. The following errors on from coffea.nanoevents import NanoEventsFactory
from coffea import processor
from distributed import Client
import awkward as ak
fileset = {"dy": {"files": {"../coffea/tests/samples/nano_dy.root": "Events"}}}
class P(processor.ProcessorABC):
def process(self, events):
events.behavior = {}
return {"dy": events.run}
def postprocess(self, accumulator):
pass
if __name__ == "__main__":
run = processor.Runner(executor=processor.FuturesExecutor(compression=5), chunksize=40, savemetrics=True)
print(run(fileset=fileset, processor_instance=P())) |
|
Quick test confirms this solves it. I don't have the old reproducer available, but I re-ran with the script and just pickled the outputs from the processor and it looks good to me (no errors, no crashes). |
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3612 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
🚀 New features to boost your workflow:
|
Fixes scikit-hep/coffea#1380
This PR ensures that highlevel
ak.Arrayandak.Recordare picklable when they containVirtualArraybuffers.The solution to this problem is in 3 parts (+1 typo fix):
__reduce__forVirtualArrayin order to make theVirtualArraypicklable.TypeTracerArrayimplements this too. This operation materializes theVirtualArray. We don't want to serialize unmaterialized virtual arrays and we can't because they hold lambda functions on themto_packed(recursive=True)materializes.ak.to_packedshould do that anyways like it did in awkward1 and it's used before IO and pickling which means the array should be materialized at that point. It was already materializing every layout apart from bit and byte masked arrays, we just now ensure it with an explicit call.recursive=Falsewhich is not exposed to the user viaak.to_packeddoesn't recursively materialize because it leads to overmaterialization inak.unflatten. This is the only usage ofrecursive=FalseI've seen overall.ak.to_packedis also used in the__reduce_ex__implementation of the highlevel objects to implement pickling so the whole layout should be fully materialized before being pickled. The simplest way to implement that was to implement_to_packedandto_packedfor all layouts whereto_packedjust calls materializes if necessary and then calls_to_packed. This pattern of implementation is there for ALL other layout methods (seeto_backendand others for example).__reduce_ex__implementations of the high-level objects try to find a foreign pickler if another library implements one for awkward arrays and records. Therefore we ensure materialization before we pass the array to the foreign pickler. If dask-awkward is installed in the environment, it will find that one for example because it implements one. It searches globally the whole environment for implementations.ak.from_bufferswhere this lineawkward/src/awkward/operations/ak_from_buffers.py
Line 348 in 5ef8ad5
shape_generatornot_shape_generator. It's added in this PR because it's a one-line fix and it was found via testing this PR.A new test file has been added that tests all that. A single test will fail because it needs the
.copy()fix forVirtualArrays implemented in #3599. The PR is in draft mode because of that and because it needs some testing with full analyses to ensure we haven't broken anything with the materializingto_packed(True)in other closely related libraries like coffea and awkward-zipper.