fix: make is_c_contiguous touch data#3636
Conversation
is_c_contiguous touchis_c_contiguous touch data
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files
🚀 New features to boost your workflow:
|
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3636 |
|
FYI @agoose77 in case you look at this, this effectively makes |
|
@ikrommyd thank you for the ping! I have thought a lot about this topic, and it's lovely to be brought in. I'm pretty sure we don't want to do this, but the reason for that might not be very clear! I remember digging into it explicitly somewhere, and found a little bit of discussion on Slack. Let me try and capture what I remember: In Awkward 1.0, we had a pretty simple type-data separation: forms were type information, and layouts composed this with buffers for data. Typetracer came along in Awkward 2 to give usthe ability to trace operations on data through low level manipulation, and decompose these into type changes and data manipulation. Whilst it's obvious that Right now, we explicitly choose to consider contiguousness a detail that depends upon the data in a non-touching way. When we say that an operation is touching, what we're really doing is partitioning the set of operations on an array into two groups:
With contiguousness, there's a third option:
By imposing this axiom, it means that any time we have a branch that tests
At this point, I would expect you to feel uncertain or hestitant — I certainly did at first. What about So, why can we skip touching buffers in
The trick, therefore, is to declare that placeholders and typetracers are C contiguous. That way, I'm not convinced that this last step will entirely make sense, so please do feel free to query it and I can try and rewrite it. FWIW I do find it really helpful to avoid thinking about typetracing as touching, and more about a system that lets us reason about our arrays whose rules we are choosing consciously. |
|
Thanks @agoose77! I will read it carefully tomorrow with a clear brain. What I want to quickly comment though is that there are multiple cases where we have branching like The way I quickly thought about |
|
@pfackeldey and @ianna also please comment here and let me know how to proceed. From the user's perspective, the users will just want awkward/src/awkward/contents/content.py Line 1184 in 72655a6 So if it turns out that is_c_contiguous shouldn't touch data, we are gonna have to make to have layout.materialize() ignore placeholders.
|
|
Hi @agoose77, I agree that contiguousness is in a weird spot where it isn't obvious directly if it should involve touching or not. To me the purpose of typetracing is to infer if an array/buffer needs to be in memory or not during runtime. Checking for contiguousness does need the array/buffer in memory, because contiguousness describes part of the memory layout. Sure, we do not need the numerical values, but we do need it to be in memory. That's why I'd argue that it does make sense to treat operations that involve contiguousness as operations that need to add touching. We can of course do this shortcut to assume that TypeTracerArrays are C-contiguous, but that raises:
and
We're seeing this because we made the decision that VirtualArrays should fully materialize in
or
Personally, I'd still prefer (1.) because if we go with (2.) we just introduce yet another special handling that makes the codebase and the tracing step harder to reason about in the future. In addition, I'm still lacking the motivation for this special handling in the first place, I do not yet see why or where checking for contiguousness (or maybe even calling On the other hand, I do not want to delay the release much longer either... I'd be fine with (2.) as well, although it feels like a hack to me 😅 |
|
Apologies for the wall of text. It's as much to get my brain on the page as it is to be helpful.
This is where we slightly differ in opinion -- my interpretation is that typetracer is a system the Awkward team (us) have designed that lets us predict certain properties about the resulting array. We have the choice about the extent of that prediction, and it's a trade-off between completeness and performance. Internally in Awkward, we never test contiguousness in a type-dependent way (IIRC), we only test for performance / kernel preparedness. We assume (know) that real (loaded) non-contiguous data will be properly made contiguous by the high level functions that are re-run after tracing.
Let's say that you make To make this point more clear, if we do touch data when we test contiguousness, we're still going to have to make a data-less decision during type tracing. That's identical to what happens if we don't touch data.
In both cases, the code that tests contiguousness has to do something, and the traced type of the result(s) needs to be independent of the actual value. So we can't reduce the number of code-paths here. The difference is really only that suddenly a lot of calls become touching that previously weren't. The distinction from data touching is that data is not an implementation detail. If we have an ak operation that wants to e.g. compute In simpler words, the present situation is
And that's pretty much true for how we treat it internally -- Awkward freely modifies contiguousness in nearly every operation. So, regarding Aside -- the way to fully touch an array to ensure all buffers are present is I'm quite conscious that I am not as active in the project as I used to be. I am acutely aware of the tension between providing insight into previous design decisions and patterns, and impeding change through lack of bandwidth. Ultimately the decision rests with you all, but I can make time for a Zoom call etc. if it's helpful |
Ok, it's just that to me this means then slightly different things for awkward-array alone (what you describe) and for the application with e.g. dask-awkward that uses tracing then also for a column projection (what I describe). The type prediction is of course important for the graph construction. So effectively we're using tracers for both: predicting output types and predicting buffer requirements. If the latter is not true (what I describe), then isn't it wrong to use TypeTracerArrays for column projection?
That is true, I was rather thinking there's no point in this check at all and remove it entirely 😅. We could just call I can accept that this was just a choice at some point, and we're at a point where we don't want to change the typetracer interface anymore. My hope was that we do not develop VirtualArrays, PlaceholderArrays, and TypeTracerArrays as 3 different entities: they share a lot of common logic! And given that we do not see any over-materialization with VirtualArrays when just materializing full layouts in |
|
In order to not delay a release and have a safe fix, I'd like to propose changing this PR to instead not materialize PlaceholderArrays for Are you ok with this @ianna for the release? I still think that maybe we could get rid of the contiguous check entirely, and be able to get away with just calling |
ianna
left a comment
There was a problem hiding this comment.
@ikrommyd, @agoose77 and @pfackeldey - Thanks for your input! I think we can close this PR as there is an alternative solution in place. Thanks!
I'm pretty sure that the logic here should be "The code wants to know whether a buffer is contiguous, therefore this buffer should be loaded"