Skip to content

fix: make is_c_contiguous touch data#3636

Closed
ikrommyd wants to merge 2 commits intoscikit-hep:mainfrom
ikrommyd:ikrommyd/touch-in-is-c-contiguous
Closed

fix: make is_c_contiguous touch data#3636
ikrommyd wants to merge 2 commits intoscikit-hep:mainfrom
ikrommyd:ikrommyd/touch-in-is-c-contiguous

Conversation

@ikrommyd
Copy link
Copy Markdown
Collaborator

@ikrommyd ikrommyd commented Sep 4, 2025

I'm pretty sure that the logic here should be "The code wants to know whether a buffer is contiguous, therefore this buffer should be loaded"

@ikrommyd ikrommyd changed the title fix: make is_c_contiguous touch fix: make is_c_contiguous touch data Sep 4, 2025
@ianna ianna requested review from agoose77 and ianna September 4, 2025 15:37
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.65%. Comparing base (b749e49) to head (dbd520e).
⚠️ Report is 413 commits behind head on main.

Additional details and impacted files
Files with missing lines Coverage Δ
src/awkward/_nplikes/typetracer.py 77.18% <100.00%> (+2.33%) ⬆️
src/awkward/operations/ak_to_packed.py 100.00% <ø> (ø)

... and 195 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 4, 2025

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3636

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Sep 4, 2025

FYI @agoose77 in case you look at this, this effectively makes ak.to_packed touch every leaf NumpyArray which is what we think it should do. ak.to_packed materialized the whole virtual layout in awkward 1 and also materializes all virtual buffers in awkward 2.

@agoose77
Copy link
Copy Markdown
Collaborator

agoose77 commented Sep 4, 2025

@ikrommyd thank you for the ping! I have thought a lot about this topic, and it's lovely to be brought in.

I'm pretty sure we don't want to do this, but the reason for that might not be very clear! I remember digging into it explicitly somewhere, and found a little bit of discussion on Slack. Let me try and capture what I remember:

In Awkward 1.0, we had a pretty simple type-data separation: forms were type information, and layouts composed this with buffers for data. Typetracer came along in Awkward 2 to give usthe ability to trace operations on data through low level manipulation, and decompose these into type changes and data manipulation.

Whilst it's obvious that dtype is a type-thing, and sum(arr) is a data-thing, it's not clear where contiguousness sits (which I touch upon here: https://iris-hep.slack.com/archives/D05PL00PTDJ/p1740579855952359). Really, it's a data-thing except that it only matters if you're probing the data in a funny way.

Right now, we explicitly choose to consider contiguousness a detail that depends upon the data in a non-touching way. When we say that an operation is touching, what we're really doing is partitioning the set of operations on an array into two groups:

  • You need to load the data to answer that question
  • You don't need to load the data to answer that question

With contiguousness, there's a third option:

  • The answer to the question depends upon whether you have loaded data or not

By imposing this axiom, it means that any time we have a branch that tests is_c_contiguous, we must ensure that:

  1. typetracer skips that test and has a dedicated typetracer block
  2. the non-typetracer branch doesn't depend upon the result of is_c_contiguous for the array type

At this point, I would expect you to feel uncertain or hestitant — I certainly did at first. What about to_packed? By definition, it needs to test contiguousness. Yes, it does! But, not in a type-changing way. It only checks contiguousness in a "ensure that this ends up being contiguous" kind of way. That means it's more of an implementation detail than an important part of the type system.

So, why can we skip touching buffers in to_packed? The most helpful way to think about typetracer (for me, at least) is

  • Typetracing is designed so that you can run a graph of array transforms.
  • This process can yield information that is used to build a new set of input arrays containing real and placeholder buffers
  • The graph can then be re-computed over these inputs such that it doesn't fail

The trick, therefore, is to declare that placeholders and typetracers are C contiguous. That way, to_packed doesn't fail -- it just doesn't do anything, because placeholders are already packed. Another way to think about this is that making contiguousness an implementation detail, we can apply Awkward graph operations to packed and non-packed inputs identically.

I'm not convinced that this last step will entirely make sense, so please do feel free to query it and I can try and rewrite it. FWIW I do find it really helpful to avoid thinking about typetracing as touching, and more about a system that lets us reason about our arrays whose rules we are choosing consciously.

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Sep 4, 2025

Thanks @agoose77! I will read it carefully tomorrow with a clear brain. What I want to quickly comment though is that there are multiple cases where we have branching like elif not self.is_contiguous in https://github.com/scikit-hep/awkward/blob/main/src/awkward/contents/numpyarray.py and self.is_contiguous will always be True for typetracer so that also doesn't sound fully right to me.

The way I quickly thought about to_packed is "I want to pack this layout to minimize IO before some IO operation, therefore I want ALL the data of this layout".

@ikrommyd
Copy link
Copy Markdown
Collaborator Author

ikrommyd commented Sep 4, 2025

@pfackeldey and @ianna also please comment here and let me know how to proceed. From the user's perspective, the users will just want dak.to_packed not error because we "materialize" placeholders here:

return self.materialize()._to_packed(True)

So if it turns out that is_c_contiguous shouldn't touch data, we are gonna have to make to have layout.materialize() ignore placeholders.

@pfackeldey
Copy link
Copy Markdown
Collaborator

pfackeldey commented Sep 5, 2025

Hi @agoose77,
thanks for your detailed comment.

I agree that contiguousness is in a weird spot where it isn't obvious directly if it should involve touching or not. To me the purpose of typetracing is to infer if an array/buffer needs to be in memory or not during runtime. Checking for contiguousness does need the array/buffer in memory, because contiguousness describes part of the memory layout. Sure, we do not need the numerical values, but we do need it to be in memory. That's why I'd argue that it does make sense to treat operations that involve contiguousness as operations that need to add touching.

We can of course do this shortcut to assume that TypeTracerArrays are C-contiguous, but that raises:

  1. a pretty uncomfortable feeling in me because that means we're following different code-paths during tracing and during runtime, which I think should be avoided as much as possible in order to reason about tracing. And this may lead to potential bugs of course (but I assume this has been thoroughly thought through/tested) if the two branches are not properly 'aligned'.

and

  1. a question: what was the motivation for this special contiguous handling in the first place? Has there been any signs of over-touching in dask-awkward because of this? In my view, we should make sure that we're running the same code-paths during tracing and during runtime - unless this would lead to massive over-touching that we could avoid with some tricks like this one. Currently, the only way this touching is triggered is (afaik) when running to_packed, and I think that this is a reasonable behavior - similarly to how VirtualArrays materialize then in awkward 1 and awkward 2. to_packed is called usually before IO ops that write to disk, and in order to do so those buffers have to be loaded anyway (which they are right now, but through a different touching step).

We're seeing this because we made the decision that VirtualArrays should fully materialize in to_packed in order to call ascontiguous on them (as it was done in awkward 1) and to make sure they're in memory so e.g. a subsequent IO operation can write them to disk. Now, PlaceholderArrays also follow this logic and can be materialized which just throws a RuntimeError. Thus, I see 2 ways forward here as Iason pointed out:

  1. we make this contiguous check here touch the data, so we won't ever have PlaceholderArrays in a layout that we call to_packed on.

or

  1. we explicitly only make VirtualArrays, but not PlaceholderArrays, materialize in to_packed.

Personally, I'd still prefer (1.) because if we go with (2.) we just introduce yet another special handling that makes the codebase and the tracing step harder to reason about in the future. In addition, I'm still lacking the motivation for this special handling in the first place, I do not yet see why or where checking for contiguousness (or maybe even calling ascontiguousarray) would introduce e.g. massive over-touching?

On the other hand, I do not want to delay the release much longer either... I'd be fine with (2.) as well, although it feels like a hack to me 😅

@agoose77
Copy link
Copy Markdown
Collaborator

agoose77 commented Sep 5, 2025

Apologies for the wall of text. It's as much to get my brain on the page as it is to be helpful.

To me the purpose of typetracing is to infer if an array/buffer needs to be in memory or not.

This is where we slightly differ in opinion -- my interpretation is that typetracer is a system the Awkward team (us) have designed that lets us predict certain properties about the resulting array. We have the choice about the extent of that prediction, and it's a trade-off between completeness and performance.

Internally in Awkward, we never test contiguousness in a type-dependent way (IIRC), we only test for performance / kernel preparedness. We assume (know) that real (loaded) non-contiguous data will be properly made contiguous by the high level functions that are re-run after tracing.

we're following different code-paths during tracing and during runtime

Let's say that you make is_contiguous_array touching. You still need to return a concrete boolean for the conditional. That means making an arbitrary choice, which is what we're already doing (except such a change is less well-defined than the status quo, because the present situation explicitly states that contiguousness is not covered by TT). I can't off-hand remember if we have other examples of boolean types that we cover with tracing. The usual pattern is to allow size_t-like members to return unknown values, and we then throw errors if they're depended upon in e.g. a boolean context.

To make this point more clear, if we do touch data when we test contiguousness, we're still going to have to make a data-less decision during type tracing. That's identical to what happens if we don't touch data.

  • In the non-touching case, what happens is that if the buffer is never realised, the is_c_contiguous(placeholder) call returns an arbitrary choice. We choose True because it's simpler (we usually want contiguousness internally).
  • In the touching case, the actual contiguousness of the buffer is returned.

In both cases, the code that tests contiguousness has to do something, and the traced type of the result(s) needs to be independent of the actual value. So we can't reduce the number of code-paths here. The difference is really only that suddenly a lot of calls become touching that previously weren't.

The distinction from data touching is that data is not an implementation detail. If we have an ak operation that wants to e.g. compute ak.count(arr, axis=None), then if no touching occurs then the result depends upon what the count kernels do when they encounter a placeholder. Obviously that's throw an error, although we could always return -1 if we wanted 😆. The existing design means that we can't/shouldn't expose an ak.is_c_contiguous(arr), therefore.

In simpler words, the present situation is

Treat contiguousness as an implementation detail rather than a core part of the high level array.

And that's pretty much true for how we treat it internally -- Awkward freely modifies contiguousness in nearly every operation.


So, regarding ak.to_packedak.to_packed shouldn't necessarily touch data. It should touch data when going from ByteMaskedArray to IndexedOptionArray, etc, but the actual packed-ness is not type visible.

Aside -- the way to fully touch an array to ensure all buffers are present is ak.typetracer.touch_data().


I'm quite conscious that I am not as active in the project as I used to be. I am acutely aware of the tension between providing insight into previous design decisions and patterns, and impeding change through lack of bandwidth. Ultimately the decision rests with you all, but I can make time for a Zoom call etc. if it's helpful

@pfackeldey
Copy link
Copy Markdown
Collaborator

pfackeldey commented Sep 5, 2025

This is where we slightly differ in opinion -- my interpretation is that typetracer is a system the Awkward team (us) have designed that lets us predict certain properties about the resulting array.

Ok, it's just that to me this means then slightly different things for awkward-array alone (what you describe) and for the application with e.g. dask-awkward that uses tracing then also for a column projection (what I describe). The type prediction is of course important for the graph construction. So effectively we're using tracers for both: predicting output types and predicting buffer requirements. If the latter is not true (what I describe), then isn't it wrong to use TypeTracerArrays for column projection?

Let's say that you make is_contiguous_array touching. You still need to return a concrete boolean for the conditional. That means making an arbitrary choice, which is what we're already doing (except such a change is less well-defined than the status quo, because the present situation explicitly states that contiguousness is not covered by TT). I can't off-hand remember if we have other examples of boolean types that we cover with tracing. The usual pattern is to allow size_t-like members to return unknown values, and we then throw errors if they're depended upon in e.g. a boolean context.

To make this point more clear, if we do touch data when we test contiguousness, we're still going to have to make a data-less decision during type tracing. That's identical to what happens if we don't touch data.

That is true, I was rather thinking there's no point in this check at all and remove it entirely 😅. We could just call ascontiguousarray wherever we want to ensure contiguousness (and that operation touches), or not call it if we don't care about contiguousness. This is_contiguous check could vanish entirely, and then we won't have different code paths anymore. To me this whole logic is right now a special case that was introduced to avoid touching for contiguousness, but I haven't seen yet a reason why this should be avoided in the first place, i.e., did it ever introduce over-touching in analyses?

I can accept that this was just a choice at some point, and we're at a point where we don't want to change the typetracer interface anymore. My hope was that we do not develop VirtualArrays, PlaceholderArrays, and TypeTracerArrays as 3 different entities: they share a lot of common logic! And given that we do not see any over-materialization with VirtualArrays when just materializing full layouts in to_packed, I was pretty comfortable that this special casing here may not be needed for tracing either.

@pfackeldey
Copy link
Copy Markdown
Collaborator

pfackeldey commented Sep 5, 2025

In order to not delay a release and have a safe fix, I'd like to propose changing this PR to instead not materialize PlaceholderArrays for to_packed as we can decide they just are C-contiguous - then there's no point in "materializing" them. Then, we're following the current TypeTracer logic and do not make any invasive changes in the existing typetracer interface.

Are you ok with this @ianna for the release?


I still think that maybe we could get rid of the contiguous check entirely, and be able to get away with just calling ascontiguousarray (with touching) wherever we want to ensure contiguousness. I'm pretty sure it is a noop in NumPy if the array is already contiguous. This would eliminate special casing between tracing and runtime code-paths. It is a bit of an 'invasive' change that may yield different necessary columns in physics workflows. However, I am confident that this is not the case given that we do not see this with VirtualNDArrays currently. Anyway, given this uncertainty I'd propose to continue with the solution (2.) of my first reply, and we may want to follow this up at some point in the future.

Copy link
Copy Markdown
Member

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd, @agoose77 and @pfackeldey - Thanks for your input! I think we can close this PR as there is an alternative solution in place. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants