fix: string array numpy conversion fails with int32 offsets from parquet #3697

Merged
ianna merged 8 commits into scikit-hep:main from DylanModesitt:dcm/fix-parquet-string-int32-offsets
Oct 31, 2025

Conversation

@DylanModesitt
Contributor

Closes: #3696

Fixes a bug where converting string arrays to NumPy fails for arrays read back from Parquet files written with string_to32=True (the default). After deserialization, the resulting ListOffsetArray has int32 offsets instead of int64, and the UTF-8 string conversion kernels had only int64 offset specializations.
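
To illustrate why a missing specialization surfaces as a failure rather than a silent conversion, here is a toy model of a kernel table keyed by kernel name and offsets dtype. This is only a sketch: `KERNELS`, `call_kernel`, and the `string_lengths` kernel are hypothetical names made up for this example and are not Awkward Array's internals.

```python
# Toy model of dtype-keyed kernel lookup. All names are hypothetical
# and exist only for this illustration.

def _string_lengths(offsets):
    """Length of each string, computed from consecutive offsets."""
    return [stop - start for start, stop in zip(offsets, offsets[1:])]

# Before the fix: only an int64 specialization is registered.
KERNELS = {("string_lengths", "int64"): _string_lengths}

def call_kernel(name, offsets_dtype, offsets):
    """Look up the specialization for this offsets dtype and run it."""
    try:
        kernel = KERNELS[(name, offsets_dtype)]
    except KeyError:
        raise TypeError(
            f"no {name} specialization for {offsets_dtype} offsets"
        ) from None
    return kernel(offsets)

offsets = [0, 1, 4, 4]  # three strings with lengths 1, 3, 0

lengths_int64 = call_kernel("string_lengths", "int64", offsets)

# int32 offsets (as produced by the Parquet round trip) find no kernel.
try:
    call_kernel("string_lengths", "int32", offsets)
    failed_before_fix = False
except TypeError:
    failed_before_fix = True

# The fix, in this toy model: register the same logic for more dtypes.
for dtype in ("int32", "uint32"):
    KERNELS[("string_lengths", dtype)] = _string_lengths

lengths_int32 = call_kernel("string_lengths", "int32", offsets)
```

The logic is identical for every offsets width; only the registration differs, which matches the shape of this PR (new specializations, no algorithm change).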

Added int32 and uint32 kernel specializations for the three UTF8/padding kernels:

  • awkward_NumpyArray_prepare_utf8_to_utf32_padded
  • awkward_NumpyArray_utf8_to_utf32_padded
  • awkward_NumpyArray_pad_zero_to_length
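
For readers unfamiliar with these kernels, the following is a rough pure-Python sketch of what each one appears to compute, judging by its name. The function signatures and the assumption that `target` is at least the longest string's byte length are mine, not taken from the C++ sources. Offsets are modeled with the standard-library `array` module (`"i"` is typically a 32-bit int, `"q"` a 64-bit int) to show that the logic does not depend on the offsets width.

```python
from array import array

def prepare_utf8_to_utf32_padded(data, offsets):
    """Return the largest number of code points in any one string."""
    max_codepoints = 0
    for start, stop in zip(offsets, offsets[1:]):
        # UTF-8 continuation bytes look like 10xxxxxx; every other
        # byte starts a new code point.
        n = sum(1 for b in data[start:stop] if (b & 0xC0) != 0x80)
        max_codepoints = max(max_codepoints, n)
    return max_codepoints

def utf8_to_utf32_padded(data, offsets, max_codepoints):
    """Decode each string to code points, zero-padded to a fixed width."""
    out = []
    for start, stop in zip(offsets, offsets[1:]):
        codepoints = [ord(ch) for ch in data[start:stop].decode("utf-8")]
        out.extend(codepoints + [0] * (max_codepoints - len(codepoints)))
    return out

def pad_zero_to_length(data, offsets, target):
    """Copy each string's bytes, zero-padded to `target` bytes.

    Assumes target >= the longest string's byte length.
    """
    out = bytearray()
    for start, stop in zip(offsets, offsets[1:]):
        chunk = data[start:stop]
        out += chunk + bytes(target - len(chunk))
    return bytes(out)

# Three strings, "hé", "abc", "", stored as one UTF-8 buffer plus offsets.
data = "héabc".encode("utf-8")          # 6 bytes: é takes two
offsets32 = array("i", [0, 3, 6, 6])    # int32-style offsets
offsets64 = array("q", [0, 3, 6, 6])    # int64-style offsets

width = prepare_utf8_to_utf32_padded(data, offsets32)
utf32_from_32 = utf8_to_utf32_padded(data, offsets32, width)
utf32_from_64 = utf8_to_utf32_padded(data, offsets64, width)
padded_bytes = pad_zero_to_length(data, offsets32, 3)
```

Both offsets widths give the same output, which is why adding int32 and uint32 specializations is enough to restore the NumPy conversion.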

@codecov

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.72%. Comparing base (b749e49) to head (d4cd268).
⚠️ Report is 460 commits behind head on main.

Additional details and impacted files

see 200 files with indirect coverage changes

@DylanModesitt
Contributor Author

DylanModesitt commented Oct 24, 2025

Think the GPU test failures are unrelated? Seems like a CMake/compiler configuration issue in the action.

@ianna
Member

ianna commented Oct 24, 2025

> Think the GPU test failures are unrelated? Seems like a CMake/compiler configuration issue in the action.

I think so too. The CUDA kernels have been implemented correctly already. I'll have a look. Thanks!

@github-actions

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3697

@ariostas
Member

The failure on macos-14 was due to a stale cache, so I cleared the cache and re-ran it. The GPU failure is pretty confusing because other PRs work fine. I'll look into it some more.

@ariostas
Member

Oh it seems like it's also a cache issue. In other PRs it's using a cached wheel, but in this one it's not. I'll fix it.

Member

@ianna ianna left a comment

@DylanModesitt - Great! Thanks for fixing it. The tests pass, I'll enable auto-merge. Thanks.

@ianna ianna merged commit 845b92a into scikit-hep:main Oct 31, 2025
43 checks passed

Development

Successfully merging this pull request may close these issues.

Parquet string serialization breaks conversion to numpy

3 participants