GH-39738: [R] Support build against the last three released versions of Arrow#39739
GH-39738: [R] Support build against the last three released versions of Arrow#39739assignUser merged 21 commits intoapache:mainfrom
Conversation
|
|
.github/workflows/r.yml
Outdated
| - { cpp_version: "15.0.0" } | ||
| - { cpp_version: "14.0.2" } | ||
| - { cpp_version: "13.0.0" } |
There was a problem hiding this comment.
I see your R code uses arithmetic 👌 but since this hardcodes the versions, I wonder if we have an automation of some sort that would minimize the chance this list gets out of date without someone noticing.
There was a problem hiding this comment.
I agree that would be ideal! I bet we could do something like query git tags with some R code 👍
There was a problem hiding this comment.
I'll look into that and can put it up in another PR.
There was a problem hiding this comment.
This will also work and is a bit more compact.
| - { cpp_version: "15.0.0" } | |
| - { cpp_version: "14.0.2" } | |
| - { cpp_version: "13.0.0" } | |
| cpp_version: [ "15.0.0" , "14.0.2", "13.0.0" ] |
We can also add updating this to the post release scripts. Do we really want to test against all release versions since minimum version? That will ballon pretty quickly...
Is it possible to have an incompatibility of say R 16.0.0 with libarrow 15.0.0 that doesn't happen with libarrow 13.0.0 (given that all versions/commits in between have been tested? If so maybe testing against last release and minimum version would be enough? If not just testing against minimum version should be sufficent?
There was a problem hiding this comment.
Ah sorry I missed the comment in check-versions.R. The last 3 versions seems like a decent support window and no 🎈 ing :D We should note this in the docs as well.
There was a problem hiding this comment.
I initially was thinking four but I think three is good (since we would also support the in-development version, and at the end of that cycle it would mean ~1 year of supported Arrow versions). I'll work on something to make sure this isn't hard-coded as well.
.github/workflows/r.yml
Outdated
| - name: Build Arrow C++ (${{ matrix.config.cpp_version }}) | ||
| if: steps.cache-arrow-build.outputs.cache-hit != 'true' | ||
| run: | | ||
| curl -L https://github.com/apache/arrow/archive/refs/tags/apache-arrow-${{ matrix.config.cpp_version }}.tar.gz | \ |
There was a problem hiding this comment.
Could you use the official source archive instead of auto generated source archive?
| curl -L https://github.com/apache/arrow/archive/refs/tags/apache-arrow-${{ matrix.config.cpp_version }}.tar.gz | \ | |
| curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-${{ matrix.config.cpp_version }}/apache-arrow-${{ matrix.config.cpp_version }}.tar.gz" | \ |
There was a problem hiding this comment.
Or how about using the official Ubuntu binaries?
sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V libarrow-dataset-dev=${{ matrix.config.cpp_version }}-1There was a problem hiding this comment.
I'll give the official binaries a shot!
assignUser
left a comment
There was a problem hiding this comment.
Very cool, thanks for working on this. I think we should use pre-compiled binaries otherwise 👍
.github/workflows/r.yml
Outdated
| - { cpp_version: "15.0.0" } | ||
| - { cpp_version: "14.0.2" } | ||
| - { cpp_version: "13.0.0" } |
There was a problem hiding this comment.
This will also work and is a bit more compact.
| - { cpp_version: "15.0.0" } | |
| - { cpp_version: "14.0.2" } | |
| - { cpp_version: "13.0.0" } | |
| cpp_version: [ "15.0.0" , "14.0.2", "13.0.0" ] |
We can also add updating this to the post release scripts. Do we really want to test against all release versions since minimum version? That will ballon pretty quickly...
Is it possible to have an incompatibility of say R 16.0.0 with libarrow 15.0.0 that doesn't happen with libarrow 13.0.0 (given that all versions/commits in between have been tested? If so maybe testing against last release and minimum version would be enough? If not just testing against minimum version should be sufficent?
.github/workflows/r.yml
Outdated
| - { cpp_version: "15.0.0" } | ||
| - { cpp_version: "14.0.2" } | ||
| - { cpp_version: "13.0.0" } |
There was a problem hiding this comment.
Ah sorry I missed the comment in check-versions.R. The last 3 versions seems like a decent support window and no 🎈 ing :D We should note this in the docs as well.
|
Thanks @paleolimbot, in which case it sounds good to me, and I agree that being at this point in the release cycle reduces the risks here. @jonkeane /@nealrichardson Keen to get your thoughts here too, if not for changes to this PR, then more generally things we may want to think about if we go down this path. |
nealrichardson
left a comment
There was a problem hiding this comment.
Interesting idea, I like starting by adding the CI for this.
I think this needs some documentation, which could follow in a separate PR . Writing the docs would help clarify some of the expectations and behaviors that could result from loosening the hard version requirement. For example, if we're now inviting people to install libarrow from system packages, now we're back in the world where someone could upgrade libarrow separately and get undefined symbol problems because the preprocessor statements in the R bindings effectively lock you to the libarrow version you compiled against. Not saying that's a reason not to do this, but it becomes a concern that we've made go away by doing strict version comparison now.
Likewise, I think it would be better to set an explicit version that we support down to, rather than say "last 3", because if things have truly stabilized in the C++ library/ABI, we may not need to drop support for an older version every time we release the next version.
We may also consider listing this arrow C++ >= version in SystemRequirements in DESCRIPTION too.
r/PACKAGING.md
Outdated
| ## Before the release candidate is cut | ||
|
|
||
| - [ ] [Create a GitHub issue](https://github.com/apache/arrow/issues/new/) entitled `[R] CRAN packaging checklist for version X.X.X` and copy this checklist to the issue. | ||
| - [ ] Review deprecated functions to advance their deprecation status, including removing preprocessor directives that no longer apply (e.g., oudated R versions, Arrow C++ versions older than the last three major versions). |
There was a problem hiding this comment.
I'd suggest being more concrete about what to review. For example, search for #if ARROW_VERSION_MAJOR (and whatever the R version one is) in r/src.
.github/workflows/r.yml
Outdated
| strategy: | ||
| matrix: | ||
| # 14.0.2 apt binaries are missing but contain only minor changes | ||
| cpp_version: ["15.0.0" , "14.0.1", "13.0.0"] |
There was a problem hiding this comment.
Where do we remind ourselves to add a new version to this list, after a release?
When do we decide to drop old versions? Provided that this CI job isn't too expensive, seems fine to keep 13.0.0 on there indefinitely, there's no particular maintenance burden associated with it (the only check I see is for < 15.0.0). Since it's in a matrix build, it may be effectively free (to us) to let this list grow.
Is there a reason we can't build on 12.0, or 11.0 for that matter?
There was a problem hiding this comment.
Does this also make assumptions that patch releases are ABI-compatible with the major releases? That's probably usually true.
There was a problem hiding this comment.
Since it's in a matrix build, it may be effectively free (to us) to let this list grow.
Eh technically true but a bit wasteful in regards to community resources (tragedy of the commons etc.) imo. At least as long as there is no specific argument that we actually need to test against all old versions vs. only against the minimum supported version + latest stable or something (see comment above, have no idea if this is the case)
patch releases are ABI-compatible with the major releases?
Otherwise they could not be patch releases (at least in theory...).
There was a problem hiding this comment.
Where do we remind ourselves to add a new version to this list, after a release?
We can make it part of the post release scripts.
There was a problem hiding this comment.
assumptions that patch releases are ABI-compatible with the major releases?
I don't think we make the assumption that any particular Arrow release is ABI compatible with itself (e.g., if it was compiled with different features enabled). If we do, we should not!
We can make it part of the post release scripts.
Where does that happen? I imagine I should do it in this PR?
There was a problem hiding this comment.
At least as long as there is no specific argument that we actually need to test against all old versions vs. only against the minimum supported version + latest stable or something
I'd think you'd want to test against each version you have a conditional check for.
Hopefully these are cheap jobs to run since you don't have to compile libarrow in them. Could always move them to nightly to save compute if that's a concern, as we do for testing against older R versions.
There was a problem hiding this comment.
I'd think you'd want to test against each version you have a conditional check for.
That makes sense, so in this case test against 13.0.0 as the minimum version, 15.0.0 due to the #if and latest release (aka 15.0.0 as well atm)?
Hopefully these are cheap jobs to run
They are pretty quick with ~7 min, maybe I am a bit to zealous in keep total ci time low (after all checking 10 version would be one of the other C++ jobs with a cache miss...). But having this discussion and documenting what we support and test is beneficial in any case :)
Where does that happen? I imagine I should do it in this PR?
The version bump script is run after the release and binary upload and matches thematically: dev/release/post-11-bump-versions.sh unless @raulcd disagrees?
r/tools/check-versions.R
Outdated
|
|
||
| # Currently the last three released versions of Arrow C++ | ||
| # (>= 13.0.0 at the time this check was added) | ||
| major(cpp_version) %in% (major(r_version) - 0:2) |
There was a problem hiding this comment.
This feels brittle and opaque. IMO it would be clearer if we had an explicit version that the C++ library must be greater than or equal to. Then it's just cpp_version >= that.
There was a problem hiding this comment.
I think we should keep a very tight handle on how many releases we support (or at least, I can't commit to maintaining the preprocessor statements for any more than four versions: dev + three previous).
There was a problem hiding this comment.
Right, but right now, there only one preprocessor statement, and it's tied to a specific version (15). So after 16.0 comes out, there's no reason to "drop" support for 13.0, it still works. Even when there are more, and they add maintenance burden, you drop support when you remove the preprocessor statements, which are tied to versions, so you're still talking about a minimum version you support. If you decide to be aggressive about dropping them, that's fine, I just think it's cleaner to express it as "libarrow >= X.Y" than "last three versions". Easier for users to understand, easier to make sure you're updating all the right places, etc.
There was a problem hiding this comment.
I agree that this line is quite opaque. It took me a decent bit of parsing to figure out what it's actually doing here.
jonkeane
left a comment
There was a problem hiding this comment.
Thanks for this. Agreed with a lot of what was said above.
I'm especially curious about
back in the world where someone could upgrade libarrow separately and get undefined symbol problems because the preprocessor statements in the R bindings effectively lock you to the libarrow version you compiled against
Was how we will prevent that covered somewhere?
How do we want to alert folks to changes in behavior or functionality based on differences in libarrow versions? We already have the libarrow version in arrow_info(), but should we also add a section for disabled / different features? Is that something we should have as a message on load (like we do for minimal builds)? We might not explicitly need to do this now (like I commented below, I can't tell if the one preprocessor if statement is actually disabling anything). Even if we don't need this now, it would be good to have at least an idea of how we might deal with this when it inevitably comes up.
Finally, this isn't to say I'm opposed or this is a reason we shouldn't do this, but want to keep ourselves honest: I've heard a lot of arguments in the R package that we should avoid increasing developer friction because there isn't much maintainer time available. This will add to absolutely add friction. Do the benefits outweigh the friction here?
| #else | ||
| template <typename T> | ||
| struct RConverterTrait< | ||
| T, enable_if_t<!is_nested_type<T>::value && !is_interval_type<T>::value && | ||
| !is_extension_type<T>::value>> { | ||
| using type = RPrimitiveConverter<T>; | ||
| }; | ||
| #endif |
There was a problem hiding this comment.
I'm surprised there are no tests this impacts — do we not have tests for the functionality we added after version 15.0.0?
r/tools/check-versions.R
Outdated
|
|
||
| # Currently the last three released versions of Arrow C++ | ||
| # (>= 13.0.0 at the time this check was added) | ||
| major(cpp_version) %in% (major(r_version) - 0:2) |
There was a problem hiding this comment.
I agree that this line is quite opaque. It took me a decent bit of parsing to figure out what it's actually doing here.
Can you clarify what friction you mean? I don't see that much, only in |
|
Regarding the libarrow vs. R version mismatch, our vignette currently has https://github.com/apache/arrow/blob/main/r/vignettes/install.Rmd#L468-L491, which is a bit stale since we currently will not use system libarrow if the versions don't match, so we keep people from getting into this scenario unless they really try hard to. (Also note that this discussion is at L468, after we suggest many many other ways for installing the package, including the airgapped build.) That's not to say that we shouldn't do this--technically, it seems like a good idea to me to be more accepting of libarrow version if possible--but if we start encouraging it, we'll be (re)opening these failure modes, and we should prepare for that and factor that into our "maintenance burden" calculus. |
I'm also thinking about the bumps we'll need to do as part of the release process to move the versions forward, remove old case statements, write messaging around these warning folks if they are in a situation where a feature isn't supported with a specific pairing (both in docs and in the package itself). None of it is major or so much I think we shouldn't do this, but I've heard in other places that folks wanted to absolutely minimize the release tasks because there was limited bandwidth (one specific discussion I remember was around options during the build system reconfiguration back in October and what options we were supporting there). |
|
I reverted the change where check-versions.R allowed this to occur without opt-in...as Jon and Neal noted, this could add maintenance cost we don't have time for without removing the existing build system. Feel free to use or not use what's in this PR...I had a few free cycles and was wondering whether or not it would be hard to support a build against previous Arrow C++ versions. My personal conclusion is that it's not hard and that I'm happy to support the preprocessor stuff required to make it happen. |
|
Thinking some more about this: I think we say that we support building with libarrow >= 13, and we test in CI building current R package with (1) current libarrow and (2) libarrow 13. I don't know that we need to test every libarrow release between there. |
|
Thanks @assignUser! I was taking care of the kids this afternoon and missed the ping. I've lost track of the constraints here and trust you to do what you need to do for a CRAN submission...we can circle back on details when we have the time and energy to worry about them. |
…of Arrow (#39739) Development velocity of the R package has slowed considerably since early versions of Arrow such that the commit-level integration that we once relied on is no longer necessary. The ability to build against older versions of Arrow also opens up more options for our CRAN submissions, since we may be able to work with CRAN to build a version of Arrow C++ they are happy with. This change doesn't require us to *do* anything about it...it just adds a check so that we are aware of the first PR that breaks the ability to build against a previous version. There is a possibility that an accidentally but previously installed version will end up being used via pkg-config, which I believe is how the version checking came into existence in the first place. - An `#if` to guard code that was added to support the string view/binary view - Changes to the version checker script to not error for supported Arrow C++ versions - CI job that checks build against supported Arrow versions Yes, a CI job was added Yes, but I'll wait until there's consensus on this before documenting what our intended support policy will be. * Closes: #39738 Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 66b41c4. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…sions of Arrow (apache#39739) ### Rationale for this change Development velocity of the R package has slowed considerably since early versions of Arrow such that the commit-level integration that we once relied on is no longer necessary. The ability to build against older versions of Arrow also opens up more options for our CRAN submissions, since we may be able to work with CRAN to build a version of Arrow C++ they are happy with. This change doesn't require us to *do* anything about it...it just adds a check so that we are aware of the first PR that breaks the ability to build against a previous version. There is a possibility that an accidentally but previously installed version will end up being used via pkg-config, which I believe is how the version checking came into existence in the first place. ### What changes are included in this PR? - An `#if` to guard code that was added to support the string view/binary view - Changes to the version checker script to not error for supported Arrow C++ versions - CI job that checks build against supported Arrow versions ### Are these changes tested? Yes, a CI job was added ### Are there any user-facing changes? Yes, but I'll wait until there's consensus on this before documenting what our intended support policy will be. * Closes: apache#39738 Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
…of Arrow (#39739) ### Rationale for this change Development velocity of the R package has slowed considerably since early versions of Arrow such that the commit-level integration that we once relied on is no longer necessary. The ability to build against older versions of Arrow also opens up more options for our CRAN submissions, since we may be able to work with CRAN to build a version of Arrow C++ they are happy with. This change doesn't require us to *do* anything about it...it just adds a check so that we are aware of the first PR that breaks the ability to build against a previous version. There is a possibility that an accidentally but previously installed version will end up being used via pkg-config, which I believe is how the version checking came into existence in the first place. ### What changes are included in this PR? - An `#if` to guard code that was added to support the string view/binary view - Changes to the version checker script to not error for supported Arrow C++ versions - CI job that checks build against supported Arrow versions ### Are these changes tested? Yes, a CI job was added ### Are there any user-facing changes? Yes, but I'll wait until there's consensus on this before documenting what our intended support policy will be. * Closes: #39738 Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
…of Arrow (#39739) ### Rationale for this change Development velocity of the R package has slowed considerably since early versions of Arrow such that the commit-level integration that we once relied on is no longer necessary. The ability to build against older versions of Arrow also opens up more options for our CRAN submissions, since we may be able to work with CRAN to build a version of Arrow C++ they are happy with. This change doesn't require us to *do* anything about it...it just adds a check so that we are aware of the first PR that breaks the ability to build against a previous version. There is a possibility that an accidentally but previously installed version will end up being used via pkg-config, which I believe is how the version checking came into existence in the first place. ### What changes are included in this PR? - An `#if` to guard code that was added to support the string view/binary view - Changes to the version checker script to not error for supported Arrow C++ versions - CI job that checks build against supported Arrow versions ### Are these changes tested? Yes, a CI job was added ### Are there any user-facing changes? Yes, but I'll wait until there's consensus on this before documenting what our intended support policy will be. * Closes: #39738 Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com> Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Rationale for this change
Development velocity of the R package has slowed considerably since early versions of Arrow such that the commit-level integration that we once relied on is no longer necessary. The ability to build against older versions of Arrow also opens up more options for our CRAN submissions, since we may be able to work with CRAN to build a version of Arrow C++ they are happy with.
This change doesn't require us to do anything about it...it just adds a check so that we are aware of the first PR that breaks the ability to build against a previous version.
There is a possibility that an accidentally but previously installed version will end up being used via pkg-config, which I believe is how the version checking came into existence in the first place.
What changes are included in this PR?
#ifto guard code that was added to support the string view/binary viewAre these changes tested?
Yes, a CI job was added
Are there any user-facing changes?
Yes, but I'll wait until there's consensus on this before documenting what our intended support policy will be.