ARROW-15989: [R] rbind & cbind for Table & RecordBatch #12751
Conversation
r/R/record-batch.R
Outdated
This feels a little silly to me, so would welcome suggestions for a better way to write this :)
You should be able to do RecordBatch$create(!!! args, schema = schema) (because we use rlang::list2() to ingest ..., which allows 'splicing' via !!!).
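For anyone following along, a minimal sketch of what that splicing looks like; the column values and schema here are made up for illustration:

library(arrow)

# RecordBatch$create() collects ... with rlang::list2(), so a pre-built,
# named list of columns can be spliced in with `!!!`:
args <- list(a = 1:3, b = c("x", "y", "z"))
sch <- schema(a = int32(), b = utf8())

RecordBatch$create(!!!args, schema = sch)
#> RecordBatch
#> 3 rows x 2 columns
#> $a <int32>
#> $b <string>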
paleolimbot
left a comment
This will be very cool! I have a few notes here about vctrs::vec_cbind()...it's worth checking out how much of vctrs::vec_cbind() and vctrs::vec_rbind() we can emulate without turning this into a huge rabbit hole. The big features are:
- rbind() doesn't care about column order (just names)
- cbind() can recycle size-1 arguments to the length of the other arguments
- rbind() will fill missing columns with NA
Whatever subset of that is easy to achieve is probably worth putting in here (but maybe not all of it since vctrs has a lot of tools to help make all that happen automatically).
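For reference, here is roughly what those vctrs behaviours look like on plain data frames (this is just vctrs itself, not the Arrow implementation):

library(vctrs)

# vec_rbind() matches columns by name (order doesn't matter) and fills
# missing columns with NA:
vec_rbind(
  data.frame(x = 1:2, y = c("a", "b")),
  data.frame(y = "c", x = 3L),
  data.frame(x = 4L)
)
#>   x    y
#> 1 1    a
#> 2 2    b
#> 3 3    c
#> 4 4 <NA>

# vec_cbind() recycles size-1 inputs to the common size:
vec_cbind(data.frame(x = 1:3), y = 0)
#>   x y
#> 1 1 0
#> 2 2 0
#> 3 3 0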
r/R/chunked-array.R
Outdated
- Maybe chunked_arrays instead of arrays? (Given that both appear here and the difference matters.)
- If you can rig it so that you do ChunkedArray$create(!!! list_of_arrays) it might be cleaner (and might give a nicer error message when it fails).
- What happens when the arrays have different types? I think I rigged something in concat_arrays() so that you can specify the final type, which might be worth adopting here too.
Also, it's possible for somebody to do c(this_is_a_chunked_array, 1:10, c("a", "b", "c")) (so you might want to do chunked_arrays <- lapply(chunked_arrays, ChunkedArray$create)).
So if I modify ChunkedArray$create() to accept a chunked array as an argument, it basically becomes its own c() implementation. It also already has a type parameter that handles coercing to a common type, so I think that's handled.
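In other words, the c() method can stay a thin wrapper. A rough sketch of that idea (not necessarily the exact code in this PR):

library(arrow)

# c() dispatches on the class of the first argument, so once
# ChunkedArray$create() accepts chunked arrays (flattening them into chunks)
# and handles type coercion, the method can simply forward its arguments:
c.ChunkedArray <- function(...) {
  ChunkedArray$create(...)
}

a <- chunked_array(1:3, 4:5)
b <- chunked_array(6:8)
combined <- c(a, b, 9:10)  # plain R vectors become additional chunks
combined$num_chunks
#> [1] 4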
@paleolimbot For the column filling, should that be the behavior of rbind, or an alternative row-binding function? For example, I think the tidyverse separates these behaviors (rbind() vs bind_rows()):

> library(dplyr)
> rbind(tibble(x = 1:2), tibble(y = 2:4))
Error in match.names(clabs, names(xi)) :
  names do not match previous names
> bind_rows(tibble(x = 1:2), tibble(y = 2:4))
# A tibble: 5 × 2
      x     y
  <int> <int>
1     1    NA
2     2    NA
3    NA     2
4    NA     3
5    NA     4
That's a good point...
I 100% agree. However, there seems to be an issue on the C++ side blocking this. I've created a new issue to tackle that: https://issues.apache.org/jira/browse/ARROW-16085
Got the […]
Seems that S3 dispatch isn't behaving nicely in R 3.6 for […]
The reason for the odd dispatch is that R 3.6 looks for a […]
That is wild! A way around that would be to define our own […]
That's a good point. I think the R 4.0+ behavior works well, and I'm not sure how much longer we are supporting 3.6, so what I've done for now is just modify the tests to not enforce that particular case in R < 4.0.
FYI @paleolimbot this is ready for re-review
paleolimbot
left a comment
This will be a really useful new feature!
When doing a thorough review of a new feature like this one, it's useful for me to have a reprex to start from (sorry for not mentioning this in the last review). Below is what I came up with; it exposed a few rough edges (I've tried to put potential fixes next to the relevant lines in the review).
I notice you go back and forth between e.g., chunked_array() and ChunkedArray$create() / RecordBatch$create() and record_batch(). Recently I've seen record_batch() used more frequently in tests although I think either style is fine as long as it matches the other calls in the respective source files.
# remotes::install_github("apache/arrow/r#12751")
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#> [
#> 1,
#> 2
#> ],
#> [
#> 3
#> ],
#> [
#> 4,
#> 5
#> ],
#> [
#> 6
#> ]
#> ]
# RecordBatch doesn't support rbind
rbind(
RecordBatch$create(a = 1:10),
RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use Table$create to combine record batches
# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use concat_arrays() to create a new array, ChunkedArray$create() to create a zero-copy concatenation
# cbind() for RecordBatch handles a variety of input types
cbind(
record_batch(a = 1:2),
record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
record_batch(a = 1:2),
b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments (this error needs some work)
cbind(
record_batch(a = 1:2),
b = Array$create(4:6)
)
#> Error in sprintf("Input ..%d cannot be converted to an Arrow Array", idx): invalid format '%d'; use format %s for character objects
# cbind() for Table works the same
cbind(
arrow_table(a = 1:2),
arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
arrow_table(a = 1:2),
b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments (this error needs some work)
cbind(
arrow_table(a = 1:2),
b = Array$create(4:6)
)
#> Error in sprintf("Input ..%i cannot be converted to an Arrow Array: %s", : invalid format '%i'; use format %s for character objects
# ...rbind works for Tables with a common schema
rbind(
arrow_table(a = 1:2),
arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>
# ...with nice errors for Tables with non-unifiable schemas (this error should use `abort()`)
rbind(
arrow_table(a = 1:2),
arrow_table(b = 3:4)
)
#> Error in rbind(deparse.level, ...): Schema at index 2 does not match the first schema
#> Schema 1:
#> a: int32
#> Schema 2:
#> b: int32
# did I miss anything else?

Created on 2022-04-08 by the reprex package (v2.0.1)
r/R/record-batch.R
Outdated
My reading of this is that idx might be an integer (if all arguments are unnamed), "" (if some arguments were named), or "the_arg_name". Your code below assumes some of both, and you probably want lapply(seq_along(inputs), function(i) {...}) so that you can access both the name and the location when you need it.
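A tiny sketch of that pattern (hypothetical helper, just to show the shape):

# Iterate over positions so both the (possibly empty) name and the location
# are available, e.g. for building "..2"-style labels in error messages:
label_inputs <- function(inputs) {
  vapply(seq_along(inputs), function(i) {
    nm <- names(inputs)[i]
    if (is.null(nm) || !nzchar(nm)) sprintf("..%d", i) else nm
  }, character(1))
}

label_inputs(list(a = 1:2, 3:4, b = 5:6))
#> [1] "a"   "..2" "b"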
Thanks. I'm not a fan of how base R handles the case of no names:
> cbind(data.frame(x = 1:2), b = 3:4, 5:6, c("a", "b"))
  x b 5:6 c("a", "b")
1 1 3   5           a
2 2 4   6           b

So I opted for using X{i} instead:
> as.data.frame(cbind(record_batch(x = 1:2), b = 3:4, 5:6, c("a", "b")))
  x b X3 X4
1 1 3  5  a
2 2 4  6  b

LMK what you think
I think that works! I would hope that nobody relies on this behaviour but it's definitely better than the base R case.
Force-pushed from 6edd7ff to df1d233
I've improved the error messages and added support for […]

library(arrow, warn.conflicts = FALSE)
# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#> [
#> 1,
#> 2
#> ],
#> [
#> 3
#> ],
#> [
#> 4,
#> 5
#> ],
#> [
#> 6
#> ]
#> ]
# RecordBatch doesn't support rbind
rbind(
RecordBatch$create(a = 1:10),
RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches
# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.
# cbind() for RecordBatch handles a variety of input types
cbind(
record_batch(a = 1:2),
record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
record_batch(a = 1:2),
data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
record_batch(a = 1:2),
b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments
cbind(
record_batch(a = 1:2),
b = Array$create(4:6)
)
#> Error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3
cbind(
record_batch(a = 1:2),
b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector
# cbind() for Table works the same
cbind(
arrow_table(a = 1:2),
arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
arrow_table(a = 1:2),
data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
arrow_table(a = 1:2),
b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments
cbind(
arrow_table(a = 1:2),
b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3
cbind(
arrow_table(a = 1:2),
b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector
# ...rbind works for Tables with a common schema
rbind(
arrow_table(a = 1:2),
arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>
# ...with nice errors for Tables with non-unifiable schemas
rbind(
arrow_table(a = 1:2),
arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32
paleolimbot
left a comment
Just a few tidbits to polish it up! This will be awesome!
r/R/table.R
Outdated
I think you want to move call <- caller_env() to inside the function...it's not clear to me that anybody would ever pass it as an argument. If you do need it as an argument, use .call instead of call for the name (in the off chance anybody uses this to add a column named call...).
Actually, on re-reading where you use this, I think you can avoid it entirely (I think rlang is smart enough to jump over the tryCatch when signaling an error condition with a parent but definitely check). If you do need it, I think you want sys.call(1) rather than caller_env().
If I move caller_env() inside of the tryCatch(), the error comes out as:
#> Error in `tryCatchOne()`:
I can move the caller_env() to just inside the function and that works.
Just inside the function is what I had in mind.
Are you sure about caller_env() though? I really think that sys.call() is what you're after.
some_fun <- function() {
user_call <- sys.call()
tryCatch(
stop("whoa!"),
error = function(e) {
rlang::abort("whoa revamped!", call = user_call, parent = e)
}
)
}
some_fun()
#> Error in `some_fun()`:
#> ! whoa revamped!
#> Caused by error in `doTryCatch()`:
#> ! whoa!
some_fun2 <- function() {
user_call <- rlang::caller_env()
tryCatch(
stop("whoa!"),
error = function(e) {
rlang::abort("whoa revamped!", call = user_call, parent = e)
}
)
}
some_fun2()
#> Error:
#> ! whoa revamped!
#> Caused by error in `doTryCatch()`:
#> ! whoa!

Created on 2022-04-12 by the reprex package (v2.0.1)
Huh, that's very odd. I get the same thing with your reprex there, but in my reprex below it is working well as is:
cbind(
record_batch(a = 1:2),
b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3
cbind(
record_batch(a = 1:2),
b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector
r/R/record-batch.R
Outdated
The "name repair" operation is actually pretty complicated and I think you probably want to avoid it (maybe error for unnamed Array/R vector?). If you do want to support unnamed array-like things, you could use inputs <- rlang::enquos(..., .named = TRUE), base::make.names(), or vctrs::vec_as_names() to avoid rolling your own logic.
Yeah I'm not a fan of the default name repair (not sure if you saw #12751 (comment)). But if it's preferable to just not support rather than use a simpler solution than the canonical one, I'm fine with that.
I ended up going with base::make.names(), which was simple enough to use.
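For the record, this is roughly what base::make.names() does with empty names; the exact generated names depend on how it's applied in the PR, so treat this as illustration only:

# Empty names get an "X" prepended, and unique = TRUE deduplicates:
make.names(c("a", "b", "", ""), unique = TRUE)
#> [1] "a"   "b"   "X"   "X.1"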
Errors now show […]

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#> [
#> 1,
#> 2
#> ],
#> [
#> 3
#> ],
#> [
#> 4,
#> 5
#> ],
#> [
#> 6
#> ]
#> ]
# RecordBatch doesn't support rbind
rbind(
RecordBatch$create(a = 1:10),
RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches
# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.
# cbind() for RecordBatch handles a variety of input types
cbind(
record_batch(a = 1:2),
record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
record_batch(a = 1:2),
data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
record_batch(a = 1:2),
b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments
cbind(
record_batch(a = 1:2),
b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3
cbind(
record_batch(a = 1:2),
b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector
# And can make up names for vector input types
cbind(
record_batch(a = 1:2),
b = Array$create(4:5),
c("a", "b")
)
#> RecordBatch
#> 2 rows x 3 columns
#> $a <int32>
#> $b <int32>
#> $X.1 <string>
# cbind() for Table works the same
cbind(
arrow_table(a = 1:2),
arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
arrow_table(a = 1:2),
data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
cbind(
arrow_table(a = 1:2),
b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>
# ...with nice errors for invalid arguments
cbind(
arrow_table(a = 1:2),
b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3
cbind(
arrow_table(a = 1:2),
b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector
# ...rbind works for Tables with a common schema
rbind(
arrow_table(a = 1:2),
arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>
# ...with nice errors for Tables with non-unifiable schemas
rbind(
arrow_table(a = 1:2),
arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32
r/R/record-batch.R
Outdated
Do you need this function? I'm pretty sure the RecordBatch constructor in C++ will check lengths (again).
The main value of the function is to provide an error message indicating which of your inputs is not the correct length. But if we didn't care about that, we could rely on the RB constructor.
Any reason the RB constructor couldn't tell you which input was the wrong length?
RB constructor inputs are always arrays, right? Whereas here, the first input is guaranteed to be a record batch. So there's not a 1:1 correspondence between arguments there that would let us provide a helpful error message.
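To illustrate the point, a hypothetical version of such a check; the helper name and message are made up and are not the PR's actual code:

# Report *which* input has an incompatible number of rows, something the
# RecordBatch constructor (which only sees bare arrays) can't easily do:
check_input_lengths <- function(inputs, expected_rows) {
  lens <- vapply(inputs, NROW, integer(1))
  bad <- which(lens != 1L & lens != expected_rows)
  if (length(bad) > 0) {
    stop(sprintf(
      "Non-scalar inputs must have %d rows; input ..%d has %d rows",
      expected_rows, bad[1], lens[bad[1]]
    ), call. = FALSE)
  }
  invisible(inputs)
}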
r/R/record-batch.R
Outdated
Also just confirming: do you need this here or will the C++ constructor do this for you?
Not sure what you mean by "this". Repeating the values? Constructing a single column batch? I'll look at the C++ constructor and see what I can simplify.
The single value recycling is what I meant, sorry.
Yeah maybe I can refactor to not construct a temporary record batch.
Did this; we now construct a flattened list of columns out of the inputs.
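A minimal illustration of the recycling step being discussed (hypothetical helper, not the PR's exact code):

# Length-one inputs get repeated to the target number of rows before the
# flattened list of columns is handed to RecordBatch$create():
recycle_scalar <- function(x, nrow) {
  if (NROW(x) == 1L && nrow != 1L) rep(x, nrow) else x
}

recycle_scalar(5, nrow = 3)
#> [1] 5 5 5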
r/R/table.R
Outdated
There is a ConcatenateTables C++ function in table.h. Any reason not to use it?
No, I'll switch to using that. That also opens up the possibility of merging schemas since the existing function has an option for that. I think I'll also add a concat_tables(..., unify_schemas = TRUE) function, since I don't think we want that default behavior in rbind.
Great idea
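For anyone reading along, this is how the concat_tables() function proposed above is intended to be used (a sketch based on the signature mentioned in the thread; check the merged documentation for the final API):

library(arrow)

t1 <- arrow_table(x = 1:2, y = c("a", "b"))
t2 <- arrow_table(y = "c")

# rbind() requires matching schemas; concat_tables() with
# unify_schemas = TRUE unifies them and fills missing columns with nulls:
concat_tables(t1, t2, unify_schemas = TRUE)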
r/tests/testthat/test-RecordBatch.R
Outdated
Is this desirable? FWIW we don't accept this directly in record_batch() (though the error message is not awesome):
> record_batch(a=1:2, 3:4)
Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced
We can disallow it if we want; I felt like it was expected behavior in the cbind API, but as you point out there are also differences in the constructor method. If it makes the code a lot simpler, I can remove this behavior.
IMO we should be consistent, and this is one of those behaviors where R is a little too loose, so I'd drop it and be more strict.
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Force-pushed from 2b56769 to 50eb22d
nealrichardson
left a comment
This is a nice contribution.
In addition to the comments and suggestions on specific lines, I suspect we need more tests. There are lots of combinations of inputs out there, and I think the corner cases should be explored too (e.g. cbinding tables with 0 rows or 0 length vectors, cbind/rbind with only 1 argument, c.ChunkedArray with type mismatch, etc.)
r/R/table.R
Outdated
Yes, except in cbind.Table, chunked arrays and tables are allowed.

Two thoughts here, feel free to ignore or defer to a followup:

- Given how you've simplified, I wonder if that's a problem. If you passed chunked arrays through to RecordBatch$create, it should error, and hopefully with an informative message. In fact, if I read the latest version correctly, cbind.RecordBatch isn't asserting that it doesn't get chunked arrays, so you're already relying on that behavior.
- Maybe @paleolimbot's as_* S3 generics PR would help? cbind.RecordBatch would as_arrow_array its input columns, and cbind.Table would as_chunked_array its inputs (rough sketch below).
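A rough sketch of that suggestion, assuming the as_arrow_array()/as_chunked_array() generics from the referenced PR are available; the helper here is hypothetical:

# Coerce each input column with the appropriate generic before assembling
# the result, so cbind.RecordBatch ends up with Arrays and cbind.Table ends
# up with ChunkedArrays:
coerce_columns <- function(columns, to = c("array", "chunked_array")) {
  to <- match.arg(to)
  if (to == "array") {
    lapply(columns, as_arrow_array)
  } else {
    lapply(columns, as_chunked_array)
  }
}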
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
r/R/record-batch.R
Outdated
as.list(input)
} else if (inherits(input, "Table") || inherits(input, "ChunkedArray")) {
  abort("Cannot cbind a RecordBatch with Tables or ChunkedArrays",
    i = "Hint: consider converting the record batch into a Table first")
| i = "Hint: consider converting the record batch into a Table first") | |
| i = "Hint: consider converting the RecordBatch into a Table first") |
nealrichardson
left a comment
Other than adding bindings for ConcatenateTables, this looks good to me. We'll probably want to revisit some of this after #12817 merges, but that's fine.
tables <- list2(...)

if (!unify_schemas) {
  # assert they have same schema
What happens if you don't do this? (Does the C++ library also do this checking?)
We actually get a pretty good error from C++ (as long as R users will recognize the 0-based indexing):
Error: Invalid: Schema at index 1 was different:
a: int32
vs
a: string
What do you think?
It feels silly to reimplement error handling, but it's also suboptimal to have the error report indexes that are "wrong" in context. For now I've pushed up a fix that includes the errors, but we could also delete the R-side error handling and rely on the C++ message.
Your call
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
nealrichardson
left a comment
Nice work!
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
FYI, on second thought, I'd rather do the integration work with the S3 generics in a follow-up PR.
@github-actions crossbow submit -g r
Revision: 32a1d7e
Submitted crossbow builds: ursacomputing/crossbow @ actions-1905
@nealrichardson can we merge this?
Benchmark runs are scheduled for baseline = 61f47a4 and contender = f23965c. f23965c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Summary of Changes
- rbind and cbind for Table
- cbind for RecordBatch. rbind just redirects the user to use Table$create
- c.Array() implementation to use either concat_array() or ChunkedArray$create(), depending on whether the user wants a single array or zero-copy.
- c.ChunkedArray