ARROW-15989: [R] rbind & cbind for Table & RecordBatch #12751

wjones127 · 2022-03-29T23:07:41Z

Summary of Changes

Added rbind and cbind for Table
Added cbind for RecordBatch. rbind just redirects the user to use Table$create
Changed c.Array() implementation to use either concat_array() or ChunkedArray$create() depending on whether the user wants a single array or zero-copy.
Implemented c.ChunkedArray

github-actions · 2022-03-29T23:08:01Z

https://issues.apache.org/jira/browse/ARROW-15989

wjones127 · 2022-03-29T23:09:25Z

r/R/record-batch.R

This feels a little silly to me, so would welcome suggestions for a better way to write this :)

You should be able to do RecordBatch$create(!!! args, schema = schema) (because we use rlang::list2() to ingest ..., which allows 'splicing' via !!!).

wjones127 · 2022-03-30T20:46:54Z

cc @jonkeane @paleolimbot

paleolimbot

This will be very cool! I have a few notes here about vctrs::vec_cbind()...it's worth checking out how much of vctrs::vec_cbind() and vctrs::vec_rbind() we can emulate without turning this into a huge rabbit hole. The big features are:

rbind() doesn't care about column order (just names)
cbind() can recycle size-1 arguments to the length of the other arguments
rbind() will fill missing columns with NA

Whatever subset of that is easy to achieve is probably worth putting in here (but maybe not all of it since vctrs has a lot of tools to help make all that happen automatically).

r/R/array.R

paleolimbot · 2022-03-31T01:35:38Z

r/R/chunked-array.R

Maybe chunked_arrays instead of arrays? (Given that both appear here and the difference matters)

If you can rig it so that you do ChunkedArrays$create(!!! list_of_arrays) it might be cleaner (and might give a nicer error message when it fails)

What happens when the arrays have different types? I think I rigged something in concat_arrays() so that you can specify the final type which might be worth adopting here too.

Also, it's possible for somebody to do c(this_is_a_chunked_array, 1:10, c("a", "b", "c)) (so you might want to do chunked_arrays <- lapply(chunked_arrays, ChunkedArray$create)).

So if I modify ChunkedArray$create() to accept a chunked array as an argument, it basically becomes it's own c() implementation. It also already has a type parameter the handles coercing to a common type, so I think that's handled.

r/R/record-batch.R

r/R/table.R

r/R/record-batch.R

r/tests/testthat/test-Array.R

r/tests/testthat/test-RecordBatch.R

wjones127 · 2022-03-31T17:19:34Z

@paleolimbot For the column filling, should that be the behavior of rbind, or an alternative row binding function? For example, I think tibble has separated behavior for each:

> library(dplyr)
> rbind(tibble(x = 1:2), tibble(y = 2:4))
Error in match.names(clabs, names(xi)) : 
  names do not match previous names
> bind_rows(tibble(x = 1:2), tibble(y = 2:4))
# A tibble: 5 × 2
      x     y
  <int> <int>
1     1    NA
2     2    NA
3    NA     2
4    NA     3
5    NA     4

paleolimbot · 2022-03-31T19:21:46Z

That's a good point...rbind() doesn't do that for any type of object. But it is really useful and difficult to replicate if there's a good way to do it here!

wjones127 · 2022-03-31T20:27:49Z

But it is really useful and difficult to replicate if there's a good way to do it here!

I 100% agree. However, there seems to be an issue on the C++ side blocking this. I've created a new issue to tackle that. https://issues.apache.org/jira/browse/ARROW-16085

wjones127 · 2022-03-31T22:26:43Z

Got the cbind implementation to work with more types, so it's now more analogous to base cbind.

wjones127 · 2022-04-01T16:21:56Z

Seems that S3 dispatch isn't behaving nicely in R 3.6 for cbind.Table:

> do.call(cbind.Table, inputs)
Table
2 rows x 5 columns
$z <int32>
$a <int32>
$b <int32>
$c <dictionary<values=string, indices=int8>>
$d <int32>
> do.call(cbind, inputs)
          b c d
 [1,] ? ? ? 1 1
 [2,] ? ? ? 2 1
 [3,] ? ? ? 1 1
 [4,] ? ? ? 2 1
 [5,] ? ? ? 1 1
 [6,] ? ? ? 2 1
 [7,] ? ? ? 1 1
 [8,] ? ? ? 2 1
 [9,] ? ? ? 1 1
[10,] ? ? ? 2 1
[11,] ? ? ? 1 1
[12,] ? ? ? 2 1
[13,] ? ? ? 1 1
[14,] ? ? ? 2 1
[15,] ? ? ? 1 1
[16,] ? ? ? 2 1
[17,] ? ? ? 1 1
[18,] ? ? ? 2 1
[19,] ? ? ? 1 1
[20,] ? ? ? 2 1
[21,] ? ? ? 1 1
[22,] ? ? ? 2 1
[23,] ? ? ? 1 1
[24,] ? ? ? 2 1
[25,] ? ? ? 1 1
[26,] ? ? ? 2 1
[27,] ? ? ? 1 1
[28,] ? ? ? 2 1
[29,] ? ? ? 1 1
[30,] ? ? ? 2 1
[31,] ? ? ? 1 1
[32,] ? ? ? 2 1
[33,] ? ? ? 1 1
[34,] ? ? ? 2 1
[35,] ? ? ? 1 1
[36,] ? ? ? 2 1
Warning message:
In (function (..., deparse.level = 1)  :
  number of rows of result is not a multiple of vector length (arg 2)

wjones127 · 2022-04-01T17:15:42Z

The reason for the odd dispatch is that R 3.6 looks for a cbind implementation for each argument, and if it finds multiple then it falls back to the built-in cbind. Our cbind.RecordBatch test only has one input with a cbind method (the record batch), but our cbind.Table test has two (record batch + table).

paleolimbot · 2022-04-01T17:31:47Z

That is wild! A way around that would be to define our own arrow_rbind() and arrow_cbind() and error for built-in rbind() and cbind() since it's harder to predict exactly which method will get called.

wjones127 · 2022-04-01T17:38:46Z

That is wild! A way around that would be to define our own arrow_rbind() and arrow_cbind() and error for built-in rbind() and cbind() since it's harder to predict exactly which method will get called.

That's a good point. I think the R 4.0+ behavior works well, and not sure how much longer we are supporting 3.6, so what I've done for now is just modify the tests to not enforce that particular case in R < 4.0.

wjones127 · 2022-04-08T17:08:13Z

FYI @paleolimbot this is ready for re-review

paleolimbot

This will be a really useful new feature!

It's useful for me to do a thorough review of new features like this one to have a reprex to start from (sorry for not mentioning this in the last review). Below is what I came up with, and it exposed a few rough edges (I've tried to put potential fixes next to the relevant lines in the review).

I notice you go back and forth between e.g., chunked_array() and ChunkedArray$create() / RecordBatch$create() and record_batch(). Recently I've seen record_batch() used more frequently in tests although I think either style is fine as long as it matches the other calls in the respective source files.

# remotes::install_github("apache/arrow/r#12751")
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use Table$create to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use concat_arrays() to create a new array, ChunkedArray$create() to create a zero-copy concatenation

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments (this error needs some work)
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in sprintf("Input ..%d cannot be converted to an Arrow Array", idx): invalid format '%d'; use format %s for character objects

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments (this error needs some work)
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in sprintf("Input ..%i cannot be converted to an Arrow Array: %s", : invalid format '%i'; use format %s for character objects

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas (this error should use `abort()`)
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in rbind(deparse.level, ...): Schema at index 2 does not match the first schema
#> Schema 1:
#> a: int32
#> Schema 2:
#> b: int32

# did I miss anything else?

^{Created on 2022-04-08 by the reprex package (v2.0.1)}

r/R/record-batch.R

paleolimbot · 2022-04-08T18:19:56Z

r/R/record-batch.R

My reading of this is that idx might be an integer (if all arguments are unnamed), "" (if some arguments were named), or "the_arg_name". Your code below assumes some of both, and you probably want lapply(seq_along(inputs), function(i) {...}) so that you can access both the name and the location when you need it.

Thanks. I'm not a fan of how base R handles the case of no names:

> cbind(data.frame(x = 1:2), b = 3:4, 5:6, c("a", "b")) x b 5:6 c("a", "b") 1 1 3 5 a 2 2 4 6 b

So I opted for using X{i} instead:

> as.data.frame(cbind(record_batch(x = 1:2), b = 3:4, 5:6, c("a", "b"))) x b X3 X4 1 1 3 5 a 2 2 4 6 b

LMK what you think

I think that works! I would hope that nobody relies on this behaviour but it's definitely better than the base R case.

r/tests/testthat/test-RecordBatch.R

r/tests/testthat/test-Table.R

r/tests/testthat/test-RecordBatch.R

r/tests/testthat/test-Table.R

r/tests/testthat/test-chunked-array.R

wjones127 · 2022-04-11T21:58:24Z

I've improved the error messages and added support for data.frame in cbind.

library(arrow, warn.conflicts = FALSE)

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3

cbind(
  record_batch(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3

cbind(
  arrow_table(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas 
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32

paleolimbot

Just a few tidbits to polish it up! This will be awesome!

paleolimbot · 2022-04-12T11:56:45Z

r/R/table.R

I think you want to move call <- caller_env() to inside the function...it's not clear to me that anybody would ever pass it as an argument. If you do need it as an argument, use .call instead of call for the name (in the off chance anybody uses this to add a column named call...).

Actually, on re-reading where you use this, I think you can avoid it entirely (I think rlang is smart enough to jump over the tryCatch when signaling an error condition with a parent but definitely check). If you do need it, I think you want sys.call(1) rather than caller_env().

If I move caller_env() inside of the tryCatch(), the error comes out as:

#> Error in `tryCatchOne()`:

I can just move the caller_env() just inside the function and that works.

Just inside the function is what I had in mind.

Are you sure about caller_env() though? I really think that sys.call() is what you're after.

some_fun <- function() { user_call <- sys.call() tryCatch( stop("whoa!"), error = function(e) { rlang::abort("whoa revamped!", call = user_call, parent = e) } ) } some_fun() #> Error in `some_fun()`: #> ! whoa revamped! #> Caused by error in `doTryCatch()`: #> ! whoa! some_fun2 <- function() { user_call <- rlang::caller_env() tryCatch( stop("whoa!"), error = function(e) { rlang::abort("whoa revamped!", call = user_call, parent = e) } ) } some_fun2() #> Error: #> ! whoa revamped! #> Caused by error in `doTryCatch()`: #> ! whoa!

^{Created on 2022-04-12 by the reprex package (v2.0.1)}

Huh that's very odd. I get the same thing with your reprex there, but in my reprex below it is working well as is:

cbind( record_batch(a = 1:2), b = Array$create(4:6) ) #> Error in `cbind()`: #> ! Non-scalar inputs must have an equal number of rows. #> ℹ ..1 has 2, ..2 has 3 cbind( record_batch(a = 1:2), b = c(1 + 4i, 2 + 3i) ) #> Error in `cbind()`: #> ! Error occurred when trying to convert input ..b to an Arrow Array #> Caused by error: #> ! Cannot infer type from vector

r/R/record-batch.R

paleolimbot · 2022-04-12T12:02:57Z

r/R/record-batch.R

The "name repair" operation is actually pretty complicated and I think you probably want to avoid it (maybe error for unnamed Array/R vector?). If you do want to support unnamed array-like things, you could use inputs <- rlang::enquos(..., .named = TRUE), base::make.names(), or vctrs::vec_as_names() to avoid rolling your own logic.

Yeah I'm not a fan of the default name repair (not sure if you saw #12751 (comment)). But if it's preferable to just not support rather than use a simpler solution than the canonical one, I'm fine with that.

I ended up going with base::make.names(), which was simple enough to use.

r/R/record-batch.R

wjones127 · 2022-04-12T16:02:24Z

Errors now show cbind() or rbind() as calls instead of showing internal helper functions.

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3

cbind(
  record_batch(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# And can make up names for vector input types

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5),
  c("a", "b")
)
#> RecordBatch
#> 2 rows x 3 columns
#> $a <int32>
#> $b <int32>
#> $X.1 <string>

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3

cbind(
  arrow_table(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas 
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32

nealrichardson · 2022-04-18T13:34:16Z

r/R/record-batch.R

Do you need this function? I'm pretty sure the RecordBatch constructor in C++ will check lengths (again).

The main value of the function is to provide an error message indicating which of you inputs is not the correct length. But if we didn't care about that, we could rely on the RB constructor.

Any reason the RB constructor couldn't tell you which input was the wrong length?

RB constructor inputs are always arrays, right? Whereas here, the first input is guaranteed to be a record batch. So there's not a 1:1 correspondence between arguments there that would let us provide a helpful error message.

nealrichardson · 2022-04-18T13:35:10Z

r/R/record-batch.R

Also just confirming: do you need this here or will the C++ constructor do this for you?

Not sure what you mean by "this". Repeating the values? Constructing a single column batch? I'll look at the C++ constructor and see what I can simplify.

The single value recycling is what I meant, sorry.

Yeah maybe I can refactor to not construct a temporary record batch.

Did this; we now construct a flattened list of columns out of the inputs.

nealrichardson · 2022-04-18T13:41:34Z

r/R/table.R

There is a ConcatenateTables C++ function in table.h. Any reason not to use it?

No, I'll switch to using that. That also opens up the possibility of merging schemas since the existing function has an option for that. I think I'll also add a concat_tables(..., unify_schemas = TRUE) function, since I don't think we want that default behavior in rbind.

r/tests/testthat/test-RecordBatch.R

nealrichardson · 2022-04-18T13:47:20Z

r/tests/testthat/test-RecordBatch.R

Is this desirable? FWIW we don't accept this directly in record_batch() (though the error message is not awesome):

> record_batch(a=1:2, 3:4) Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced

We can disallow if we want; I felt like it was expected behavior in the cbind API, but as you point out there's also differences in the constructor method. If it makes the code a lot simpler, I can remove this behavior.

IMO we should be consistent, and this is one of those behaviors were R is a little too loose, so I'd drop it and be more strict.

r/tests/testthat/test-Table.R

Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>

nealrichardson

This is a nice contribution.

In addition to the comments and suggestions on specific lines, I suspect we need more tests. There are lots of combinations of inputs out there, and I think the corner cases should be explored too (e.g. cbinding tables with 0 rows or 0 length vectors, cbind/rbind with only 1 argument, c.ChunkedArray with type mismatch, etc.)

r/R/array.R

r/R/record-batch.R

nealrichardson · 2022-04-19T13:39:51Z

r/R/table.R

Yes, except in cbind.Table, chunked arrays and tables are allowed.

Two thoughts here, feel free to ignore or defer to a followup:

Given how you've simplified, I wonder if that's a problem. If you passed through chunked arrays to RecordBatch$create, it should error, and hopefully with an informative message. In fact if I read the latest version correctly, cbind.RecordBatch isn't asserting that it doesn't get chunked arrays, so you're already relying on that behavior.

Maybe @paleolimbot's as_* S3 generics PR would help? cbind.RecordBatch would as_arrow_array its input columns, and cbind.Table would as_chunked_array its inputs.

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

nealrichardson · 2022-04-19T18:43:07Z

r/R/record-batch.R

+      as.list(input)
+    } else if (inherits(input, "Table") || inherits(input, "ChunkedArray")) {
+      abort("Cannot cbind a RecordBatch with Tables or ChunkedArrays",
+            i = "Hint: consider converting the record batch into a Table first")


Suggested change

i = "Hint: consider converting the record batch into a Table first")

i = "Hint: consider converting the RecordBatch into a Table first")

nealrichardson

Other than adding bindings for ConcatenateTables, this looks good to me. We'll probably want to revisit some of this after #12817 merges, but that's fine.

nealrichardson · 2022-04-21T18:51:05Z

r/R/table.R

+  tables <- list2(...)
+
+  if (!unify_schemas) {
+    # assert they have same schema


What happens if you don't do this? (Does the C++ library also do this checking?)

We actually get a pretty good error from C++ (as long as R users will recognize the 0-based indexing):

Error: Invalid: Schema at index 1 was different: a: int32 vs a: string

What do you think?

It feels silly to reimplement error handling, but also suboptimal to have the error provide indexes that are "wrong" (in context). For now I've pushed up a fix that includes the errors. But we could also just delete the error handling and just use the C++ handling.

r/R/table.R

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

r/tests/testthat/test-Table.R

nealrichardson

Nice work!

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

wjones127 · 2022-04-21T21:37:26Z

FYI, on second thought, I'd rather do the integration work with the S3 generics in a follow-up PR.

nealrichardson · 2022-04-22T11:48:23Z

@github-actions crossbow submit -g r

github-actions · 2022-04-22T12:24:22Z

Revision: 32a1d7e

Submitted crossbow builds: ursacomputing/crossbow @ actions-1905

Task	Status
conda-linux-gcc-py37-cpu-r40
conda-linux-gcc-py37-cpu-r41
conda-osx-clang-py37-r40
conda-osx-clang-py37-r41
conda-win-vs2017-py37-r40
conda-win-vs2017-py37-r41
homebrew-r-autobrew
homebrew-r-brew
test-fedora-r-clang-sanitizer
test-r-arrow-backwards-compatibility
test-r-depsource-bundled
test-r-depsource-system
test-r-dev-duckdb
test-r-devdocs
test-r-gcc-11
test-r-install-local
test-r-linux-as-cran
test-r-linux-rchk
test-r-linux-valgrind
test-r-minimal-build
test-r-offline-maximal
test-r-offline-minimal
test-r-rhub-debian-gcc-devel-lto-latest
test-r-rhub-ubuntu-gcc-release-latest
test-r-rocker-r-base-latest
test-r-rstudio-r-base-4.1-centos7-devtoolset-8
test-r-rstudio-r-base-4.1-focal
test-r-rstudio-r-base-4.1-opensuse15
test-r-rstudio-r-base-4.1-opensuse42
test-r-ubuntu-22.04
test-r-versions
test-ubuntu-18.04-r-sanitizer

kszucs · 2022-04-22T17:04:24Z

@nealrichardson can we merge this?

ursabot · 2022-04-26T14:05:13Z

Benchmark runs are scheduled for baseline = 61f47a4 and contender = f23965c. f23965c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/593| f23965ce ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/581| f23965ce test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/579| f23965ce ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/591| f23965ce ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/592| 61f47a40 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/580| 61f47a40 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/578| 61f47a40 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/590| 61f47a40 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the Component: R label Mar 29, 2022

wjones127 commented Mar 29, 2022

View reviewed changes

pitrou changed the title ~~ARROW-15989: rbind & cbind for Table & RecordBatch~~ ARROW-15989: [R] rbind & cbind for Table & RecordBatch Mar 30, 2022

paleolimbot reviewed Mar 31, 2022

View reviewed changes

paleolimbot reviewed Apr 8, 2022

View reviewed changes

wjones127 force-pushed the ARROW-15989-rbind-table branch from 6edd7ff to df1d233 Compare April 8, 2022 20:16

paleolimbot reviewed Apr 12, 2022

View reviewed changes

nealrichardson requested changes Apr 18, 2022

View reviewed changes

wjones127 and others added 9 commits April 18, 2022 13:23

Implement rbind and cbind for Table

40287fc

Simplify implementation by implementing c() for chunked array

cc298fd

Add RecordBatch implementations

d03be85

Remove old test

0eee29a

Try expect_equal

fcf6d6b

Use abort not stop

9954f4e

Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>

Change ChunkedArray so it can take chunked arrays

d88f4b2

Use splice operator

3756ea8

Test redirection errors

453bd50

wjones127 force-pushed the ARROW-15989-rbind-table branch from 2b56769 to 50eb22d Compare April 18, 2022 20:44

wjones127 requested a review from nealrichardson April 18, 2022 21:29

nealrichardson reviewed Apr 19, 2022

View reviewed changes

wjones127 and others added 5 commits April 19, 2022 07:29

Apply suggestions from code review

5722442

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Simplify checks

4771645

Test more cases with cbind and rbind

4e518d1

Fix issue with mismatched chunk types

6ae6225

Check case of zero rows

5e9d37b

wjones127 requested a review from nealrichardson April 19, 2022 18:35

nealrichardson reviewed Apr 19, 2022

View reviewed changes

Small fixes

e4cf4ee

wjones127 requested a review from nealrichardson April 20, 2022 00:04

nealrichardson reviewed Apr 21, 2022

View reviewed changes

wjones127 added 2 commits April 21, 2022 08:43

Implement concat_tables

ddd519d

Add s

a0bbf06

nealrichardson reviewed Apr 21, 2022

View reviewed changes

Handle no tables

3ed8c4d

nealrichardson reviewed Apr 21, 2022

View reviewed changes

r/R/table.R Outdated Show resolved Hide resolved

Update r/R/table.R

9c0e338

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

nealrichardson reviewed Apr 21, 2022

View reviewed changes

r/tests/testthat/test-Table.R Outdated Show resolved Hide resolved

nealrichardson approved these changes Apr 21, 2022

View reviewed changes

wjones127 and others added 3 commits April 21, 2022 13:17

Update r/tests/testthat/test-Table.R

02ff9b7

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Add to pkgdown site

40d86a6

Fix tests

32a1d7e

nealrichardson closed this in f23965c Apr 22, 2022

	i = "Hint: consider converting the record batch into a Table first")
	i = "Hint: consider converting the RecordBatch into a Table first")

ARROW-15989: [R] rbind & cbind for Table & RecordBatch #12751

ARROW-15989: [R] rbind & cbind for Table & RecordBatch #12751

Uh oh!

Conversation

wjones127 commented Mar 29, 2022

Summary of Changes

Uh oh!

github-actions bot commented Mar 29, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wjones127 commented Mar 30, 2022

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wjones127 commented Mar 31, 2022

Uh oh!

paleolimbot commented Mar 31, 2022

Uh oh!

wjones127 commented Mar 31, 2022

Uh oh!

wjones127 commented Mar 31, 2022

Uh oh!

wjones127 commented Apr 1, 2022

Uh oh!

wjones127 commented Apr 1, 2022

Uh oh!

paleolimbot commented Apr 1, 2022

Uh oh!

wjones127 commented Apr 1, 2022

Uh oh!

wjones127 commented Apr 8, 2022

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wjones127 commented Apr 11, 2022

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment