Skip to content

Conversation

@wjones127
Copy link
Member

Summary of Changes

  • Added rbind and cbind for Table
  • Added cbind for RecordBatch. rbind just redirects the user to use Table$create
  • Changed c.Array() implementation to use either concat_array() or ChunkedArray$create() depending on whether the user wants a single array or zero-copy.
  • Implemented c.ChunkedArray

@github-actions
Copy link

Comment on lines 220 to 223
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little silly to me, so would welcome suggestions for a better way to write this :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to do RecordBatch$create(!!! args, schema = schema) (because we use rlang::list2() to ingest ..., which allows 'splicing' via !!!).

@pitrou pitrou changed the title ARROW-15989: rbind & cbind for Table & RecordBatch ARROW-15989: [R] rbind & cbind for Table & RecordBatch Mar 30, 2022
@wjones127
Copy link
Member Author

cc @jonkeane @paleolimbot

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be very cool! I have a few notes here about vctrs::vec_cbind()...it's worth checking out how much of vctrs::vec_cbind() and vctrs::vec_rbind() we can emulate without turning this into a huge rabbit hole. The big features are:

  • rbind() doesn't care about column order (just names)
  • cbind() can recycle size-1 arguments to the length of the other arguments
  • rbind() will fill missing columns with NA

Whatever subset of that is easy to achieve is probably worth putting in here (but maybe not all of it since vctrs has a lot of tools to help make all that happen automatically).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Maybe chunked_arrays instead of arrays? (Given that both appear here and the difference matters)
  • If you can rig it so that you do ChunkedArrays$create(!!! list_of_arrays) it might be cleaner (and might give a nicer error message when it fails)
  • What happens when the arrays have different types? I think I rigged something in concat_arrays() so that you can specify the final type which might be worth adopting here too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's possible for somebody to do c(this_is_a_chunked_array, 1:10, c("a", "b", "c)) (so you might want to do chunked_arrays <- lapply(chunked_arrays, ChunkedArray$create)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I modify ChunkedArray$create() to accept a chunked array as an argument, it basically becomes it's own c() implementation. It also already has a type parameter the handles coercing to a common type, so I think that's handled.

@wjones127
Copy link
Member Author

@paleolimbot For the column filling, should that be the behavior of rbind, or an alternative row binding function? For example, I think tibble has separated behavior for each:

> library(dplyr)
> rbind(tibble(x = 1:2), tibble(y = 2:4))
Error in match.names(clabs, names(xi)) : 
  names do not match previous names
> bind_rows(tibble(x = 1:2), tibble(y = 2:4))
# A tibble: 5 × 2
      x     y
  <int> <int>
1     1    NA
2     2    NA
3    NA     2
4    NA     3
5    NA     4

@paleolimbot
Copy link
Member

That's a good point...rbind() doesn't do that for any type of object. But it is really useful and difficult to replicate if there's a good way to do it here!

@wjones127
Copy link
Member Author

But it is really useful and difficult to replicate if there's a good way to do it here!

I 100% agree. However, there seems to be an issue on the C++ side blocking this. I've created a new issue to tackle that. https://issues.apache.org/jira/browse/ARROW-16085

@wjones127
Copy link
Member Author

Got the cbind implementation to work with more types, so it's now more analogous to base cbind.

@wjones127
Copy link
Member Author

Seems that S3 dispatch isn't behaving nicely in R 3.6 for cbind.Table:

> do.call(cbind.Table, inputs)
Table
2 rows x 5 columns
$z <int32>
$a <int32>
$b <int32>
$c <dictionary<values=string, indices=int8>>
$d <int32>
> do.call(cbind, inputs)
          b c d
 [1,] ? ? ? 1 1
 [2,] ? ? ? 2 1
 [3,] ? ? ? 1 1
 [4,] ? ? ? 2 1
 [5,] ? ? ? 1 1
 [6,] ? ? ? 2 1
 [7,] ? ? ? 1 1
 [8,] ? ? ? 2 1
 [9,] ? ? ? 1 1
[10,] ? ? ? 2 1
[11,] ? ? ? 1 1
[12,] ? ? ? 2 1
[13,] ? ? ? 1 1
[14,] ? ? ? 2 1
[15,] ? ? ? 1 1
[16,] ? ? ? 2 1
[17,] ? ? ? 1 1
[18,] ? ? ? 2 1
[19,] ? ? ? 1 1
[20,] ? ? ? 2 1
[21,] ? ? ? 1 1
[22,] ? ? ? 2 1
[23,] ? ? ? 1 1
[24,] ? ? ? 2 1
[25,] ? ? ? 1 1
[26,] ? ? ? 2 1
[27,] ? ? ? 1 1
[28,] ? ? ? 2 1
[29,] ? ? ? 1 1
[30,] ? ? ? 2 1
[31,] ? ? ? 1 1
[32,] ? ? ? 2 1
[33,] ? ? ? 1 1
[34,] ? ? ? 2 1
[35,] ? ? ? 1 1
[36,] ? ? ? 2 1
Warning message:
In (function (..., deparse.level = 1)  :
  number of rows of result is not a multiple of vector length (arg 2)

@wjones127
Copy link
Member Author

The reason for the odd dispatch is that R 3.6 looks for a cbind implementation for each argument, and if it finds multiple then it falls back to the built-in cbind. Our cbind.RecordBatch test only has one input with a cbind method (the record batch), but our cbind.Table test has two (record batch + table).

@paleolimbot
Copy link
Member

That is wild! A way around that would be to define our own arrow_rbind() and arrow_cbind() and error for built-in rbind() and cbind() since it's harder to predict exactly which method will get called.

@wjones127
Copy link
Member Author

That is wild! A way around that would be to define our own arrow_rbind() and arrow_cbind() and error for built-in rbind() and cbind() since it's harder to predict exactly which method will get called.

That's a good point. I think the R 4.0+ behavior works well, and not sure how much longer we are supporting 3.6, so what I've done for now is just modify the tests to not enforce that particular case in R < 4.0.

@wjones127
Copy link
Member Author

FYI @paleolimbot this is ready for re-review

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a really useful new feature!

It's useful for me to do a thorough review of new features like this one to have a reprex to start from (sorry for not mentioning this in the last review). Below is what I came up with, and it exposed a few rough edges (I've tried to put potential fixes next to the relevant lines in the review).

I notice you go back and forth between e.g., chunked_array() and ChunkedArray$create() / RecordBatch$create() and record_batch(). Recently I've seen record_batch() used more frequently in tests although I think either style is fine as long as it matches the other calls in the respective source files.

# remotes::install_github("apache/arrow/r#12751")
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use Table$create to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use concat_arrays() to create a new array, ChunkedArray$create() to create a zero-copy concatenation

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments (this error needs some work)
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in sprintf("Input ..%d cannot be converted to an Arrow Array", idx): invalid format '%d'; use format %s for character objects

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments (this error needs some work)
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in sprintf("Input ..%i cannot be converted to an Arrow Array: %s", : invalid format '%i'; use format %s for character objects

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas (this error should use `abort()`)
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in rbind(deparse.level, ...): Schema at index 2 does not match the first schema
#> Schema 1:
#> a: int32
#> Schema 2:
#> b: int32

# did I miss anything else?

Created on 2022-04-08 by the reprex package (v2.0.1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of this is that idx might be an integer (if all arguments are unnamed), "" (if some arguments were named), or "the_arg_name". Your code below assumes some of both, and you probably want lapply(seq_along(inputs), function(i) {...}) so that you can access both the name and the location when you need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm not a fan of how base R handles the case of no names:

> cbind(data.frame(x = 1:2), b = 3:4, 5:6, c("a", "b"))
  x b 5:6 c("a", "b")
1 1 3   5           a
2 2 4   6           b

So I opted for using X{i} instead:

> as.data.frame(cbind(record_batch(x = 1:2), b = 3:4, 5:6, c("a", "b")))
  x b X3 X4
1 1 3  5  a
2 2 4  6  b

LMK what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that works! I would hope that nobody relies on this behaviour but it's definitely better than the base R case.

@wjones127 wjones127 force-pushed the ARROW-15989-rbind-table branch from 6edd7ff to df1d233 Compare April 8, 2022 20:16
@wjones127
Copy link
Member Author

I've improved the error messages and added support for data.frame in cbind.

library(arrow, warn.conflicts = FALSE)

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3

cbind(
  record_batch(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error in `cbind_check_length()`:
#> ! Non-scalar inputs must have an equal number of rows. ..1 has 2, ..2 has 3

cbind(
  arrow_table(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occured when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas 
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few tidbits to polish it up! This will be awesome!

r/R/table.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to move call <- caller_env() to inside the function...it's not clear to me that anybody would ever pass it as an argument. If you do need it as an argument, use .call instead of call for the name (in the off chance anybody uses this to add a column named call...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on re-reading where you use this, I think you can avoid it entirely (I think rlang is smart enough to jump over the tryCatch when signaling an error condition with a parent but definitely check). If you do need it, I think you want sys.call(1) rather than caller_env().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I move caller_env() inside of the tryCatch(), the error comes out as:

#> Error in `tryCatchOne()`:

I can just move the caller_env() just inside the function and that works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just inside the function is what I had in mind.

Are you sure about caller_env() though? I really think that sys.call() is what you're after.

some_fun <- function() {
  user_call <- sys.call()
  tryCatch(
    stop("whoa!"),
    error = function(e) {
      rlang::abort("whoa revamped!", call = user_call, parent = e)
    }
  )
}

some_fun()
#> Error in `some_fun()`:
#> ! whoa revamped!
#> Caused by error in `doTryCatch()`:
#> ! whoa!

some_fun2 <- function() {
  user_call <- rlang::caller_env()
  tryCatch(
    stop("whoa!"),
    error = function(e) {
      rlang::abort("whoa revamped!", call = user_call, parent = e)
    }
  )
}

some_fun2()
#> Error:
#> ! whoa revamped!
#> Caused by error in `doTryCatch()`:
#> ! whoa!

Created on 2022-04-12 by the reprex package (v2.0.1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh that's very odd. I get the same thing with your reprex there, but in my reprex below it is working well as is:

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3

cbind(
  record_batch(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "name repair" operation is actually pretty complicated and I think you probably want to avoid it (maybe error for unnamed Array/R vector?). If you do want to support unnamed array-like things, you could use inputs <- rlang::enquos(..., .named = TRUE), base::make.names(), or vctrs::vec_as_names() to avoid rolling your own logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm not a fan of the default name repair (not sure if you saw #12751 (comment)). But if it's preferable to just not support rather than use a simpler solution than the canonical one, I'm fine with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up going with base::make.names(), which was simple enough to use.

@wjones127
Copy link
Member Author

Errors now show cbind() or rbind() as calls instead of showing internal helper functions.

# Concatenating chunked arrays returns a new chunked array containing all chunks
a <- chunked_array(c(1, 2), 3)
b <- chunked_array(c(4, 5), 6)
c(a, b)
#> ChunkedArray
#> [
#>   [
#>     1,
#>     2
#>   ],
#>   [
#>     3
#>   ],
#>   [
#>     4,
#>     5
#>   ],
#>   [
#>     6
#>   ]
#> ]

# RecordBatch doesn't support rbind
rbind(
  RecordBatch$create(a = 1:10),
  RecordBatch$create(a = 2:4)
)
#> Error in `rbind()`:
#> ! Use `Table$create()` to combine record batches

# Array no longer supports c()
a <- Array$create(c(1, 2))
b <- Array$create(c(4, 5))
c(a, b)
#> Error in `c()`:
#> ! Use `concat_arrays()` or `ChunkedArray$create()` instead.
#> ℹ `concat_arrays()` creates a new Array by copying data.
#> ℹ `ChunkedArray$create()` uses the arrays as chunks for zero-copy concatenation.

# cbind() for Recordbatch handles a variety of input types
cbind(
  record_batch(a = 1:2),
  record_batch(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  data.frame(b = 4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5)
)
#> RecordBatch
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  record_batch(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3

cbind(
  record_batch(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# And can make up names for vector input types

cbind(
  record_batch(a = 1:2),
  b = Array$create(4:5),
  c("a", "b")
)
#> RecordBatch
#> 2 rows x 3 columns
#> $a <int32>
#> $b <int32>
#> $X.1 <string>

# cbind() for Table works the same
cbind(
  arrow_table(a = 1:2),
  arrow_table(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  data.frame(b = 4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:5)
)
#> Table
#> 2 rows x 2 columns
#> $a <int32>
#> $b <int32>

# ...with nice errors for invalid arguments
cbind(
  arrow_table(a = 1:2),
  b = Array$create(4:6)
)
#> Error in `cbind()`:
#> ! Non-scalar inputs must have an equal number of rows.
#> ℹ ..1 has 2, ..2 has 3

cbind(
  arrow_table(a = 1:2),
  b = c(1 + 4i, 2 + 3i)
)
#> Error in `cbind()`:
#> ! Error occurred when trying to convert input ..b to an Arrow Array
#> Caused by error:
#> ! Cannot infer type from vector

# ...rbind works for Tables with a common schema
rbind(
  arrow_table(a = 1:2),
  arrow_table(a = 3:4)
)
#> Table
#> 4 rows x 1 columns
#> $a <int32>

# ...with nice errors for Tables with non-unifiable schemas 
rbind(
  arrow_table(a = 1:2),
  arrow_table(b = 3:4)
)
#> Error in `rbind()`:
#> ! Schema at index 2 does not match the first schema
#> ℹ Schema 1:
#> a: int32
#> ℹ Schema 2:
#> b: int32

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this function? I'm pretty sure the RecordBatch constructor in C++ will check lengths (again).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main value of the function is to provide an error message indicating which of you inputs is not the correct length. But if we didn't care about that, we could rely on the RB constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason the RB constructor couldn't tell you which input was the wrong length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RB constructor inputs are always arrays, right? Whereas here, the first input is guaranteed to be a record batch. So there's not a 1:1 correspondence between arguments there that would let us provide a helpful error message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also just confirming: do you need this here or will the C++ constructor do this for you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by "this". Repeating the values? Constructing a single column batch? I'll look at the C++ constructor and see what I can simplify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single value recycling is what I meant, sorry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe I can refactor to not construct a temporary record batch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this; we now construct a flattened list of columns out of the inputs.

r/R/table.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a ConcatenateTables C++ function in table.h. Any reason not to use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'll switch to using that. That also opens up the possibility of merging schemas since the existing function has an option for that. I think I'll also add a concat_tables(..., unify_schemas = TRUE) function, since I don't think we want that default behavior in rbind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this desirable? FWIW we don't accept this directly in record_batch() (though the error message is not awesome):

> record_batch(a=1:2, 3:4)
Error: Unknown: only data frames are allowed as unnamed arguments to be auto spliced

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can disallow if we want; I felt like it was expected behavior in the cbind API, but as you point out there's also differences in the constructor method. If it makes the code a lot simpler, I can remove this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should be consistent, and this is one of those behaviors were R is a little too loose, so I'd drop it and be more strict.

@wjones127 wjones127 force-pushed the ARROW-15989-rbind-table branch from 2b56769 to 50eb22d Compare April 18, 2022 20:44
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice contribution.

In addition to the comments and suggestions on specific lines, I suspect we need more tests. There are lots of combinations of inputs out there, and I think the corner cases should be explored too (e.g. cbinding tables with 0 rows or 0 length vectors, cbind/rbind with only 1 argument, c.ChunkedArray with type mismatch, etc.)

r/R/table.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, except in cbind.Table, chunked arrays and tables are allowed.

Two thoughts here, feel free to ignore or defer to a followup:

  1. Given how you've simplified, I wonder if that's a problem. If you passed through chunked arrays to RecordBatch$create, it should error, and hopefully with an informative message. In fact if I read the latest version correctly, cbind.RecordBatch isn't asserting that it doesn't get chunked arrays, so you're already relying on that behavior.
  2. Maybe @paleolimbot's as_* S3 generics PR would help? cbind.RecordBatch would as_arrow_array its input columns, and cbind.Table would as_chunked_array its inputs.

as.list(input)
} else if (inherits(input, "Table") || inherits(input, "ChunkedArray")) {
abort("Cannot cbind a RecordBatch with Tables or ChunkedArrays",
i = "Hint: consider converting the record batch into a Table first")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
i = "Hint: consider converting the record batch into a Table first")
i = "Hint: consider converting the RecordBatch into a Table first")

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than adding bindings for ConcatenateTables, this looks good to me. We'll probably want to revisit some of this after #12817 merges, but that's fine.

tables <- list2(...)

if (!unify_schemas) {
# assert they have same schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you don't do this? (Does the C++ library also do this checking?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually get a pretty good error from C++ (as long as R users will recognize the 0-based indexing):

Error: Invalid: Schema at index 1 was different: 
a: int32
vs
a: string

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels silly to reimplement error handling, but also suboptimal to have the error provide indexes that are "wrong" (in context). For now I've pushed up a fix that includes the errors. But we could also just delete the error handling and just use the C++ handling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your call

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

wjones127 and others added 3 commits April 21, 2022 13:17
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
@wjones127
Copy link
Member Author

FYI, on second thought, I'd rather do the integration work with the S3 generics in a follow-up PR.

@nealrichardson
Copy link
Member

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: 32a1d7e

Submitted crossbow builds: ursacomputing/crossbow @ actions-1905

Task Status
conda-linux-gcc-py37-cpu-r40 Azure
conda-linux-gcc-py37-cpu-r41 Azure
conda-osx-clang-py37-r40 Azure
conda-osx-clang-py37-r41 Azure
conda-win-vs2017-py37-r40 Azure
conda-win-vs2017-py37-r41 Azure
homebrew-r-autobrew Github Actions
homebrew-r-brew Github Actions
test-fedora-r-clang-sanitizer Azure
test-r-arrow-backwards-compatibility Github Actions
test-r-depsource-bundled Azure
test-r-depsource-system Github Actions
test-r-dev-duckdb Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal Github Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-4.1-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-4.1-focal Azure
test-r-rstudio-r-base-4.1-opensuse15 Azure
test-r-rstudio-r-base-4.1-opensuse42 Azure
test-r-ubuntu-22.04 Github Actions
test-r-versions Github Actions
test-ubuntu-18.04-r-sanitizer Azure

@kszucs
Copy link
Member

kszucs commented Apr 22, 2022

@nealrichardson can we merge this?

@ursabot
Copy link

ursabot commented Apr 26, 2022

Benchmark runs are scheduled for baseline = 61f47a4 and contender = f23965c. f23965c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/593| f23965ce ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/581| f23965ce test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/579| f23965ce ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/591| f23965ce ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/592| 61f47a40 ec2-t3-xlarge-us-east-2>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/580| 61f47a40 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/578| 61f47a40 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/590| 61f47a40 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants