[SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames #7318
Conversation
Test build #36912 has finished for PR 7318 at commit
Review comment on R/pkg/inst/tests/test_sparkSQL.R (outdated):
`expect_identical(expected, ordered)`
Test build #36946 has finished for PR 7318 at commit
Jenkins! slow test please
Test build #4 has finished for PR 7318 at commit
Review comment on R/pkg/inst/tests/test_sparkSQL.R (outdated):
I think I figured out what is going on here: the expected data.frame creates the column `a_b` as a factor. You can pass `stringsAsFactors = FALSE` to the `data.frame` constructor to avoid this.
Also, I think the order command above should probably be `ordered <- ct[order(ct$a_b), ]` to get all the rows back out of it. It still attaches row names, which makes the result not identical (you can do `row.names(ordered) <- NULL`).
A simpler check might be to just get one or two rows from `ct` and then compare them with `expected`.
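The suggestion above (order by the key column, drop row names, then compare) boils down to an order-insensitive row comparison. A minimal sketch of that idea in plain Python, with rows as tuples; the helper name and the `expected`/`shuffled` data are illustrative, not part of the PR:

```python
# Order-insensitive comparison of two tables represented as lists of
# (key, count) row tuples. Sorting normalizes row order, mirroring the
# R suggestion of ordering by a_b and resetting row names before
# calling expect_identical.
def rows_equal_ignoring_order(expected, actual):
    return sorted(expected) == sorted(actual)

expected = [("a_1", 1), ("a_2", 1), ("b_1", 0), ("b_2", 2)]
shuffled = [("b_2", 2), ("a_1", 1), ("b_1", 0), ("a_2", 1)]
print(rows_equal_ignoring_order(expected, shuffled))  # True
```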
Finally I got the unit test running on my local machine. I upgraded R from 3.0.2 to 3.2.0; I don't know whether that was the cause. The unit test should work now.
@shivaram This should be ready for review after Jenkins.
Test build #37688 has finished for PR 7318 at commit
This is minor, but it might be good to have a test case where we get NULL to just make sure that code path works correctly.
I don't quite get what you mean.
So the help message says that "Pairs that have no occurrences will have null as their counts." I just wanted to make sure that case works correctly.
Ah, we need to update the doc. The behavior changed in 6396cc0: we output 0 instead of null. I will send a follow-up PR for this because it also touches Scala and Python code.
I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-9243
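For illustration, here is a tiny plain-Python sketch of the zero-fill behavior discussed above (an analog, not Spark's implementation; the `crosstab_counts` helper and its inputs are hypothetical):

```python
from collections import Counter
from itertools import product

def crosstab_counts(pairs):
    """Count (col1, col2) pairs, emitting 0 (not null/None) for every
    combination of observed levels that never occurs -- the behavior
    the doc should describe after 6396cc0."""
    counts = Counter(pairs)
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    return {(a, b): counts.get((a, b), 0) for a, b in product(rows, cols)}

table = crosstab_counts([("a", 1), ("a", 1), ("b", 2)])
print(table[("b", 1)])  # 0 -- the pair ("b", 1) never occurs
```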
Ah okay - sounds good. LGTM
@mengxr One question about the function naming convention -- What are your thoughts on using …
I don't have a strong preference about the name. We use …
Hmm okay. Let's leave it as …
@mengxr I guess we will have a new PR for the documentation update, so this PR LGTM. I will merge this unless you have anything else to add.
Yes, please merge it. Thanks for reviewing! |
Add `crosstab` to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to `table` in R. However, `table` in SparkR is already used for loading SQL tables as DataFrames. The return type is data.frame instead of table so that `crosstab` stays compatible with Scala/Python.

I couldn't run the R tests successfully on my local machine; many unit tests failed, so let's try Jenkins.
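As a rough sketch of the shape the result takes (first column named `<col1>_<col2>`, one column per distinct value of the second column, counts as cells), here is a plain-Python analog; the `crosstab_shape` helper and the column names are illustrative, not Spark's code:

```python
from collections import Counter

def crosstab_shape(pairs, col1="a", col2="b"):
    """Build a contingency table shaped like the crosstab result:
    header is [col1_col2, *sorted distinct col2 values], and each row
    is [col1 value, count, count, ...] with 0 for absent pairs."""
    counts = Counter(pairs)
    rows = sorted({x for x, _ in pairs})
    cols = sorted({y for _, y in pairs})
    header = [f"{col1}_{col2}"] + [str(c) for c in cols]
    body = [[r] + [counts.get((r, c), 0) for c in cols] for r in rows]
    return header, body

header, body = crosstab_shape([(1, "x"), (1, "x"), (2, "y")])
print(header)  # ['a_b', 'x', 'y']
print(body)    # [[1, 2, 0], [2, 0, 1]]
```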