
Conversation

kszucs (Member) commented Apr 13, 2018

No description provided.

wesm (Member) commented May 17, 2018

@kszucs this is odd, it looks like there might be a conflict with the new Anaconda 5 compilers? Can you rebase?

codecov-io commented Jul 4, 2018

Codecov Report

Merging #1889 into master will decrease coverage by 0.01%.
The diff coverage is 20%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1889      +/-   ##
==========================================
- Coverage   84.39%   84.38%   -0.02%     
==========================================
  Files         293      293              
  Lines       44789    44786       -3     
==========================================
- Hits        37799    37791       -8     
- Misses       6963     6964       +1     
- Partials       27       31       +4
Impacted Files Coverage Δ
cpp/src/arrow/io/hdfs.cc 0.33% <0%> (ø) ⬆️
python/pyarrow/parquet.py 91.08% <100%> (+0.56%) ⬆️
go/arrow/math/int64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_avx2_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_amd64.go 28.57% <0%> (-14.29%) ⬇️
go/arrow/math/math_amd64.go 31.57% <0%> (-5.27%) ⬇️
go/arrow/math/float64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/int64_amd64.go 33.33% <0%> (ø) ⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 491114b...6dbbfb5.

kszucs (Member, Author) commented Jul 5, 2018

@wesm I could use some hints:

With libhdfs3 it still segfaults when setting HdfsPathInfo string members.

In the ListDirectory test case the first iteration goes well, but afterwards char* pointers have bogus addresses. For a single file, the test case passes.

There must be something obvious that I'm missing. Maybe this bad_alloc handling causes it?
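
For reference, a minimal sketch of the kind of copy that is failing; the simplified HdfsPathInfo stand-in and the SetPathInfo name below are illustrative assumptions that only approximate the Arrow internals:

#include <hdfs.h>    // libhdfs / libhdfs3 C API: hdfsFileInfo, kObjectKindFile
#include <cstdint>
#include <string>

// Simplified stand-in for the path-info struct; the real one has more fields.
struct HdfsPathInfo {
  std::string name;
  std::string owner;
  std::string group;
  int64_t size;
  bool is_file;
};

// Copy one libhdfs directory entry into the C++ struct. The char* members of
// hdfsFileInfo (mName, mOwner, mGroup) are owned by the library; with an
// ABI-incompatible libhdfs3 they can point at garbage after the first entry,
// and the std::string constructions below are exactly where that segfaults.
void SetPathInfo(const hdfsFileInfo* input, HdfsPathInfo* out) {
  out->is_file = (input->mKind == kObjectKindFile);
  out->name = std::string(input->mName);
  out->owner = std::string(input->mOwner);
  out->group = std::string(input->mGroup);
  out->size = static_cast<int64_t>(input->mSize);
}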

wesm (Member) commented Jul 5, 2018

Oof, sorry that you've run into this. libhdfs3 2.3.x on conda-forge introduced an ABI incompatibility, see

https://issues.apache.org/jira/browse/ARROW-1465

In the meantime, you can pin to version 2.2.31 to make the problem go away.

wesm (Member) commented Jul 5, 2018

The discussion of the underlying problem is in https://issues.apache.org/jira/browse/ARROW-1445

kszucs (Member, Author) commented Jul 5, 2018

Hmm, so these are the symptoms of ABI incompatibility. Pinning it, thanks!

kszucs (Member, Author) commented Jul 5, 2018

Another error happens, but this one seems fixable. I'll investigate it tomorrow.

kszucs (Member, Author) commented Jul 9, 2018

libhdfs3 works now. I'm currently adding libhdfs to the image as well.

wesm (Member) commented Jul 19, 2018

Thank you @kszucs. I will take this for a spin locally and then merge

wesm (Member) commented Jul 22, 2018

Checked this out and got this error:

~/code/arrow/python/testing  (ARROW-2300)$ ./test_hdfs.sh 
+ docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
Sending build context to Docker daemon  36.86kB
Step 1/6 : FROM cpcloud86/impala:metastore
manifest for cpcloud86/impala:metastore not found

kszucs (Member, Author) commented Jul 22, 2018

Something must have gone wrong during checkout, because this branch no longer uses the Impala image.

wesm (Member) commented Jul 22, 2018

Ok, let me nuke my branch and start over

kszucs (Member, Author) commented Jul 22, 2018

If I recall correctly, a couple of tests are failing, but the setup works properly.

kszucs (Member, Author) commented Jul 22, 2018

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] TestHadoopFileSystem/0.MultipleClients, where TypeParam = arrow::io::JNIDriver
[  FAILED  ] TestHadoopFileSystem/0.MakeDirectory, where TypeParam = arrow::io::JNIDriver

kszucs (Member, Author) commented Jul 23, 2018

I've fixed the Python errors, but the previously pasted libhdfs tests fail with errno 2 even though the directory exists.
AFAIK errno isn't handled precisely in libhdfs; I'm using Hadoop 2.6.0.

kszucs (Member, Author) commented Jul 23, 2018

@wesm errno == 2 here even though the directory exists; this only occurs with libhdfs.

Should we wrap hdfsListDirectory with a closure, or store the driver/adapter type in LibHdfsShim, with the following condition?

if (result == nullptr && errno == 2 && this->hdfsExists(fs, path)) {
    // if libhdfs reports errno 2: no such file or directory
    // however the path exists, we set errno to zero
    errno = 0;
}

It's a super hacky solution and I don't like it, but I can't think of an alternative. Perhaps we should pin a minimal libhdfs version, assuming this has been fixed upstream.
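
To make the idea concrete, a sketch of such a wrapper; the function name ListDirectoryChecked is hypothetical, and for brevity it calls the raw libhdfs functions directly rather than going through the shim's function pointers:

#include <hdfs.h>    // hdfsFS, hdfsFileInfo, hdfsListDirectory, hdfsExists
#include <cerrno>

hdfsFileInfo* ListDirectoryChecked(hdfsFS fs, const char* path, int* num_entries) {
  errno = 0;  // clear any stale value left over from earlier libhdfs calls
  hdfsFileInfo* result = hdfsListDirectory(fs, path, num_entries);
  // With Hadoop 2.6 libhdfs we see errno 2 (ENOENT) here even though the
  // directory exists. hdfsExists returns 0 when the path exists, so in that
  // case treat the result as an empty listing rather than an error.
  if (result == nullptr && errno == ENOENT && hdfsExists(fs, path) == 0) {
    errno = 0;
    if (num_entries != nullptr) {
      *num_entries = 0;
    }
  }
  return result;
}

The caller would then interpret a null result with errno == 0 as an empty directory instead of a failure.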

wesm (Member) commented Jul 23, 2018

Let me have a closer look today -- I thought I had run this with Hadoop 2.6 before without hitting these issues.

wesm (Member) commented Jul 23, 2018

@kszucs do you see this?

https://github.com/kszucs/arrow/blob/ARROW-2300/python/testing/hdfs/Dockerfile#L19

The scope of the JIRA issue was to fix python/testing/test_hdfs.sh, so we should either fix that or remove it. Can you add instructions for how to run the HDFS tests?

kszucs (Member, Author) commented Jul 23, 2018

Ohh, sorry @wesm!

I've ported the HDFS integration tests from arrow/python/testing to arrow/dev/, next to spark_integration, because it fits more naturally with the docker-compose setup. It also has a broader scope because it runs the C++ tests too.
I plan to port the dask integration as well and eventually remove arrow/python/testing (the best place would be arrow/integration, though).

Please use the following command: arrow/dev/run_docker_compose.sh hdfs_integration

wesm (Member) commented Jul 23, 2018

Sweet thanks =)

kszucs (Member, Author) commented Jul 23, 2018

Once we have all of the integration tests under the same roof (dask, spark, hdfs), we can simply submit them as crossbow tasks (similar to the build tasks, but defined in a separate tests.yml).
Nightly integration tests would then run in the same fashion as the nightly builds.

wesm (Member) commented Jul 23, 2018

@kszucs I applied your workaround for the errno 2 thing. I fiddled with this a little bit (including setting errno = 0 before the hdfsListDirectory call, which made one of the errors go away; I guess errno was in a bad state from elsewhere) but couldn't get it completely working without the workaround.

+1. I will merge this once the build completes to make sure I didn't mess up the lint

wesm (Member) commented Jul 23, 2018

thanks @kszucs!

wesm closed this in eaa6053 on Jul 23, 2018
