
Extract Python and Dask `Executor` classes from `Workflow`

Open · karlhigley opened this issue 3 years ago · 13 comments

We'd like to reuse the mechanics of graph execution (both local and distributed) in other parts of Merlin, so this is a step toward disentangling graph execution from Workflow itself. It removes direct dependencies on Dask from Workflow and centralizes them in MerlinDaskExecutor, which Workflow can then use in conjunction with a Merlin operator DAG to run distributed computations.

In the future, we'd like to use these Executor classes in Merlin Systems too, so that we can run the full process of generating recommendations (also represented as a Merlin DAG) interchangeably either in Triton (using MerlinPythonExecutor) or on Dask (using MerlinDaskExecutor).
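To make the shape of this refactor concrete, here is a minimal sketch of the idea: a Workflow holds an operator DAG, and an interchangeable executor walks that DAG. The names below (`Node`, `LocalExecutor`) are illustrative stand-ins, not the actual Merlin classes or API.

```python
# Hypothetical sketch: an executor walks an operator DAG that the Workflow
# only *describes*, so the Workflow itself carries no Dask dependency.

class Node:
    """One operator in the DAG, with edges to its upstream nodes."""
    def __init__(self, op, parents=()):
        self.op = op              # a callable transform: data -> data
        self.parents = list(parents)

class LocalExecutor:
    """Runs a DAG eagerly in-process (the MerlinPythonExecutor role)."""
    def transform(self, data, output_node):
        # Depth-first: compute upstream nodes first, then apply this node's op.
        for parent in output_node.parents:
            data = self.transform(data, parent)
        return output_node.op(data)

# A Dask-backed executor (the MerlinDaskExecutor role) would share this
# interface but map each op over dataframe partitions instead of running
# eagerly on a single in-memory object.

add_one = Node(lambda xs: [x + 1 for x in xs])
double = Node(lambda xs: [x * 2 for x in xs], parents=[add_one])

executor = LocalExecutor()
result = executor.transform([1, 2, 3], double)  # [4, 6, 8]
```

Because the executor is the only component that knows *how* the graph runs, swapping local for distributed execution becomes a one-object change rather than a Workflow rewrite.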

karlhigley avatar Jul 12 '22 16:07 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 4f3e941e62750333eccd6899cccf6181575b9b1e, no merge conflicts.
Running as SYSTEM
Setting status of 4f3e941e62750333eccd6899cccf6181575b9b1e to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4573/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 4f3e941e62750333eccd6899cccf6181575b9b1e^{commit} # timeout=10
Checking out Revision 4f3e941e62750333eccd6899cccf6181575b9b1e (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4f3e941e62750333eccd6899cccf6181575b9b1e # timeout=10
Commit message: "Clean up `MerlinDaskExecutor.fit()`"
 > git rev-list --no-walk 1be6d8849ce7ced685fb755e168766b150e37536 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins4082524424769022190.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1428 items

tests/unit/test_dask_nvt.py ............................................ [ 3%] ........................................................................ [ 8%] [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py .. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%] ........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%] ...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ [ 37%] ........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 62%] .. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%] .................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33 /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. DASK_VERSION = LooseVersion(dask.version)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)

nvtabular/loader/init.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/init.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader. warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_s3.py: 2 warnings tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(

tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1427 passed, 1 skipped, 619 warnings in 710.53s (0:11:50) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins16256471248351385987.sh

nvidia-merlin-bot avatar Jul 12 '22 16:07 nvidia-merlin-bot

Click to view CI Results
GitHub pull request #1609 of commit 64914a5f8965c646133e4417b807717ebfde610f, no merge conflicts.
Setting status of 64914a5f8965c646133e4417b807717ebfde610f to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4583/ and message: 'Build started for merge commit.'
Commit message: "Merge branch 'main' into refactor/decouple-dask"
collected 1428 items
========== 1427 passed, 1 skipped, 619 warnings in 699.43s (0:11:39) ===========

nvidia-merlin-bot avatar Jul 15 '22 13:07 nvidia-merlin-bot

I suppose this would (partially) intersect with https://github.com/NVIDIA-Merlin/core/issues/70

rjzamora avatar Jul 15 '22 20:07 rjzamora

Yeah, good point @rjzamora. I would like to be able to do Dask computations across all the Merlin libraries, and also use Merlin graphs to run computations without Dask in some contexts (e.g. in Triton), so I ended up with a somewhat different design.
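The "same graph, different runtimes" design described in the comment above can be sketched with a structural interface: any object exposing `transform` can run the graph, so serving code never names a concrete backend. The names here (`Executor`, `InProcessExecutor`, `serve`) are illustrative, not the real Merlin classes.

```python
# Hypothetical sketch of interchangeable executors behind one interface.
from typing import Protocol

class Executor(Protocol):
    def transform(self, data, graph): ...

class InProcessExecutor:
    """Plays the MerlinPythonExecutor role: runs ops eagerly, no Dask."""
    def transform(self, data, graph):
        for op in graph:          # here a "graph" is just an ordered op list
            data = op(data)
        return data

# Serving code accepts any Executor, so a Triton (in-process) path and a
# Dask (distributed) path can share the same recommendation DAG.
def serve(data, graph, executor: Executor):
    return executor.transform(data, graph)

recs = serve([3, 1, 2], [sorted, lambda xs: xs[:2]], InProcessExecutor())
# recs == [1, 2]
```

With this shape, dropping in a Dask-backed executor changes only the object passed to `serve`, which is the interchangeability the comment describes.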

karlhigley avatar Jul 20 '22 21:07 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 7ca7c0def80043f81602f0400142d8e866a5d562, no merge conflicts.
Setting status of 7ca7c0def80043f81602f0400142d8e866a5d562 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4600/ and message: 'Build started for merge commit.'
Commit message: "Merge branch 'main' into refactor/decouple-dask"
collected 1428 items

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ========== 1427 passed, 1 skipped, 619 warnings in 697.38s (0:11:37) ===========

nvidia-merlin-bot avatar Jul 20 '22 21:07 nvidia-merlin-bot

Arbitration: which initiative is this under?

viswa-nvidia avatar Jul 29 '22 22:07 viswa-nvidia

😃 This PR is a great example of separating changes into well-defined commits, which makes reviewing a refactor like this easy to follow. 🚀 It looks like a great step toward being able to run these transforms in different modes.

I imagine we may identify further changes as we try to use this in Systems. In the interest of keeping the changes relatively small, it seems to be in a mergeable state to me.

oliverholworthy avatar Aug 02 '22 11:08 oliverholworthy

@viswa-nvidia This PR was opened on the premise that we'd be working on offline batch recommendation generation in 22.08, as we'd planned before session-based work bumped it out of the way. Since we still plan to work on offline batch generation (albeit later than we'd originally hoped), this PR is still relevant, but it's not tied to one of the pieces of work slated for 22.08.

karlhigley avatar Aug 02 '22 14:08 karlhigley
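For readers following along, the executor-swapping idea described above can be sketched in miniature. This is an illustrative toy only: `LocalExecutor` and `PartitionedExecutor` below are hypothetical stand-ins for `MerlinPythonExecutor` and `MerlinDaskExecutor`, and the `Workflow` class here is not the real Merlin API, just a sketch of a workflow delegating execution of its operator DAG to a pluggable executor.

```python
# Toy sketch (NOT the real Merlin API): a workflow that holds a chain of
# operators and delegates running them to an interchangeable executor.

class LocalExecutor:
    """Runs a sequence of operator callables on a single in-memory chunk."""

    def transform(self, chunk, ops):
        for op in ops:
            chunk = op(chunk)
        return chunk


class PartitionedExecutor:
    """Runs the same operator sequence independently over partitions,
    standing in for a distributed (e.g. Dask-based) executor."""

    def __init__(self, inner=None):
        self.inner = inner or LocalExecutor()

    def transform(self, partitions, ops):
        return [self.inner.transform(part, ops) for part in partitions]


class Workflow:
    """Holds the operator DAG; execution strategy is injected."""

    def __init__(self, ops, executor=None):
        self.ops = ops
        self.executor = executor or LocalExecutor()

    def transform(self, data):
        return self.executor.transform(data, self.ops)


# Two tiny "operators" over a list-of-values column
def fill_missing(col):
    return [0 if v is None else v for v in col]

def clip(col):
    return [max(v, 0) for v in col]


# Same DAG, local execution on one chunk:
print(Workflow([fill_missing, clip]).transform([None, -3, 5]))  # [0, 0, 5]

# Same DAG, "distributed" execution over partitions:
wf = Workflow([fill_missing, clip], executor=PartitionedExecutor())
print(wf.transform([[None, -1], [2, None]]))  # [[0, 0], [2, 0]]
```

The point of the separation is exactly what the PR description states: the workflow owns the DAG, while the executor owns the mechanics of running it, so the same graph can be executed in-process or on a cluster without changing the workflow itself.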

Click to view CI Results
GitHub pull request #1609 of commit 242fc3657c847d7ed026dc657dc5a331c73ca015, no merge conflicts.
Running as SYSTEM
Setting status of 242fc3657c847d7ed026dc657dc5a331c73ca015 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4612/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 242fc3657c847d7ed026dc657dc5a331c73ca015^{commit} # timeout=10
Checking out Revision 242fc3657c847d7ed026dc657dc5a331c73ca015 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 302f7c355a27bd485f293a4494785ea89d29949e # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins2058300991048675202.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F..F....F......F.. [ 3%] F.................................................................FFF... [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_s3.py FF [ 8%] tests/unit/test_tf4rec.py . [ 9%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ................................ [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 21%] ........................................s.. [ 24%] tests/unit/loader/test_torch_dataloader.py ............................. [ 26%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ [ 37%] ........................................... [ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py .. [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. 
[ 66%] .................... [ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:53,272 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a117a5c3563047ab7c1e46c936b45b04', 1)
Function: subgraph_callable-eec4959e-4b83-466a-b446-9bb87151
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr26/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr29')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:55,307 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-ac98640b3fa44ac29eff10c91786542c', 1)
Function: subgraph_callable-723280cd-5667-4086-b5a7-3509cc3a
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr29/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_________ test_dask_workflow_api_dlrm[True-None-True-None-150-csv-0.1] _________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr34')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:58,216 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-41153e01c8fc6f5939c438d5c8bb0aed', 0)
Function: subgraph_callable-2e6d8883-6283-40d5-8469-02fc19d6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr34/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
>       df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:02,469 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-d4101e91f9873c58557cd7d56b525793', 1) Function: subgraph_callable-04a503ca-8711-4a43-ad5f-9be72c3e args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr41/processed/part_1.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

____ test_dask_workflow_api_dlrm[True-None-False-None-0-csv-no-header-0.1] _____

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr44') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')} freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
>       df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:04,529 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-e697312ba72ed052bd71ceb256da36a4', 1) Function: subgraph_callable-a9e4f333-d96d-439a-9565-49dec56a args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr44/processed/part_1.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')} engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
>   df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-02 14:11:46,515 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 14) Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-02 14:11:46,516 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 15) Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:46,519 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 12) Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')} engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
>   df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:47,479 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 12) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,480 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 18) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,481 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 21) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,481 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 14) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 2) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 10) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 16) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,483 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 1) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,484 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 0) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,485 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 15) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,486 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 13) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 11) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 17) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 20) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,488 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 19) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,488 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 22) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-02 14:11:47,495 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 8) Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,498 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 6)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,499 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 5)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,500 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 3)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,507 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 4)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,511 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 7)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,514 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 9)
Function:  subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
>       df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???


???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:48,171 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 13)
Function:  subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:48,174 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 11)
Function:  subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_2.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:48,176 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 15)
Function:  subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
>       conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
>       raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
>           sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe918ad2b20>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
>       urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
>       retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
_stacktrace = <traceback object at 0x7fe9114049c0>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
>           raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
>       raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
>       httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa2d30>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
>       conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
>   self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
>   rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
>   self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
>   self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
>   self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
>   return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
>           self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def connect(self):
>   conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
>       raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-12/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-12/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y 0 Bob Frank 977 1039 0.430966 0.771394 ...la 935 975 -0.258980 0.125659 4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
>   with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe918ad2b20>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
>       raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
  File "/usr/local/bin/moto_server", line 5, in <module>
    from moto.server import main
  File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
    from moto.moto_server.werkzeug_app import (
  File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
>       conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
>           sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe9114914f0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
>       urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
>       retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
_stacktrace = <traceback object at 0x7fe918831d40>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
>       raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
>       raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
>       httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91af2e970>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
>       conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
>   self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
>   rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
>   self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
>   self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
>   self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
>   return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
>           self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def connect(self):
>   conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
>       raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-12/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-12/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'csv'
df = name-string id label x y 0 Frank 977 1039 0.430966 0.771394 1 Bob ... Ursula 935 975 -0.258980 0.125659 2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
>   with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe9114914f0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
>       raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y 0 Bob Frank 977 1039 0.430966 0.771394 ...la 935 975 -0.258980 0.125659 4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8905f5160>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>   dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


>   ???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0')
df = name-string id label x y 0 Frank 977 1039 0.430966 0.771394 1 Bob ... Ursula 935 975 -0.258980 0.125659 2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8e83beeb0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>   dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1')
df = name-string id label x y 0 Frank 977 1039 0.430966 0.771394 1 Bob ... Ursula 935 975 -0.258980 0.125659 2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c064d040>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>       dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y 0 Bob Frank 977 1039 0.430966 0.771394 ...la 935 975 -0.258980 0.125659 4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c8512ee0>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>       dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0')
df = name-string id label x y 0 Frank 977 1039 0.430966 0.771394 1 Bob ... Ursula 935 975 -0.258980 0.125659 2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c07c7220>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>       dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1')
df = name-string id label x y 0 Frank 977 1039 0.430966 0.771394 1 Bob ... Ursula 935 975 -0.258980 0.125659 2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c8621e80>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>       dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
  /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
  /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

nvtabular/loader/__init__.py:19
  /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
    warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
    warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
  /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
    warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
    warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
    warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
  /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
  A value is trying to be set on a copy of a slice from a DataFrame

  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 16 failed, 1415 passed, 1 skipped, 617 warnings in 736.74s (0:12:16) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4796112231564463528.sh

nvidia-merlin-bot avatar Aug 02 '22 14:08 nvidia-merlin-bot

I'm not able to reproduce these test failures locally, even in the merlin_ci_runner image. Going to try a re-run 🤷🏻

karlhigley avatar Aug 02 '22 14:08 karlhigley
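
One way to narrow down the repeated `ArrowInvalid: ... Parquet magic bytes not found in footer` failures above is to sniff the written files directly: a valid parquet file starts and ends with the 4-byte marker `PAR1`. This is a minimal diagnostic sketch (the `/tmp/output` glob is a placeholder, not one of the CI paths):

```python
import glob


def has_parquet_magic(path: str) -> bool:
    """Return True if the file carries the parquet magic bytes at both ends."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to 4 bytes before end-of-file
        tail = f.read(4)
    # Per the parquet format spec, both header and footer end in b"PAR1"
    return head == b"PAR1" and tail == b"PAR1"


# Check every file the workflow wrote (placeholder path)
for path in sorted(glob.glob("/tmp/output/*.parquet")):
    print(path, has_parquet_magic(path))
```

If the check fails for the freshly written `part_*.parquet` files, the problem is on the write side (`to_parquet` with `out_files_per_proc`/shuffle) rather than in pyarrow's reader.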

rerun tests

karlhigley avatar Aug 02 '22 14:08 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 242fc3657c847d7ed026dc657dc5a331c73ca015, no merge conflicts.
GitHub pull request #1609 of commit 242fc3657c847d7ed026dc657dc5a331c73ca015, no merge conflicts.
Running as SYSTEM
Setting status of 242fc3657c847d7ed026dc657dc5a331c73ca015 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4613/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 242fc3657c847d7ed026dc657dc5a331c73ca015^{commit} # timeout=10
Checking out Revision 242fc3657c847d7ed026dc657dc5a331c73ca015 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins7892443554037532412.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F....F............ [ 3%]
...F...............................................................F.... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:46,240 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-59cbff4bfa9b201755371def3a4a8ee0', 1)
Function:  subgraph_callable-7e8dc1fb-908b-45ec-a6cb-e042825e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr26/processed/part_1.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__________ test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr31')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:49,413 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-4fa23bb4f606e99d8314e594eb4d3c5d', 0)
Function:  subgraph_callable-432f28e8-49bd-45c5-869f-248b0670
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr31/processed/part_0.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr47')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:58,418 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-f624361c9960e8bfe9f17d1c64ec291a', 1)
Function:  subgraph_callable-66c6180a-b49f-4f3d-9cea-eba316c3
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr47/processed/part_1.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
  df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-02 14:51:39,276 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 10)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,277 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 14)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-02 14:51:39,280 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 17)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,281 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 20)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,281 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 22)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,282 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 19)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,282 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 18)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-02 14:51:39,312 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 21)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,315 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 16)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,315 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 23)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,316 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 25)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [1], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,319 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 27)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [3], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,323 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 26)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [2], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,324 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 28)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_7.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,326 - distributed.worker - WARNING - Compute Failed
Key:       ('read-parquet-c93258fabc7094400b097695615335f6', 24)
Function:  subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args:      ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [0], [])})
kwargs:    {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f1e9d7ec7c0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
_stacktrace = <traceback object at 0x7f1e6bb84ac0>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bc280>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-14/parquet0/dataset-0.parquet',
 '/tmp/pytest-of-jenkins/pytest-14/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'parquet'
df =     name-cat name-string   id  label         x         y
0     Yvonne      Xavier  991    986  0.157298 -0.169087
...ry  995  1027  0.992783 -0.835742
4320   Zelda        Gary  996    973  0.665933 -0.646899

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f1e9d7ec7c0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
  File "/usr/local/bin/moto_server", line 5, in <module>
    from moto.server import main
  File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
    from moto.moto_server.werkzeug_app import (
  File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
_____________________________ test_s3_dataset[csv] _____________________________
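Worth noting: every `Connection refused` above traces back to the captured stderr, not to the dataset code under test. `moto_server` crashes on import because this environment pairs a Flask release whose `globals.py` passes `unbound_message=` to werkzeug's `LocalProxy` (Flask >= 2.2) with an older Werkzeug that does not accept that keyword, so the mock S3 server never binds port 5000. A minimal sketch of the version check (`flask_werkzeug_compatible` is a hypothetical helper for illustration, not part of the test suite):

```python
def flask_werkzeug_compatible(flask_version: str, werkzeug_version: str) -> bool:
    """Flask >= 2.2 passes ``unbound_message=`` to werkzeug's LocalProxy;
    only Werkzeug >= 2.2 accepts it. Older pairings raise the TypeError
    seen in the captured stderr, and moto_server never starts."""

    def major_minor(version: str) -> tuple:
        # Compare only the major.minor components of a version string.
        return tuple(int(part) for part in version.split(".")[:2])

    if major_minor(flask_version) >= (2, 2):
        return major_minor(werkzeug_version) >= (2, 2)
    return True  # older Flask doesn't use the new keyword
```

Pinning `werkzeug` to a release that matches the installed Flask (or pinning both together) in the CI image should let the mock server start and unblock these tests.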

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f1e69705610>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
_stacktrace = <traceback object at 0x7f1e6bb99e80>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
>       httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d488760>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
>       conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
>   self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
>   rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
>   self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
>   self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
>   self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
>   return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
>           self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def connect(self):
>   conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
>       raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E       urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-14/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-14/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'csv'
df = name-string id label x y 0 Xavier 991 986 0.157298 -0.169087 1 Jerry ... Jerry 995 1027 0.992783 -0.835742 2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
>   with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f1e69705610>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
>       raise EndpointConnectionError(endpoint_url=request.url, error=e)

E       botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y 0 Yvonne Xavier 991 986 0.157298 -0.169087 ...ry 995 1027 0.992783 -0.835742 4320 Zelda Gary 996 973 0.665933 -0.646899

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1dd8781f70>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>   dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E       pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0')
df = name-string id label x y 0 Xavier 991 986 0.157298 -0.169087 1 Jerry ... Jerry 995 1027 0.992783 -0.835742 2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1e14793ee0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>   dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E       pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1')
df = name-string id label x y 0 Xavier 991 986 0.157298 -0.169087 1 Jerry ... Jerry 995 1027 0.992783 -0.835742 2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1d2cf76fd0>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
>   dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E       pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y 0 Yvonne Xavier 991 986 0.157298 -0.169087 ...ry 995 1027 0.992783 -0.835742 4320 Zelda Gary 996 973 0.665933 -0.646899

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1d2cf1c760>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0')
df = name-string id label x y 0 Xavier 991 986 0.157298 -0.169087 1 Jerry ... Jerry 995 1027 0.992783 -0.835742 2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1e147b62e0>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1')
df = name-string id label x y 0 Xavier 991 986 0.157298 -0.169087 1 Jerry ... Jerry 995 1027 0.992783 -0.835742 2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1dd0f569a0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
    self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
    self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
    return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
    dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
    return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
    return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
    ???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
    ???


???
E   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
  /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
  /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

nvtabular/loader/__init__.py:19
  /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
    warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
    warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
  /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
    warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
    warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
    warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
  /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 12 failed, 1419 passed, 1 skipped, 617 warnings in 709.70s (0:11:49) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins9203679199988082363.sh

nvidia-merlin-bot avatar Aug 02 '22 15:08 nvidia-merlin-bot

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4626/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 8bd1260ba233898308f1416f79cefbd75013f4ff # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins15266300073057526636.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...F.. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____________________________ test_movielens_example ____________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0')

def test_movielens_example(tmpdir):
    _get_random_movielens_data(tmpdir, 10000, dataset="movie")
    _get_random_movielens_data(tmpdir, 10000, dataset="ratings")
    _get_random_movielens_data(tmpdir, 5000, dataset="ratings", valid=True)

    triton_model_path = os.path.join(tmpdir, "models")
    os.environ["INPUT_DATA_DIR"] = str(tmpdir)
    os.environ["MODEL_PATH"] = triton_model_path

    notebook_path = os.path.join(
        dirname(TEST_PATH),
        "examples/getting-started-movielens/",
        "02-ETL-with-NVTabular.ipynb",
    )
    _run_notebook(tmpdir, notebook_path)

    def _modify_tf_nb(line):
        return line.replace(
            # don't require graphviz/pydot
            "tf.keras.utils.plot_model(model)",
            "# tf.keras.utils.plot_model(model)",
        )

    def _modify_tf_triton(line):
        # models are already preloaded
        line = line.replace("triton_client.load_model", "# triton_client.load_model")
        line = line.replace("triton_client.unload_model", "# triton_client.unload_model")
        return line

    notebooks = []
    try:
        import torch  # noqa

        notebooks.append("03-Training-with-PyTorch.ipynb")
    except Exception:
        pass
    try:
        import nvtabular.inference.triton  # noqa
        import nvtabular.loader.tensorflow  # noqa

        notebooks.append("03-Training-with-TF.ipynb")
        has_tf = True

    except Exception:
        has_tf = False

    for notebook in notebooks:
        notebook_path = os.path.join(
            dirname(TEST_PATH),
            "examples/getting-started-movielens/",
            notebook,
        )
        if notebook == "03-Training-with-TF.ipynb":
            _run_notebook(tmpdir, notebook_path, transform=_modify_tf_nb)
        else:
            _run_notebook(tmpdir, notebook_path)

    # test out the TF inference movielens notebook if appropriate
    if has_tf and TRITON_SERVER_PATH:
        notebook = "04-Triton-Inference-with-TF.ipynb"
        notebook_path = os.path.join(
            dirname(TEST_PATH),
            "examples/getting-started-movielens/",
            notebook,
        )
        with run_triton_server(triton_model_path):
          _run_notebook(tmpdir, notebook_path, transform=_modify_tf_triton)

tests/unit/test_notebooks.py:224:


tests/unit/test_notebooks.py:307: in _run_notebook
    subprocess.check_output([sys.executable, script_path])
/usr/lib/python3.8/subprocess.py:415: in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,


input = None, capture_output = False, timeout = None, check = True
popenargs = (['/usr/bin/python3', '/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py'],)
kwargs = {'stdout': -1}, process = <subprocess.Popen object at 0x7f8489dd5160>
stdout = b"client created.\nGET /v2/health/live, headers None\n<HTTPSocketPoolResponse status=400 headers={'content-length': '0', 'content-type': 'text/plain'}>\nPOST /v2/repository/index, headers None\n\n"
stderr = None, retcode = 1

def run(*popenargs,
        input=None, capture_output=False, timeout=None, check=False, **kwargs):
    """Run command with arguments and return a CompletedProcess instance.

    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.

    If check is True and the exit code was non-zero, it raises a
    CalledProcessError. The CalledProcessError object will have the return code
    in the returncode attribute, and output & stderr attributes if those streams
    were captured.

    If timeout is given, and the process takes too long, a TimeoutExpired
    exception will be raised.

    There is an optional argument "input", allowing you to
    pass bytes or a string to the subprocess's stdin.  If you use this argument
    you may not also use the Popen constructor's "stdin" argument, as
    it will be used internally.

    By default, all communication is in bytes, and therefore any "input" should
    be bytes, and the stdout and stderr will be bytes. If in text mode, any
    "input" should be a string, and stdout and stderr will be strings decoded
    according to locale encoding, or by "encoding" if set. Text mode is
    triggered by setting any of text, encoding, errors or universal_newlines.

    The other arguments are the same as for the Popen constructor.
    """
    if input is not None:
        if kwargs.get('stdin') is not None:
            raise ValueError('stdin and input arguments may not both be used.')
        kwargs['stdin'] = PIPE

    if capture_output:
        if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
            raise ValueError('stdout and stderr arguments may not be used '
                             'with capture_output.')
        kwargs['stdout'] = PIPE
        kwargs['stderr'] = PIPE

    with Popen(*popenargs, **kwargs) as process:
        try:
            stdout, stderr = process.communicate(input, timeout=timeout)
        except TimeoutExpired as exc:
            process.kill()
            if _mswindows:
                # Windows accumulates the output in a single blocking
                # read() call run on child threads, with the timeout
                # being done in a join() on those threads.  communicate()
                # _after_ kill() is required to collect that and add it
                # to the exception.
                exc.stdout, exc.stderr = process.communicate()
            else:
                # POSIX _communicate already populated the output so
                # far into the TimeoutExpired exception.
                process.wait()
            raise
        except:  # Including KeyboardInterrupt, communicate handled that.
            process.kill()
            # We don't call process.wait() as .__exit__ does that for us.
            raise
        retcode = process.poll()
        if check and retcode:
          raise CalledProcessError(retcode, process.args,
                                     output=stdout, stderr=stderr)

E subprocess.CalledProcessError: Command '['/usr/bin/python3', '/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py']' returned non-zero exit status 1.

/usr/lib/python3.8/subprocess.py:516: CalledProcessError
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
2022-08-15 13:55:34.039352: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-15 13:55:35.023527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-15 13:55:35.024322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14532 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.11) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
  warnings.warn(
WARNING:absl:Function _wrapped_model contains input name(s) movieId, userId with unsupported characters which will be renamed to movieid, userid in the SavedModel.
WARNING:absl:<nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7fa291fd2be0> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
WARNING:absl:Function _wrapped_model contains input name(s) movieId, userId with unsupported characters which will be renamed to movieid, userid in the SavedModel.
WARNING:absl:<nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7fa291fd2be0> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
I0815 13:55:43.015590 13149 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f151e000000' with size 268435456 I0815 13:55:43.016394 13149 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864 I0815 13:55:43.019651 13149 model_repository_manager.cc:1191] loading: movielens_tf:1 I0815 13:55:43.119889 13149 model_repository_manager.cc:1191] loading: movielens_nvt:1 I0815 13:55:43.402054 13149 tensorflow.cc:2204] TRITONBACKEND_Initialize: tensorflow I0815 13:55:43.402090 13149 tensorflow.cc:2214] Triton TRITONBACKEND API version: 1.10 I0815 13:55:43.402097 13149 tensorflow.cc:2220] 'tensorflow' TRITONBACKEND API version: 1.10 I0815 13:55:43.402103 13149 tensorflow.cc:2244] backend configuration: {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}} I0815 13:55:43.402139 13149 tensorflow.cc:2310] TRITONBACKEND_ModelInitialize: movielens_tf (version 1) I0815 13:55:43.406327 13149 backend.cc:46] TRITONBACKEND_Initialize: nvtabular I0815 13:55:43.406368 13149 backend.cc:53] Triton TRITONBACKEND API version: 1.10 I0815 13:55:43.406385 13149 backend.cc:56] 'nvtabular' TRITONBACKEND API version: 1.10 I0815 13:55:43.406630 13149 backend.cc:76] Loaded libpython successfully I0815 13:55:43.619111 13149 backend.cc:89] Python interpreter is initialized I0815 13:55:43.619191 13149 tensorflow.cc:2359] TRITONBACKEND_ModelInstanceInitialize: movielens_tf (GPU device 0) 2022-08-15 13:55:44.030716: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel 2022-08-15 13:55:44.033884: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve } 2022-08-15 13:55:44.036136: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: 
/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel 2022-08-15 13:55:44.036262: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2022-08-15 13:55:44.075003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11486 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0 2022-08-15 13:55:44.105698: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled 2022-08-15 13:55:44.107504: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle. 2022-08-15 13:55:44.157853: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel 2022-08-15 13:55:44.184293: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 153598 microseconds. 
I0815 13:55:44.184515 13149 model_repository_manager.cc:1345] successfully loaded 'movielens_tf' version 1 I0815 13:55:44.185598 13149 model_inst_state.hpp:58] Loading TritonPythonModel from module 'nvtabular.inference.triton.workflow_model' I0815 13:55:47.035470 13149 model_repository_manager.cc:1345] successfully loaded 'movielens_nvt' version 1 I0815 13:55:47.035885 13149 model_repository_manager.cc:1191] loading: movielens:1 I0815 13:55:47.136385 13149 model_repository_manager.cc:1345] successfully loaded 'movielens' version 1 I0815 13:55:47.136538 13149 server.cc:556] +------------------+------+ | Repository Agent | Path | +------------------+------+ +------------------+------+

I0815 13:55:47.136652 13149 server.cc:583] +------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Backend | Path | Config | +------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}} | | nvtabular | /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} | +------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 13:55:47.136769 13149 server.cc:626] +---------------+---------+--------+ | Model | Version | Status | +---------------+---------+--------+ | movielens | 1 | READY | | movielens_nvt | 1 | READY | | movielens_tf | 1 | READY | +---------------+---------+--------+

I0815 13:55:47.195489 13149 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB I0815 13:55:47.196424 13149 tritonserver.cc:2159] +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Option | Value | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | server_id | triton | | server_version | 2.23.0 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace | | model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models | | model_control_mode | MODE_NONE | | strict_model_config | 1 | | rate_limit | OFF | | pinned_memory_pool_byte_size | 268435456 | | cuda_memory_pool_byte_size{0} | 67108864 | | response_cache_byte_size | 0 | | min_supported_compute_capability | 6.0 | | strict_readiness | 1 | | exit_timeout | 30 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

E0815 13:55:47.197036931 13149 server_chttp2.cc:40] {"created":"@1660571747.196983913","description":"No address added out of total 1 resolved","file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1660571747.196982030","description":"Failed to add any wildcard listeners","file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1660571747.196960677","description":"Address family not supported by protocol","errno":97,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":395,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:8001"},{"created":"@1660571747.196981667","description":"Unable to configure socket","fd":43,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1660571747.196978868","description":"Address already in use","errno":98,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]} E0815 13:55:47.197111 13149 main.cc:825] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use W0815 13:55:48.222600 13149 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0 /usr/local/lib/python3.8/dist-packages/tritonhttpclient/__init__.py:31: DeprecationWarning: The package tritonhttpclient is deprecated and will be removed in a future version. 
Please use instead tritonclient.http warnings.warn( Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 163, in get_socket return self._socket_queue.get(block=False) File "src/gevent/queue.py", line 335, in gevent._gevent_cqueue.Queue.get File "src/gevent/queue.py", line 350, in gevent._gevent_cqueue.Queue.get File "src/gevent/queue.py", line 319, in gevent._gevent_cqueue.Queue._Queue__get_or_peek

_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py", line 43, in triton_client.get_model_repository_index() File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 619, in get_model_repository_index response = self._post(request_uri=request_uri, File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 313, in _post response = self._client_stub.post(request_uri=request_uri, File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post return self.request(METHOD_POST, request_uri, body=body, headers=headers) File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 226, in request sock = self._connection_pool.get_socket() File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 166, in get_socket return self._create_socket() File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 127, in _create_socket raise first_error File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 114, in _create_socket sock = self._connect_socket(sock, sock_info[-1]) File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 136, in _connect_socket sock.connect(address) File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 607, in connect raise _SocketError(err, strerror(err)) ConnectionRefusedError: [Errno 111] Connection refused =============================== warnings summary =============================== ../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33 /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)

nvtabular/loader/__init__.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader. warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1] /usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first self.make_current()

tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(

tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html =========================== short test summary info ============================ FAILED tests/unit/test_notebooks.py::test_movielens_example - subprocess.Call... ===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 764.65s (0:12:44) ====== Build step 'Execute shell' marked build as failure Performing Post build task... Match found for : : True Logical operation result is TRUE Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log" [nvtabular_tests] $ /bin/bash /tmp/jenkins14802151835763005993.sh

nvidia-merlin-bot avatar Aug 15 '22 14:08 nvidia-merlin-bot

rerun tests

karlhigley avatar Aug 15 '22 14:08 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4627/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins5374540968505043348.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%] ........................................................................ [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_tf4rec.py . [ 8%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ..............................FF [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 20%] ........................................s.. [ 23%] tests/unit/loader/test_torch_dataloader.py ............................. [ 25%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ [ 37%] ........................................... [ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py .. [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. [ 66%] .................... 
[ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py ...... [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]

=================================== FAILURES =================================== _________________________ test_groupby_model[pytorch] __________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_groupby_model_pytorch_0') output_model = 'pytorch'

@pytest.mark.skipif(TRITON_SERVER_PATH is None, reason="Requires tritonserver on the path")
@pytest.mark.parametrize("output_model", ["tensorflow", "pytorch"])
def test_groupby_model(tmpdir, output_model):
    size = 20
    df = make_df(
        {
            "id": np.random.choice([0, 1], size=size),
            "ts": np.linspace(0.0, 10.0, num=size),
            "x": np.arange(size),
            "y": np.linspace(0.0, 10.0, num=size),
        }
    )

    groupby_features = ColumnSelector(["id", "ts", "x", "y"]) >> ops.Groupby(
        groupby_cols=["id"],
        sort_cols=["ts"],
        aggs={
            "x": ["sum"],
            "y": ["first"],
        },
        name_sep="-",
    )
    workflow = nvt.Workflow(groupby_features)
>   _verify_workflow_on_tritonserver(
        tmpdir, workflow, df, "groupby", output_model, cats=["id", "y-first"], conts=["x-sum"]
    )

tests/unit/test_triton_inference.py:379:


tests/unit/test_triton_inference.py:112: in _verify_workflow_on_tritonserver response = client.infer(model_name, inputs, outputs=outputs) /usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:1322: in infer raise_error_grpc(rpc_error)


rpc_error = <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "Socket closed" debug_err....0.0.1:8001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"

def raise_error_grpc(rpc_error):
>   raise get_error_grpc(rpc_error) from None

E tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed

/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:62: InferenceServerException ----------------------------- Captured stderr call ----------------------------- I0815 14:14:56.462962 26696 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f3044000000' with size 268435456 I0815 14:14:56.463717 26696 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864 I0815 14:14:56.466128 26696 model_repository_manager.cc:1191] loading: groupby:1 I0815 14:14:56.573468 26696 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: groupby (GPU device 0) I0815 14:14:58.798412 26696 model_repository_manager.cc:1345] successfully loaded 'groupby' version 1 I0815 14:14:58.798619 26696 server.cc:556] +------------------+------+ | Repository Agent | Path | +------------------+------+ +------------------+------+

I0815 14:14:58.798723 26696 server.cc:583] +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Backend | Path | Config | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} | +---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 14:14:58.798770 26696 server.cc:626] +---------+---------+--------+ | Model | Version | Status | +---------+---------+--------+ | groupby | 1 | READY | +---------+---------+--------+

I0815 14:14:58.863077 26696 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB I0815 14:14:58.863960 26696 tritonserver.cc:2159] +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Option | Value | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | server_id | triton | | server_version | 2.23.0 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace | | model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-13/test_groupby_model_pytorch_0 | | model_control_mode | MODE_NONE | | strict_model_config | 1 | | rate_limit | OFF | | pinned_memory_pool_byte_size | 268435456 | | cuda_memory_pool_byte_size{0} | 67108864 | | response_cache_byte_size | 0 | | min_supported_compute_capability | 6.0 | | strict_readiness | 1 | | exit_timeout | 30 | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 14:14:58.864892 26696 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001 I0815 14:14:58.865437 26696 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000 I0815 14:14:58.906788 26696 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002 W0815 14:14:59.883731 26696 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0 Signal (11) received. 0# 0x000055B88900E699 in /opt/tritonserver/bin/tritonserver 1# 0x00007F308B67C090 in /usr/lib/x86_64-linux-gnu/libc.so.6 2# 0x00007F30811D68C2 in /opt/tritonserver/backends/python/libtriton_python.so 3# 0x00007F30811A2F10 in /opt/tritonserver/backends/python/libtriton_python.so 4# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/python/libtriton_python.so 5# 0x00007F308BF2C5CA in /opt/tritonserver/lib/libtritonserver.so 6# 0x00007F308BF2CCF7 in /opt/tritonserver/lib/libtritonserver.so 7# 0x00007F308BFECE11 in /opt/tritonserver/lib/libtritonserver.so 8# 0x00007F308BF26C47 in /opt/tritonserver/lib/libtritonserver.so 9# 0x00007F308BA6BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6 10# 0x00007F308CC7C609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0 11# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

______________________ test_seq_etl_tf_model[tensorflow] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_seq_etl_tf_model_tensorfl0') output_model = 'tensorflow'

@pytest.mark.skipif(TRITON_SERVER_PATH is None, reason="Requires tritonserver on the path")
@pytest.mark.parametrize("output_model", ["tensorflow"])
def test_seq_etl_tf_model(tmpdir, output_model):
    size = 100
    max_length = 10
    df = make_df(
        {
            "id": np.random.choice([0, 1], size=size),
            "item_id": np.random.randint(1, 10, size),
            "ts": np.linspace(0.0, 10.0, num=size).astype(np.float32),
            "y": np.linspace(0.0, 10.0, num=size).astype(np.float32),
        }
    )

    groupby_features = ColumnSelector(["id", "item_id", "ts", "y"]) >> ops.Groupby(
        groupby_cols=["id"],
        sort_cols=["ts"],
        aggs={
            "item_id": ["list"],
            "y": ["list"],
        },
        name_sep="-",
    )
    feats_list = groupby_features["item_id-list", "y-list"]
    feats_trim = feats_list >> ops.ListSlice(0, max_length, pad=True)
    selected_features = groupby_features["id"] + feats_trim

    workflow = nvt.Workflow(selected_features)

    sparse_max = {"item_id-list": max_length, "y-list": max_length}
>   _verify_workflow_on_tritonserver(
        tmpdir,
        workflow,
        df,
        "groupby",
        output_model,
        sparse_max,
        cats=["id", "item_id-list"],
        conts=["y-list"],
    )

tests/unit/test_triton_inference.py:415:


tests/unit/test_triton_inference.py:111: in _verify_workflow_on_tritonserver with run_triton_server(tmpdir) as client: /usr/lib/python3.8/contextlib.py:113: in enter return next(self.gen)


modelpath = local('/tmp/pytest-of-jenkins/pytest-13/test_seq_etl_tf_model_tensorfl0')

@contextlib.contextmanager
def run_triton_server(modelpath):
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
                        raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

                    try:
                        ready = client.is_server_ready()
                    except tritonclient.utils.InferenceServerException:
                        ready = False

                    if ready:
                        yield client
                        return

                    time.sleep(1)
>           raise RuntimeError("Timed out waiting for tritonserver to become ready")

E RuntimeError: Timed out waiting for tritonserver to become ready

tests/unit/test_triton_inference.py:62: RuntimeError
----------------------------- Captured stderr call -----------------------------
0815 14:15:00.865426 26705 pb_stub.cc:1006] Non-graceful termination detected.
I0815 14:15:01.120920 26916 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6136000000' with size 268435456
I0815 14:15:01.121618 26916 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0815 14:15:01.123866 26916 model_repository_manager.cc:1191] loading: groupby:1
I0815 14:15:01.231143 26916 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: groupby (GPU device 0)
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
  /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
  /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

nvtabular/loader/__init__.py:19
  /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
  /usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
    self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
    warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
  /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
    warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
    warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
    warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
  /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
  A value is trying to be set on a copy of a slice from a DataFrame

  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_triton_inference.py::test_groupby_model[pytorch] - tri...
FAILED tests/unit/test_triton_inference.py::test_seq_etl_tf_model[tensorflow]
===== 2 failed, 1427 passed, 2 skipped, 618 warnings in 803.76s (0:13:23) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins9494905553368145523.sh

nvidia-merlin-bot avatar Aug 15 '22 14:08 nvidia-merlin-bot

rerun tests

karlhigley avatar Aug 15 '22 14:08 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4628/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins16619067910149829981.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [  3%]
........................................................................ [  8%]
....                                                                     [  8%]
tests/unit/test_notebooks.py ...F
Build was aborted
Aborted by admin
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins15746326830681704448.sh

nvidia-merlin-bot avatar Aug 15 '22 15:08 nvidia-merlin-bot

rerun tests

karlhigley avatar Aug 15 '22 17:08 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4632/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins3170339596225298332.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [  3%]
........................................................................ [  8%]
....                                                                     [  8%]
tests/unit/test_notebooks.py ......                                      [  8%]
tests/unit/test_tf4rec.py .                                              [  8%]
tests/unit/test_tools.py ......................                          [ 10%]
tests/unit/test_triton_inference.py ......FFF....................
Build was aborted
Aborted by admin
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins13035443765677779958.sh

nvidia-merlin-bot avatar Aug 15 '22 18:08 nvidia-merlin-bot

The tests for this keep hanging on the multi-GPU Jenkins machine. Not sure if it's an issue with this PR specifically, or NVTabular PRs in general...

karlhigley avatar Aug 15 '22 18:08 karlhigley

Click to view CI Results
GitHub pull request #1609 of commit 35f7c158c6023ef878644de0b65dbdfa3d28b609, no merge conflicts.
Running as SYSTEM
Setting status of 35f7c158c6023ef878644de0b65dbdfa3d28b609 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4633/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 35f7c158c6023ef878644de0b65dbdfa3d28b609^{commit} # timeout=10
Checking out Revision 35f7c158c6023ef878644de0b65dbdfa3d28b609 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 35f7c158c6023ef878644de0b65dbdfa3d28b609 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins5945207459896974934.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [  3%]
........................................................................ [  8%]
....                                                                     [  8%]
tests/unit/test_notebooks.py ......                                      [  8%]
tests/unit/test_tf4rec.py .                                              [  8%]
tests/unit/test_tools.py ......................                          [ 10%]
tests/unit/test_triton_inference.py ................................     [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py .                  [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
...................................................                      [ 18%]
tests/unit/framework_utils/test_torch_layers.py .                        [ 18%]
tests/unit/loader/test_dataloader_backend.py ......                      [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s..                              [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
......................................................                   [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
...........................................                              [ 40%]
tests/unit/ops/test_column_similarity.py ........................        [ 42%]
tests/unit/ops/test_drop_low_cardinality.py ..                           [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........                                                                 [ 45%]
tests/unit/ops/test_groupyby.py .....................                    [ 47%]
tests/unit/ops/test_hash_bucket.py .........................             [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
..................................                                       [ 59%]
tests/unit/ops/test_lambda.py ..........                                 [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
..                                                                       [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
....................                                                     [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
.......................................                                  [ 88%]
tests/unit/ops/test_reduce_dtype_size.py ..                              [ 88%]
tests/unit/ops/test_target_encode.py .....................               [ 89%]
tests/unit/workflow/test_cpu_workflow.py ......                          [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
..........................................................               [ 96%]
tests/unit/workflow/test_workflow_chaining.py ...                        [ 96%]
tests/unit/workflow/test_workflow_node.py ...........                    [ 97%]
tests/unit/workflow/test_workflow_ops.py ...                             [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
...                                                                      [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
  /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
  /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

nvtabular/loader/__init__.py:19
  /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
  /usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
    self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
    warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
    warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
  /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
    warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
  /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
    warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
    warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
  /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
  A value is trying to be set on a copy of a slice from a DataFrame

  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
    warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
  /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1429 passed, 2 skipped, 618 warnings in 699.38s (0:11:39) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins3684335047674136531.sh

nvidia-merlin-bot avatar Aug 15 '22 18:08 nvidia-merlin-bot