Skip to content

Check if input_file is directory and then add all files within that directory#336

Merged
hrfuller merged 6 commits intobazel-contrib:masterfrom
AutomatedTester:check_directory
Jan 26, 2021
Merged

Check if input_file is directory and then add all files within that directory#336
hrfuller merged 6 commits intobazel-contrib:masterfrom
AutomatedTester:check_directory

Conversation

@AutomatedTester
Copy link
Copy Markdown
Contributor

If in a dependency tree a directory is used as out this is passed as
input_file to wheelmaker which then generates an unhandled error as
it tries to write the directory as a file.

@thundergolfer
Copy link
Copy Markdown

thundergolfer commented Jul 10, 2020

Adding links as extra context/breadcrumb:

(Unfortunately these links will decay as the Slack workspace is on a free Slack plan with limited message history).

# it will be passed in as an input file and we need to
# ignore.
if os.path.isdir(real_filename):
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't there be a TODO here to actually copy in the directory contents? this PR is just a temp workaround right?
"we need to ignore" seems surprising from a user perspective

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see a case where people will want an empty directory as part of their build. They may have another step that populates it but that case is already taken care of

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

AutomatedTester commented Jul 10, 2020 via email

@alexeagle
Copy link
Copy Markdown
Contributor

I don't think there should be a collision because the bazel-out folder doesn't have one

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

The reason I hit this situation is I have a py_library that has a step that generates code and is copied in it's data attribute.

When we get to the py_wheel stage, according to --sandbox_debug the files in the directory have already been added as --input_file and then near the bottom of the list there is a --input_file=OUTS/dir/from/another/target. Since this is after the files, in my case, it feels like I would overwrite them. If you remove this patch from the WORKSPACE and then run https://github.com/SeleniumHQ/selenium/blob/trunk/py/BUILD.bazel#L69-L87 you will be able to recreate my issue.

@pstradomski
Copy link
Copy Markdown
Contributor

pstradomski commented Jul 11, 2020 via email

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

@alexeagle or @thundergolfer could you have a look and see what is left for me to do here?

@aaliddell
Copy link
Copy Markdown
Contributor

From a user's perspective, silently dropping directories seems like a pretty big footgun; either it should add support for directories or the action should fail completely with a descriptive error.

Also, a directory is a perfectly valid output from a rule using declare_directory. The proper way of handling a directory as a rule input is to use an Args object, which can optionally expand a directory to it's constituent parts using add_all:

Each directory File item is replaced by all Files recursively contained in that directory.

Alternatively, you can pass the directory to the action script and unpack the directory yourself, as was previously suggested. Technically a path collision shouldn't be possible, but perhaps with the addition of prefix stripping/prepending it could happen: perhaps that case is rare enough that we just bail out with an error at that point.

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

@aaliddell Could you give me an example rule for this?

As mentioned before I am not sure why someone would want an empty directory in a wheel. It's always going to have some file in the directory likely so it will be created anyway.

If we are wanting to accept empty directories, should we be adding a new argument to the wheelmaker because this issue was because a directory was being passed to input_file. We could add a new one called input and input_list

@pstradomski
Copy link
Copy Markdown
Contributor

pstradomski commented Oct 29, 2020 via email

@aaliddell
Copy link
Copy Markdown
Contributor

I agree that a completely empty directory may be valid to skip within a wheel; it's the dropping all directories regardless of contents that's my concern.

For an example of where this sort of thing comes up, I often run into this when dealing with gRPC protoc plugins, where the output of a plugin is non-deterministic from file names, which requires capturing the output as a directory rather than a set of files. As an example, we may capture a folder full of generated .py files that should all be included in the py_library, wheel etc (the python plugin is actually well behaved, but this is just an example).

From Bazel's perspective, it treats the directory as a single File unit, meaning you shouldn't simultaneously have overlapping references to the directory and a file contained within it whilst in the Bazel starlark world. Therefore, if wheelmaker is being passed both a file and a directory that contains that file, then it appears there is a bug in the way the command line args are being constructed.

Using Args.add_all may again be the suitable route, since if a directory is empty it will unpack to no args, so in that case wheelmaker shouldn't receive a directory at all. That way wheelmaker needs no updates, since it is then guarenteed to only receive valid files. Although we'd have to check how a genrule generated folder behaves though, since it may be different from one created by declare_directory.

@aaliddell
Copy link
Copy Markdown
Contributor

It appears a genrule that creates a folder is actually invalid, since it should enumerate every file it creates: https://docs.bazel.build/versions/master/be/general.html#genrule.outs

Avoid creating symlinks and directories. Bazel doesn't copy over the directory/symlink structure created by genrules and its dependency checking of directories is unsound.

So Args.add_all could be used to handle the declare_directory directories by unpacking them to files and if wheelmaker still receives a non-file then something quite broken has happened (e.g. invalid genrule); in which case bailing out with an error seems sensible.

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

I agree that a completely empty directory may be valid to skip within a wheel; it's the dropping all directories regardless of contents that's my concern.

This patch only drops if there is --input-file=some/directory but if you then pass in --input-file=some/directory/some_file.py that file, and the directory that was dropped originally, will be added.

For an example of where this sort of thing comes up, I often run into this when dealing with gRPC protoc plugins, where the output of a plugin is non-deterministic from file names, which requires capturing the output as a directory rather than a set of files. As an example, we may capture a folder full of generated .py files that should all be included in the py_library, wheel etc (the python plugin is actually well behaved, but this is just an example).

This patch handles this case. If you see this genrule creates a directory and populates it with .py files and is added to a py_package via a py_library.

Without this patch, I can't generate a wheel because somewhere it passes --input-file=directory/created/by/genrule and then add_file in wheelmaker.py tries to write the directory as a file.

We can (and this is only for wheels)

  • ignore the directory creation but they will be created if there are files within that directory (this PR)

OR

  • Add the potentially empty directory to the wheel and don't care that there may be a random empty directory

@pstradomski
Copy link
Copy Markdown
Contributor

pstradomski commented Oct 29, 2020 via email

@aaliddell
Copy link
Copy Markdown
Contributor

aaliddell commented Oct 29, 2020

Here's a demo of what I'm getting at, based on this PR's latest commit: aaliddell@7b1f2cd

It declares a directory that contains a single python file called dir_generated/demo.py and also declares a single file file_generated.py; these are added as deps to the wheel. Here's the input file list content:

experimental/examples/wheel/lib/module_with_data.py;experimental/examples/wheel/lib/module_with_data.py
experimental/examples/wheel/lib/simple_module.py;experimental/examples/wheel/lib/simple_module.py
experimental/examples/wheel/dir_generated;bazel-out/k8-fastbuild/bin/experimental/examples/wheel/dir_generated
experimental/examples/wheel/file_generated.py;bazel-out/k8-fastbuild/bin/experimental/examples/wheel/file_generated.py

As you can see, the file list contains the generated directory and the generated file, but demo.py itself within the dir is not there. So 'ignore the directory creation but they will be created if there are files within that directory' does not hold under the case of a declared directory containing files. As a result of the dirs being dropped, the file demo.py is silently not in the generated wheel file, but the file file_generated.py is. This is what I meant by the footgun.

Prior to the commits in the PR, running the same patch causes the error you mentioned in the initial PR comment, which is at least a more visible failure than silently missing files in the wheel.

Without this patch, I can't generate a wheel because somewhere it passes --input-file=directory/created/by/genrule and then add_file in wheelmaker.py tries to write the directory as a file.

I think it would be useful to establish why both the file and the directory are being passed to wheelmaker. That seems like the source of this issue for your use case. If the files are being passed, the directory containing them shouldn't be in the input file list in the first place.

If you see this genrule creates a directory

As mentioned above, a genrule that creates a dir without enumerating the contained files is not recommended behaviour, which might be the source of you seeing the overlapping file and dir in the input file list.

After re-reading PEP 427, I think we should be safe to just ignore all directories when build a wheel.

The dirs themselves shouldn't appear in the RECORD file, but the contents of these dirs should be enumerated there and copied to the wheel.

@AutomatedTester
Copy link
Copy Markdown
Contributor Author

I have changed this patch to now walk the directory it is given and add those files. It should do this recursively. I have tested this out on the Selenium Project and it works

@AutomatedTester AutomatedTester changed the title Check if input_file is directory and ignore Check if input_file is directory and then add all files within that directory Dec 16, 2020
If in a dependency tree a directory is used as out this is passed as
input_file to wheelmaker which then generates an unhandled error as
it tries to write the directory as a file.
Copy link
Copy Markdown
Contributor

@hrfuller hrfuller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation and tests look good. Thanks for taking the time to fix this.


return normalized_arcname

if os.path.isdir(real_filename):
Copy link
Copy Markdown
Contributor

@hrfuller hrfuller Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to know your rationale for doing this here instead of in the starlark rules here as @aaliddell discussed in the comments.

Edit: I realized the starlark api doesn't give a way to iterate through a directories children, except with Args.add_all which isn't useful here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say, I couldn't do it there.

@hrfuller
Copy link
Copy Markdown
Contributor

@thundergolfer would appreciate your eyes before this gets merged.

@hrfuller hrfuller merged commit 0cd570e into bazel-contrib:master Jan 26, 2021
@AutomatedTester AutomatedTester deleted the check_directory branch January 27, 2021 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants