The Wayback Machine - https://web.archive.org/web/20210104141422/https://github.com/bazelbuild/rules_python/pull/336
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if input_file is directory and then add all files within that directory #336

Open
wants to merge 5 commits into
base: master
from

Conversation

@AutomatedTester
Copy link
Contributor

@AutomatedTester AutomatedTester commented Jul 9, 2020

If in a dependency tree a directory is used as out this is passed as
input_file to wheelmaker which then generates an unhandled error as
it tries to write the directory as a file.

@thundergolfer
Copy link
Collaborator

@thundergolfer thundergolfer commented Jul 10, 2020

Adding links as extra context/breadcrumb:

(Unfortunately these links will decay as the Slack workspace is on a free Slack plan with limited message history).

experimental/examples/wheel/BUILD Outdated Show resolved Hide resolved
experimental/rules_python/wheelmaker.py Outdated Show resolved Hide resolved
# it will be passed in as an input file and we need to
# ignore.
if os.path.isdir(real_filename):
return
arcname = arcname_from(package_filename)

self._zipfile.write(real_filename, arcname=arcname)

This comment has been minimized.

@thundergolfer

thundergolfer Jul 10, 2020
Collaborator

Double checking that we actually don't want to write the directory as a file? That seems right.

This comment has been minimized.

@AutomatedTester

AutomatedTester Jul 10, 2020
Author Contributor

I don't think we do as we would likely be creating the directory in another test with files in it. I am not sure people would want an empty directory placed into the wheel but have been suprised before by what people do.

# it will be passed in as an input file and we need to
# ignore.
if os.path.isdir(real_filename):
return

This comment has been minimized.

@alexeagle

alexeagle Jul 10, 2020
Collaborator

shouldn't there be a TODO here to actually copy in the directory contents? this PR is just a temp workaround right?
"we need to ignore" seems surprising from a user perspective

This comment has been minimized.

@AutomatedTester

AutomatedTester Oct 27, 2020
Author Contributor

I can't see a case where people will want an empty directory as part of their build. They may have another step that populates it but that case is already taken care of

@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Jul 10, 2020

@alexeagle
Copy link
Collaborator

@alexeagle alexeagle commented Jul 10, 2020

I don't think there should be a collision because the bazel-out folder doesn't have one

@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Jul 10, 2020

The reason I hit this situation is I have a py_library that has a step that generates code and is copied in it's data attribute.

When we get to the py_wheel stage, according to --sandbox_debug the files in the directory have already been added as --input_file and then near the bottom of the list there is a --input_file=OUTS/dir/from/another/target. Since this is after the files, in my case, it feels like I would overwrite them. If you remove this patch from the WORKSPACE and then run https://github.com/SeleniumHQ/selenium/blob/trunk/py/BUILD.bazel#L69-L87 you will be able to recreate my issue.

@pstradomski
Copy link
Collaborator

@pstradomski pstradomski commented Jul 11, 2020

@AutomatedTester AutomatedTester force-pushed the AutomatedTester:check_directory branch from acb10ad to aee2615 Oct 27, 2020
@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Oct 28, 2020

@alexeagle or @thundergolfer could you have a look and see what is left for me to do here?

@aaliddell
Copy link

@aaliddell aaliddell commented Oct 28, 2020

From a user's perspective, silently dropping directories seems like a pretty big footgun; either it should add support for directories or the action should fail completely with a descriptive error.

Also, a directory is a perfectly valid output from a rule using declare_directory. The proper way of handling a directory as a rule input is to use an Args object, which can optionally expand a directory to it's constituent parts using add_all:

Each directory File item is replaced by all Files recursively contained in that directory.

Alternatively, you can pass the directory to the action script and unpack the directory yourself, as was previously suggested. Technically a path collision shouldn't be possible, but perhaps with the addition of prefix stripping/prepending it could happen: perhaps that case is rare enough that we just bail out with an error at that point.

@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Oct 29, 2020

@aaliddell Could you give me an example rule for this?

As mentioned before I am not sure why someone would want an empty directory in a wheel. It's always going to have some file in the directory likely so it will be created anyway.

If we are wanting to accept empty directories, should we be adding a new argument to the wheelmaker because this issue was because a directory was being passed to input_file. We could add a new one called input and input_list

@pstradomski
Copy link
Collaborator

@pstradomski pstradomski commented Oct 29, 2020

@aaliddell
Copy link

@aaliddell aaliddell commented Oct 29, 2020

I agree that a completely empty directory may be valid to skip within a wheel; it's the dropping all directories regardless of contents that's my concern.

For an example of where this sort of thing comes up, I often run into this when dealing with gRPC protoc plugins, where the output of a plugin is non-deterministic from file names, which requires capturing the output as a directory rather than a set of files. As an example, we may capture a folder full of generated .py files that should all be included in the py_library, wheel etc (the python plugin is actually well behaved, but this is just an example).

From Bazel's perspective, it treats the directory as a single File unit, meaning you shouldn't simultaneously have overlapping references to the directory and a file contained within it whilst in the Bazel starlark world. Therefore, if wheelmaker is being passed both a file and a directory that contains that file, then it appears there is a bug in the way the command line args are being constructed.

Using Args.add_all may again be the suitable route, since if a directory is empty it will unpack to no args, so in that case wheelmaker shouldn't receive a directory at all. That way wheelmaker needs no updates, since it is then guarenteed to only receive valid files. Although we'd have to check how a genrule generated folder behaves though, since it may be different from one created by declare_directory.

@aaliddell
Copy link

@aaliddell aaliddell commented Oct 29, 2020

It appears a genrule that creates a folder is actually invalid, since it should enumerate every file it creates: https://docs.bazel.build/versions/master/be/general.html#genrule.outs

Avoid creating symlinks and directories. Bazel doesn't copy over the directory/symlink structure created by genrules and its dependency checking of directories is unsound.

So Args.add_all could be used to handle the declare_directory directories by unpacking them to files and if wheelmaker still receives a non-file then something quite broken has happened (e.g. invalid genrule); in which case bailing out with an error seems sensible.

@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Oct 29, 2020

I agree that a completely empty directory may be valid to skip within a wheel; it's the dropping all directories regardless of contents that's my concern.

This patch only drops if there is --input-file=some/directory but if you then pass in --input-file=some/directory/some_file.py that file, and the directory that was dropped originally, will be added.

For an example of where this sort of thing comes up, I often run into this when dealing with gRPC protoc plugins, where the output of a plugin is non-deterministic from file names, which requires capturing the output as a directory rather than a set of files. As an example, we may capture a folder full of generated .py files that should all be included in the py_library, wheel etc (the python plugin is actually well behaved, but this is just an example).

This patch handles this case. If you see this genrule creates a directory and populates it with .py files and is added to a py_package via a py_library.

Without this patch, I can't generate a wheel because somewhere it passes --input-file=directory/created/by/genrule and then add_file in wheelmaker.py tries to write the directory as a file.

We can (and this is only for wheels)

  • ignore the directory creation but they will be created if there are files within that directory (this PR)

OR

  • Add the potentially empty directory to the wheel and don't care that there may be a random empty directory
@pstradomski
Copy link
Collaborator

@pstradomski pstradomski commented Oct 29, 2020

@aaliddell
Copy link

@aaliddell aaliddell commented Oct 29, 2020

Here's a demo of what I'm getting at, based on this PR's latest commit: aaliddell@7b1f2cd

It declares a directory that contains a single python file called dir_generated/demo.py and also declares a single file file_generated.py; these are added as deps to the wheel. Here's the input file list content:

experimental/examples/wheel/lib/module_with_data.py;experimental/examples/wheel/lib/module_with_data.py
experimental/examples/wheel/lib/simple_module.py;experimental/examples/wheel/lib/simple_module.py
experimental/examples/wheel/dir_generated;bazel-out/k8-fastbuild/bin/experimental/examples/wheel/dir_generated
experimental/examples/wheel/file_generated.py;bazel-out/k8-fastbuild/bin/experimental/examples/wheel/file_generated.py

As you can see, the file list contains the generated directory and the generated file, but demo.py itself within the dir is not there. So 'ignore the directory creation but they will be created if there are files within that directory' does not hold under the case of a declared directory containing files. As a result of the dirs being dropped, the file demo.py is silently not in the generated wheel file, but the file file_generated.py is. This is what I meant by the footgun.

Prior to the commits in the PR, running the same patch causes the error you mentioned in the initial PR comment, which is at least a more visible failure than silently missing files in the wheel.

Without this patch, I can't generate a wheel because somewhere it passes --input-file=directory/created/by/genrule and then add_file in wheelmaker.py tries to write the directory as a file.

I think it would be useful to establish why both the file and the directory are being passed to wheelmaker. That seems like the source of this issue for your use case. If the files are being passed, the directory containing them shouldn't be in the input file list in the first place.

If you see this genrule creates a directory

As mentioned above, a genrule that creates a dir without enumerating the contained files is not recommended behaviour, which might be the source of you seeing the overlapping file and dir in the input file list.

After re-reading PEP 427, I think we should be safe to just ignore all directories when build a wheel.

The dirs themselves shouldn't appear in the RECORD file, but the contents of these dirs should be enumerated there and copied to the wheel.

If in a dependency tree a directory is used as out this is passed as
input_file to wheelmaker which then generates an unhandled error as
it tries to write the directory as a file.
@AutomatedTester AutomatedTester force-pushed the AutomatedTester:check_directory branch from 7d696c8 to 1bb9080 Dec 16, 2020
@AutomatedTester
Copy link
Contributor Author

@AutomatedTester AutomatedTester commented Dec 16, 2020

I have changed this patch to now walk the directory it is given and add those files. It should do this recursively. I have tested this out on the Selenium Project and it works

@AutomatedTester AutomatedTester changed the title Check if input_file is directory and ignore Check if input_file is directory and then add all files within that directory Dec 16, 2020
@AutomatedTester AutomatedTester requested a review from alexeagle Dec 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.