Skip to content

fix(py_wheel): produce deterministic wheel files#1453

Merged
aignas merged 3 commits intobazel-contrib:mainfrom
dizzy57:py_wheel_deterministic
Oct 17, 2023
Merged

fix(py_wheel): produce deterministic wheel files#1453
aignas merged 3 commits intobazel-contrib:mainfrom
dizzy57:py_wheel_deterministic

Conversation

@dizzy57
Copy link
Copy Markdown
Contributor

@dizzy57 dizzy57 commented Oct 3, 2023

Current implementation does not produce deterministic output because:

  • ZipFile.writestr() leaks current date and time
  • ZipFile.write() leaks the source file's mtime and mode bits (permissions) into the resulting zip archive.

By manually creating our own ZipInfo objects we can explicitly set date and time fields to Jan 1, 1980, 00:00 (minimum value allowed by the zip file standard), and ensure that other file attributes are uniform across all entries in a zip file.

@dizzy57 dizzy57 requested a review from rickeylev as a code owner October 3, 2023 14:25
@google-cla
Copy link
Copy Markdown

google-cla bot commented Oct 3, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@dizzy57 dizzy57 force-pushed the py_wheel_deterministic branch 4 times, most recently from fcea928 to 7524345 Compare October 5, 2023 08:31
@rickeylev
Copy link
Copy Markdown
Collaborator

I'm not a wheel expert, but dropping the permissions might be problematic? My basic thinking is if a +x bit gets dropped, that might be a problem. Can't wheel contains arbitrary data, and someone might intend for a data file to have the executable bit?

Maybe mask such that we keep the executable bit if set, but normalize to read only?

@jvolkman
Copy link
Copy Markdown
Contributor

jvolkman commented Oct 6, 2023

If it helps, I've got code that canonicalizes wheels here.

Another thing to watch out for is ZipInfo.create_system which differs based on whether you're running on windows or linux/unix.

@rickeylev
Copy link
Copy Markdown
Collaborator

I talked to a Bazel dev about how permissions are handled

  1. Bazel doesn't make any guarantees about file permissions
  2. It ignores permissions when considering if the file changed
  3. Action outputs are set to rxrxrx
  4. Source files don't have permission changed
  5. They pointed out symlinks might be weird (no one was sure off-hand how symlinks + permissions would behave)

Pretty much any of those are a good reason for us to not preserve permissions. I find the second and third one particularly compelling.

So I think its really just a question of: do we set rx or rwx?

My vote is for rwx; it seems what is commonly done, and it "preserves" the writability of any files.

@rickeylev
Copy link
Copy Markdown
Collaborator

(Ignas posted this, but it only shows up in my email not on the pr for some reason)

Whilst playing with the code, I have realized that the proposed solution does not fully solve determinism within wheels. Depending on the order that the files get passed to the wheelmaker we may end up with a different zip from time to time.

nit: I think we should follow the convention of sorting the RECORD file entries in a particular way - first all of the files, then all of the dist-info files and then the RECORD file last in order to ensure that it is somewhat consistent with what repairwheel is doing.

Is the order coming from depsets? Or is it coming from e.g. a Python dict/set re-ordering things?

If it's coming from a depset, then, insofar as Bazel is concerned, this is fine. Depsets (and input lists) have ordering and it's reflected in how bazel computes what is affected/changed (e.g. the argv of the action's command line). I don't think that can be avoided. The zip itself is still deterministic, just not "predictable".

If it's coming from Python doing a set/dict call somewhere, then we should fix that -- that's a source of non-determinism Bazel isn't aware of. Having that in a separate PR is fine; this one has been out for awhile.

But yes, I agree that sorting the zip entries is a good idea regardless.

@dizzy57
Copy link
Copy Markdown
Contributor Author

dizzy57 commented Oct 11, 2023

Is the order coming from depsets? Or is it coming from e.g. a Python dict/set re-ordering things?

We're sorting file lists in main() here and here.
The files end up getting added to the zip archive in deterministic order:

  • sorted files from all_files
  • WHEEL
  • METADATA
  • optional entry_points.txt
  • sorted files from extra_distinfo_file
  • RECORD

@aignas
Copy link
Copy Markdown
Collaborator

aignas commented Oct 11, 2023

Yeah, my previous comment (that I have deleted) was all wrong, I think all is good with the sorting of the contents within the wheelmaker, so the only things that I would love to see in this PR would be:

  • A test asserting that the sha256 of the wheel remains the same.
  • The rwx permission setting as @philsc has suggested.
  • (And as @jvolkman suggested below) .create_system = 3

@aignas
Copy link
Copy Markdown
Collaborator

aignas commented Oct 12, 2023

I have now realized what had confused me in the beginning - when writing the RECORD file we change the order of the entries in this line and given that the order of files going into the zip file is deterministic and the order of the slice is also deterministic, I would vote for removing the entries.sort() on that line because it may be unnecessary. I am not sure if it is harmless. Maybe someone else with more knowledge into how things should be can comment.

Copy link
Copy Markdown
Collaborator

@groodt groodt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Current implementation does not produce deterministic output because:
- `ZipFile.writestr()` leaks current date and time
- `ZipFile.write()` leaks the source file's mtime and mode bits (permissions) into the resulting zip archive.

By manually creating our own `ZipInfo` objects we can explicitly set date and time fields to `Jan 1, 1980, 00:00` (minimum value allowed by the zip file standard), and ensure that other file attributes are uniform across all entries in a zip file.
@dizzy57 dizzy57 force-pushed the py_wheel_deterministic branch from cbf36ff to 66be710 Compare October 16, 2023 23:03
@dizzy57
Copy link
Copy Markdown
Contributor Author

dizzy57 commented Oct 16, 2023

Rebased on top of main.

Copy link
Copy Markdown
Collaborator

@aignas aignas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@aignas aignas enabled auto-merge October 17, 2023 04:55
@aignas aignas added this pull request to the merge queue Oct 17, 2023
Merged via the queue into bazel-contrib:main with commit 87a9cf1 Oct 17, 2023
@dizzy57 dizzy57 deleted the py_wheel_deterministic branch October 17, 2023 09:15
github-merge-queue bot pushed a commit that referenced this pull request Jan 31, 2024
#1453 added logic to manually set zipinfo for each file in the wheel
archive. When wheel archives generated by `_Whlfile` are installed by
the `installer` module (or `pip`), the file mode is inspected as such:

* `installer`:
https://github.com/pypa/installer/blob/fcc0d6f14f99974316c2e490cead07a9a0b7a6ac/src/installer/sources.py#L318-L321
* `pip`:
https://github.com/pypa/pip/blob/0f21fb92/src/pip/_internal/utils/unpacking.py#L96-L100

as you can tell, if the regular file bit is not set, the installer
thinks that the file being installed is not an executable and therefore
the executable bit will not be preserved when the file from wheel is
extracted onto the host filesystem.

Since all files being archived into Whlfile are regular files anyway, we
set `S_IFREG` on the file mode for all files in the zip archive.

Fixes #1711

---------

Signed-off-by: Thomas Lam <thomaslam@canva.com>
Co-authored-by: Ignas Anikevicius <240938+aignas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants