filter-process: Ignore file size in working copy when cleaning#6011

Open
cynix wants to merge 5 commits into git-lfs:main from cynix:fix_clean

Conversation


@cynix cynix commented Mar 12, 2025

Sometimes when Git sends us the clean command, the target file in the
working copy does not have the correct size (e.g. it's empty). Since
Git will feed us the blob to clean via stdin, we have to read it all
anyway, especially if we're invoked via filter-process where leaving
any data unread will cause our Pktline's reader to not be positioned
correctly for parsing the next command. Don't trust the file size on
disk when determining how much to copy.

Resolves #4974 (specifically, the secondary problem reported in that issue).

@cynix cynix requested a review from a team as a code owner March 12, 2025 13:21
@cynix cynix changed the title from "filter-process: Ignore file size in working copy when cleaning." to "filter-process: Ignore file size in working copy when cleaning. Fixes #4974" Mar 12, 2025
cynix added 2 commits March 13, 2025 00:58
//
// If we simply truncate the file to 0 bytes in situ, Git
// may not be able to finish reading its full contents, and
// might send us fewer packets than actually needed.
@cynix cynix (Author) Mar 12, 2025
I don't know how to reproduce what Git is doing when it invokes our filter-process and sends us clean commands with incorrect files in the working copy, so I'm using this util to simulate it. The GIT_TRACE_PACKET output looks sensible and matches what I see when the issue actually occurs. Let me know if you figure out a way to reproduce this natively.

@chrisd8088 (Member)

Hey, thanks very much for this PR and for the investigation into the problems reported in #4974, which have proven challenging to replicate! We really appreciate this kind of contribution from our community of users.

I think this PR deserves a thoughtful review and so I hope to devote a solid chunk of time to it as soon as I can. My apologies in advance if that takes a week or two; I can assure you it's not because we're not interested in this potential fix for a longstanding issue!

@chrisd8088 chrisd8088 changed the title from "filter-process: Ignore file size in working copy when cleaning. Fixes #4974" to "filter-process: Ignore file size in working copy when cleaning" Mar 22, 2025
@chrisd8088 chrisd8088 added the bug label Mar 22, 2025
@chrisd8088 (Member)

Thank you again for this PR and for your patience, @cynix!

I've started working on a review, and will get that done as soon as possible so you can see my suggested revisions, which are for the most part quite minor, at least as regards the work in this PR at the moment.

However, there are some ancillary changes I think we'll need to make as a consequence of this PR, and so I may try to implement those in another PR which we could merge ahead of this one, or ask if I can contribute some commits to this PR (for which I'd need an invitation to collaborate on your repository).

My apologies for the delay in replying, and thanks again very much for this bug fix!


cynix commented Mar 28, 2025

@chrisd8088 Thanks for the update. I've invited you as a collaborator, feel free to add commits as you see fit. I'm guessing you're thinking of finding a way to still report progress to the user even though we can't trust the file size? (e.g. use the file size on disk as a hint for progress display only, not for determining how much to copy, or something like that) :D

@chrisd8088 (Member)

I'm guessing you're thinking of finding a way to still report progress to the user even though we can't trust the file size?

Yes, that's one of the issues, indeed. At the moment, the file size in the progress log (when enabled with GIT_LFS_PROGRESS) will have -1 as the denominator in the "number of bytes written so far" field, so I worked up some changes to provide an alternative (along with a bunch of tests). But then I noticed a more subtle issue: when the total is -1, we inadvertently defeat the 200ms throttling of writes to the progress log, so we write to the log on every callback.

Since noticing that I've been pulled off in another direction, but will return to this topic shortly, I very much hope!


@chrisd8088 (Member)

Hey, I just wanted to note that although I haven't had time to circle back to this PR and the outstanding progress log issues, I will do so in time because I think this is a valuable change and I'd like to get it merged.

Thanks again for your contribution and your patience!

@chrisd8088 (Member)

It's taken me much longer than I expected to circle back to this PR, but I have not forgotten about it!

For my own reference, I wanted to note that the progress log (when enabled with GIT_LFS_PROGRESS) will already have a -1 in the denominator of the bytes-received field if a server replies to an object transfer request with a Content-Encoding: gzip header and a gzipped response body.

The Go HTTP client automatically inserts an Accept-Encoding: gzip request header, as described in #6193, so a server is within its rights to reply with a gzipped response. When the net/http package from the Go standard library has inserted the Accept-Encoding: gzip request header on its own, and the server returns a gzipped response with a Content-Encoding: gzip header, the ContentLength value of the Response structure is set to -1, and we then just pass that on to our progress meter.

Under these conditions, we'll also bypass our throttling of progress log messages, as I noted above in #6011 (comment).

Enabling a progress log with GIT_LFS_PROGRESS="$TRASHDIR/progress.log" in our storage upload with compression test is sufficient to demonstrate this behaviour.
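The gzip behaviour described above can be reproduced against the Go standard library alone, without git-lfs, using an httptest server that gzips its response body:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchGzipped starts a throwaway server that gzips its response body,
// performs a GET against it, and reports what net/http hands back.
func fetchGzipped() (contentLength int64, uncompressed bool, body string) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		gz.Write([]byte("object data"))
		gz.Close()
	}))
	defer srv.Close()

	// net/http inserts Accept-Encoding: gzip on its own, so it also
	// transparently decodes the gzipped response, sets Uncompressed to
	// true, and sets ContentLength to -1 (decoded length unknown).
	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	return resp.ContentLength, resp.Uncompressed, string(b)
}

func main() {
	fmt.Println(fetchGzipped()) // -1 true object data
}
```

That -1 is the value that then flows into the progress meter as the denominator.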


cynix commented Feb 9, 2026

It's taken me much longer than I expected to circle back to this PR, but I have not forgotten about it!

Thanks for keeping this on your radar. Keen to see it merged eventually :)

chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Mar 2, 2026
In prior commits in this PR we introduced a new configuration option
named "lfs.transfer.httpDownloadEncoding", which may be set to either
of the values "gzip" or "zstd", with a default value of "gzip".

When our new option is not defined or is set to "gzip", the Git LFS
client simply follows its legacy behaviour when making object transfer
download requests over HTTP.  Specifically, the client should allow the
"net/http" package in the Go standard library to insert an
"Accept-Encoding: gzip" header into the requests, and if the server
chooses to respond with gzip-encoded data, the client should expect
the "net/http" package to automatically decode the body of the response.

By contrast, when our new option is set to the value "zstd", the
Git LFS client will explicitly add an "Accept-Encoding: zstd" header
when it makes an object transfer download HTTP request.  If the server
chooses to encode the body of its response, it should include a
"Content-Encoding: zstd" header which the client will recognize and
then attempt to decode the response body using the zstd algorithm.

In either case, the server may also send a Content-Length header, in
which case the value of that header will be the length of the encoded
content and not the length of the original, uncompressed data.

For this reason, when the "net/http" package handles gzip-encoded
response data, it returns a Response structure whose ContentLength
field is set to -1.  This informs any reader of the data stream in
the structure's Body field that the length of the stream is unknown:

  https://github.com/golang/go/blob/go1.25.7/src/net/http/response.go#L90-L92
  https://github.com/golang/go/blob/go1.25.7/src/net/http/transport.go#L2385

The same issue pertains when we handle a zstd-encoded response to an
object download request, except that the "net/http" package is not
responsible for decoding the data in the response body.  Instead, we
decode this data in the download() method of the basicDownloadAdapter
structure in our "tq" package.

Later in its operation, the download() method passes the Response
structure's ContentLength field to the CopyWithCallback() function
from our "tools" package.  Thus when handling a response with a
Content-Length header and a zstd-encoded body, we currently pass an
inaccurate value to the CopyWithCallback() function, one which
represents the length of the encoded object data and not its
uncompressed actual size.

We therefore update the download() method to set the ContentLength field
to -1 after decoding a zstd-encoded response body, to match the
behaviour of the "net/http" package when it decodes a gzip-encoded
response body.

One consequence of this change will be visible if a user has set the
GIT_LFS_PROGRESS environment variable to an absolute file path.
In this case, the progress log messages we write to that path will
contain a misleading "-1" value as the denominator of the fraction
which indicates how much object data has been received so far.

However, this is a known issue which already occurs when a server
returns a gzip-encoded response body, and one which we expect to
address comprehensively in a future PR.  See, for reference:

  git-lfs#6011 (comment)
Successfully merging this pull request may close these issues: Help triaging git-lfs checkout failure (#4974).