filter-process: Ignore file size in working copy when cleaning #6011
cynix wants to merge 5 commits into git-lfs:main
Conversation
Sometimes when Git sends us the clean command, the target file in the working copy does not have the correct size (e.g. it's empty). Since Git will feed us the blob to clean via stdin, we have to read it all anyway, especially if we're invoked via filter-process where leaving any data unread will cause our Pktline's reader to not be positioned correctly for parsing the next command. Don't trust the file size on disk when determining how much to copy.
// If we simply truncate the file to 0 bytes in situ, Git
// may not be able to finish reading its full contents, and
// might send us fewer packets than actually needed.
I don't know how to actually reproduce what Git is doing when it invokes our filter-process and sends us clean commands with incorrect files in the working copy, so I'm using this util to simulate it. The GIT_TRACE_PACKET output looks sensible and matches what I see when the issue actually occurs. Let me know if you figure out a way to reproduce this natively.
Hey, thanks very much for this PR and for the investigation into the problems reported in #4974, which have proven challenging to replicate! We really appreciate this kind of contribution from our community of users. I think this PR deserves a thoughtful review and so I hope to devote a solid chunk of time to it as soon as I can. My apologies in advance if that takes a week or two; I can assure you it's not because we're not interested in this potential fix for a longstanding issue!
Thank you again for this PR and for your patience, @cynix! I've started working on a review, and will get that done as soon as possible so you can see my suggested revisions, which are for the most part quite minor, at least as regards the work in this PR at the moment. However, there are some ancillary changes I think we'll need to make as a consequence of this PR, and so I may try to implement those in another PR which we could merge ahead of this one, or ask if I can contribute some commits to this PR (for which I'd need an invitation to collaborate on your repository). My apologies for the delay in replying, and thanks again very much for this bug fix!
@chrisd8088 Thanks for the update. I've invited you as a collaborator, feel free to add commits as you see fit. I'm guessing you're thinking of finding a way to still report progress to the user even though we can't trust the file size? (e.g. use the file size on disk as a hint for progress display only, not for determining how much to copy, or something like that) :D
Yes, that's one of the issues, indeed. At the moment, the file size in the progress log (when enabled with GIT_LFS_PROGRESS) comes from the file on disk, which this PR shows we can't trust. Since noticing that I've been pulled off in another direction, but will return to this topic shortly, I very much hope!
This comment was marked as spam.
Hey, I just wanted to note that although I haven't had time to circle back to this PR and the outstanding progress log issues, I will do so in time because I think this is a valuable change and I'd like to get it merged. Thanks again for your contribution and your patience!
It's taken me much longer than I expected to circle back to this PR, but I have not forgotten about it!

For my own reference, I wanted to note that the progress log (when enabled with GIT_LFS_PROGRESS) depends on the ContentLength reported for each response. The Go HTTP client automatically inserts an Accept-Encoding: gzip header into download requests, and when a server responds with gzip-encoded data, it decodes the body itself and reports a ContentLength of -1. Under these conditions, we'll also bypass our throttling of progress log messages, as I noted above in #6011 (comment). Enabling a progress log with GIT_LFS_PROGRESS exposes these issues.
Thanks for keeping this on your radar. Keen to see it merged eventually :) |
In prior commits in this PR we introduced a new configuration option named "lfs.transfer.httpDownloadEncoding", which may be set to either of the values "gzip" or "zstd", with a default value of "gzip".

When our new option is not defined or is set to "gzip", the Git LFS client simply follows its legacy behaviour when making object transfer download requests over HTTP. Specifically, the client should allow the "net/http" package in the Go standard library to insert an "Accept-Encoding: gzip" header into the requests, and if the server chooses to respond with gzip-encoded data, the client should expect the "net/http" package to automatically decode the body of the response.

By contrast, when our new option is set to the value "zstd", the Git LFS client will explicitly add an "Accept-Encoding: zstd" header when it makes an object transfer download HTTP request. If the server chooses to encode the body of its response, it should include a "Content-Encoding: zstd" header which the client will recognize and then attempt to decode the response body using the zstd algorithm.

In either case, the server may also send a Content-Length header, in which case the value of that header will be the length of the encoded content and not the length of the original, uncompressed data. For this reason, when the "net/http" package handles gzip-encoded response data, it returns a Response structure whose ContentLength field is set to -1. This informs any reader of the data stream in the structure's Body field that the length of the stream is unknown:

https://github.com/golang/go/blob/go1.25.7/src/net/http/response.go#L90-L92
https://github.com/golang/go/blob/go1.25.7/src/net/http/transport.go#L2385

The same issue pertains when we handle a zstd-encoded response to an object download request, except that the "net/http" package is not responsible for decoding the data in the response body.
Instead, we decode this data in the download() method of the basicDownloadAdapter structure in our "tq" package.

Later in its operation, the download() method passes the Response structure's ContentLength field to the CopyWithCallback() function from our "tools" package. Thus when handling a response with a Content-Length header and a zstd-encoded body, we currently pass an inaccurate value to the CopyWithCallback() function, one which represents the length of the encoded object data and not its actual uncompressed size.

We therefore update the download() method to set the ContentLength field to -1 after decoding a zstd-encoded response body, to match the behaviour of the "net/http" package when it decodes a gzip-encoded response body.

One consequence of this change will be visible if a user has set the GIT_LFS_PROGRESS environment variable to an absolute file path. In this case, the progress log messages we write to that path will contain a misleading "-1" value as the denominator of the fraction which indicates how much object data has been received so far. However, this is a known issue which already occurs when a server returns a gzip-encoded response body, and one which we expect to address comprehensively in a future PR.

See, for reference: git-lfs#6011 (comment)
Sometimes when Git sends us the clean command, the target file in the
working copy does not have the correct size (e.g. it's empty). Since
Git will feed us the blob to clean via stdin, we have to read it all
anyway, especially if we're invoked via filter-process where leaving
any data unread will cause our Pktline's reader to not be positioned
correctly for parsing the next command. Don't trust the file size on
disk when determining how much to copy.
Resolves #4974 (specifically, the secondary problem reported in that issue).