chunked: store cache as binary and use a bloom filter#1870
chunked: store cache as binary and use a bloom filter#1870openshift-merge-bot[bot] merged 8 commits intocontainers:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
f4f29dd to
ba6657b
Compare
|
@kolyshkin @mtrmac @rhatdan some more improvements to the cache file |
fb5a6e7 to
de1dff9
Compare
mtrmac
left a comment
There was a problem hiding this comment.
Thanks so much for splitting this into smaller commits. That made the review very enjoyable.
| } | ||
|
|
||
| func getBinaryDigest(stringDigest string) ([]byte, error) { | ||
| d, err := digest.Parse(stringDigest) |
There was a problem hiding this comment.
Consider having this function accept a digest.Digest instead, and pushing the digest.Parse to callers; on various code paths that compute the digest, that can avoid a digest.Validate (regex evaluation).
OTOH then the other code paths must call Parse or Validate if dealing with untrusted data.
pkg/chunked/cache_linux.go
Outdated
| if err != nil { | ||
| return nil, err | ||
| } | ||
| digest := append([]byte(d.Algorithm()+":"), digestBytes...) |
There was a problem hiding this comment.
(More space could be saved by using a single byte to index into a table of algorithm names, or something like that, at the cost of even more complexity. Possibly not worth it, and definitely not blocking this PR.)
| bloomFilter: bloomFilter, | ||
| digestLen: int(digestLen), | ||
| fnames: fnames, | ||
| fnamesLen: int(fnamesLen), |
There was a problem hiding this comment.
(Nit: I’d aesthetically prefer if the type declaration, and the two constructors, were all listing the fields in the same order — and if that order were somehow consistent. Here we have (data, len) for file names, but (len, data) for tags, for example.)
pkg/chunked/cache_linux_test.go
Outdated
| t.Error("wrong digest found") | ||
| } | ||
| expectedLocation := generateFileLocation(r.Name, uint64(r.ChunkOffset), uint64(r.ChunkSize)) | ||
| expectedLocation, err := generateFileLocation(0, uint64(r.ChunkOffset), uint64(r.ChunkSize)) |
There was a problem hiding this comment.
Non-blocking: Testing this with at least two distinct paths, to truly exercise the indexing logic, would be nice.
de1dff9 to
9f2b6a3
Compare
|
the PR is ready for review |
|
@mtrmac needs another review. |
mtrmac
left a comment
There was a problem hiding this comment.
I’m sorry about the delayed response.
9f2b6a3 to
d986219
Compare
|
I've fixed your comments, except #1870 (comment). What would you like me to do here? |
| // are stored. $DIGEST has length digestLen stored in the cache file file header. | ||
| func generateTag(digest string, offset, len uint64) string { | ||
| return fmt.Sprintf("%s%.20d@%.20d", digest, offset, len) | ||
| func appendTag(digest []byte, offset, len uint64) ([]byte, error) { |
There was a problem hiding this comment.
(Absolutely non-blocking: the error value is always nil now.)
https://github.com/containers/storage/pull/1870/files#r1557856169 , or perhaps I’m missing something. |
|
@giuseppe This is waiting on you now? |
kolyshkin
left a comment
There was a problem hiding this comment.
Left some very minor comments. Overall it looks like a good case for protobuf, or is that an overkill?
It looks like appendBinaryDigest's first argument is always []byte{}. First, it could be nil, second, if it's not ever used maybe drop it (and rename the function to getBinaryDigest).
pkg/chunked/cache_linux.go
Outdated
| var vdata bytes.Buffer | ||
| var tagsBuffer bytes.Buffer |
There was a problem hiding this comment.
nit:
var vdata, tagsBuffer bytes.Buffer| nElements := len(cacheFile.tags) / cacheFile.tagLen | ||
|
|
||
| i := sort.Search(nElements, func(i int) bool { | ||
| d := byteSliceAsString(cacheFile.tags[i*cacheFile.tagLen : i*cacheFile.tagLen+cacheFile.digestLen]) |
There was a problem hiding this comment.
nit: with this change, func byteSliceAsString (defined above) can also be removed.
pkg/chunked/cache_linux.go
Outdated
| func generateTag(digest string, offset, len uint64) string { | ||
| return fmt.Sprintf("%s%.20d@%.20d", digest, offset, len) | ||
| func generateTag(digest []byte, offset, len uint64) []byte { | ||
| tag := append(digest[:], []byte(fmt.Sprintf("%.20d@%.20d", offset, len))...) |
There was a problem hiding this comment.
Can you please explain the need for [:] after digest here? I am staring at it and can't see why it's needed.
There was a problem hiding this comment.
Also, []byte() conversion is redundant here, in Go you can append a string... to a []byte slice.
There was a problem hiding this comment.
this code is replaced by a next commit. I'll squash the two patches so it will just disappear
| sort.Slice(tags, func(i, j int) bool { | ||
| return bytes.Compare(tags[i], tags[j]) == -1 | ||
| }) |
There was a problem hiding this comment.
Easier to use slices.SortFunc(tags, bytes.Compare) here
There was a problem hiding this comment.
had to revert this change as we are using go 1.20
There was a problem hiding this comment.
(BTW we are now agreed on updating to Go 1.21 (around containers/skopeo#2297 ). That would probably be a separate PR.)
pkg/chunked/cache_linux.go
Outdated
| var tags [][]byte | ||
| for _, k := range toc { | ||
| if k.Digest != "" { | ||
| digest, err := appendBinaryDigest([]byte{}, k.Digest) |
There was a problem hiding this comment.
nit: you can use nil instead of []byte{} here -- appending to nil slice is fine.
pkg/chunked/bloom_filter_test.go
Outdated
| var tagsBuffer bytes.Buffer | ||
| var vdata bytes.Buffer | ||
| var fnames bytes.Buffer |
There was a problem hiding this comment.
nit:
var tagsBuffer, vdata, fnames bytes.Buffer(unless you don't like that style; to me it's less repetition)
pkg/chunked/cache_linux.go
Outdated
| nameBytes := []byte(name) | ||
|
|
||
| if err := binary.Write(&fnames, binary.LittleEndian, uint32(len(nameBytes))); err != nil { | ||
| return 0, err | ||
| } | ||
| if _, err := fnames.Write(nameBytes); err != nil { | ||
| return 0, err | ||
| } |
There was a problem hiding this comment.
You can avoid []byte conversion here, using name directly (as in len(name) and fnames.WriteString(name))
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
d986219 to
0820a2a
Compare
|
thanks @mtrmac and @kolyshkin. I've addressed your last comments and pushed a new version |
0820a2a to
6d8fc8e
Compare
use the binary representation for a given digest, it helps reducing the file size by ~25%. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it helps reducing the cache file size by ~25%. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
use a bloom filter to speed up lookup of digests in a cache file. The biggest advantage is that it reduces page faults with the mmap'ed cache file. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
so that the same file path is stored only once in the cache file.
After this change, the cache file measured on the fedora:{38,39,40}
images is in average ~6% smaller.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it reduces the cache file size by ~3%. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
6d8fc8e to
065a2f3
Compare
|
/lgtm Thanks! |
The bloom filter itself is useful to reduce page faults with the mmap'ed cache files, as it reduces lookups.
Storing the file as a binary instead reduces the file size considerably, with the
quay.io/giuseppe/zstd-chunked:fedora-{38,39,40}{,-updated}images I see:before:
after:
so it is ~50% size reduction