Fix hashes for empty files #4620

Merged
kzantow merged 2 commits into anchore:main from ppalucha:main
Feb 24, 2026

Conversation

@ppalucha
Contributor

Description

Also calculate digest hashes for empty files. Even an empty file should have a proper checksum.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Issue references

Fixes #2307

@ppalucha
Contributor Author

Testing on an empty file, as in #2307 (comment):

{
      "fileName": "boto3/resources/__init__.py",
      "SPDXID": "SPDXRef-File-boto3-resources---init--.py-a9eaa292d3d663e8",
      "fileTypes": [
        "OTHER"
      ],
      "checksums": [
        {
          "algorithm": "SHA1",
          "checksumValue": "da39a3ee5e6b4b0d3255bfef95601890afd80709"
        },
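
The `checksumValue` above is the well-known SHA-1 digest of zero bytes of input. A minimal sketch to confirm this with Go's standard library (the helper name `emptySHA1` is made up for illustration and is not part of Syft):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// emptySHA1 is a hypothetical helper: it returns the hex-encoded
// SHA-1 digest of zero bytes of input.
func emptySHA1() string {
	sum := sha1.Sum(nil) // hashing a nil slice is hashing empty input
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(emptySHA1()) // prints da39a3ee5e6b4b0d3255bfef95601890afd80709
}
```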

Signed-off-by: Paweł Pałucha <pawel.palucha@chainguard.dev>
@kzantow
Contributor

kzantow commented Feb 13, 2026

It might be good to do some benchmark on hashing vs returning constants. If memory serves, we actually end up hashing quite a few empty files in some scenarios; it might be best to just have some constants

@ppalucha ppalucha changed the title from "Main" to "Fix hashed for empty files" Feb 16, 2026
@ppalucha ppalucha changed the title from "Fix hashed for empty files" to "Fix hashes for empty files" Feb 16, 2026
@ppalucha
Contributor Author

It might be good to do some benchmark on hashing vs returning constants. If memory serves, we actually end up hashing quite a few empty files in some scenarios; it might be best to just have some constants

These are the results of benchmarking as prepared by Claude:

### Overall Performance (All Hashes Together)
- **Speed**: Constants are **~5.8x faster** (313 ns vs 1,819 ns)
- **Memory**: Constants use **~8.8x less memory** (280 B vs 2,474 B)
- **Allocations**: Constants make **~3.8x fewer allocations** (12 vs 46)

### Individual Hash Algorithm Results

| Algorithm | Constant (ns/op) | Calculate (ns/op) | Speedup | Constant Memory | Calculate Memory | Memory Reduction |
|-----------|------------------|-------------------|---------|-----------------|------------------|------------------|
| MD5       | 35.29            | 347.7             | 9.9x    | 40 B            | 408 B            | 10.2x            |
| SHA1      | 56.25            | 336.1             | 6.0x    | 48 B            | 456 B            | 9.5x             |
| SHA224    | 57.74            | 339.1             | 5.9x    | 48 B            | 496 B            | 10.3x            |
| SHA256    | 57.88            | 348.4             | 6.0x    | 48 B            | 496 B            | 10.3x            |
| SHA384    | 59.87            | 488.3             | 8.2x    | 48 B            | 640 B            | 13.3x            |
| SHA512    | 61.01            | 477.2             | 7.8x    | 48 B            | 688 B            | 14.3x            |

So I guess I will propose a version with constants for the empty hashes.
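
For readers who want to reproduce a comparison of this kind, here is a rough sketch of how a constant lookup could be benchmarked against computing the digest, using only the standard library. It is not the benchmark that produced the numbers above; the function names are hypothetical, and results will differ by machine and Go version.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"testing"
)

// emptySHA1Const is the SHA-1 digest of empty input, stored as a constant.
const emptySHA1Const = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

// constantDigest (hypothetical) returns the precomputed digest.
func constantDigest() string { return emptySHA1Const }

// computedDigest (hypothetical) creates a hasher, feeds it nothing,
// and hex-encodes the result.
func computedDigest() string {
	h := sha1.New()
	return hex.EncodeToString(h.Sum(nil))
}

// sink keeps the compiler from optimizing the calls away.
var sink string

func main() {
	constant := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = constantDigest()
		}
	})
	computed := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = computedDigest()
		}
	})
	fmt.Println("constant:", constant.String(), constant.MemString())
	fmt.Println("computed:", computed.String(), computed.MemString())
}
```

`testing.Benchmark` lets the comparison run as a plain program; inside a Go module the same two closures could instead be written as `BenchmarkXxx` functions and run with `go test -bench`.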

Signed-off-by: Paweł Pałucha <pawel.palucha@chainguard.dev>
@ppalucha
Contributor Author

Updated to use constants for the hashes of empty files.
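
As an illustration of the constants approach (not the code merged in this PR), a table of empty-input digests could be keyed by algorithm and checked against the standard library. The map name `emptyDigests` and the helper `computeEmpty` are assumptions for this sketch, and only three algorithms are shown.

```go
package main

import (
	"crypto"
	_ "crypto/md5"    // registers crypto.MD5
	_ "crypto/sha1"   // registers crypto.SHA1
	_ "crypto/sha256" // registers crypto.SHA256
	"encoding/hex"
	"fmt"
)

// emptyDigests is a hypothetical lookup table holding the hex digest
// of zero bytes of input for each supported algorithm.
var emptyDigests = map[crypto.Hash]string{
	crypto.MD5:    "d41d8cd98f00b204e9800998ecf8427e",
	crypto.SHA1:   "da39a3ee5e6b4b0d3255bfef95601890afd80709",
	crypto.SHA256: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

// computeEmpty hashes empty input with the given algorithm so the
// constants can be verified rather than trusted.
func computeEmpty(h crypto.Hash) string {
	return hex.EncodeToString(h.New().Sum(nil))
}

func main() {
	for h, want := range emptyDigests {
		fmt.Printf("%v matches computed digest: %t\n", h, computeEmpty(h) == want)
	}
}
```

A unit test that compares each constant with the computed value, as `main` does here, guards against a mistyped digest.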

@rezmoss
Contributor

rezmoss commented Feb 16, 2026

What's the point of hashing an empty file? Does it serve a purpose?

@ppalucha
Contributor Author

What's the point of hashing an empty file? Does it serve a purpose?

The same purpose as hashing any other file - verifying if the content matches.

@ppalucha
Contributor Author

What's the point of hashing an empty file? Does it serve a purpose?

The same purpose as hashing any other file - verifying if the content matches.

Currently Syft is producing incorrect checksums for empty files.

@ppalucha
Contributor Author

What is the way forward for this?
This is an actual bug that is affecting our users. I can do a search/replace on Syft's output as a workaround, but I think it makes more sense to fix it upstream.

Contributor

@kzantow kzantow left a comment

LGTM -- I was hoping to find a simple way to detect zero-sized input earlier, but I had a look and there are some *os.File inputs and other inputs that don't have length functions, so there wasn't an obvious way to optimize this further. This is still an improvement over hashing the zero-size files, since it prevents a bunch of duplicate strings from being created, etc. Thanks for the contribution @ppalucha!

@kzantow kzantow merged commit db76d85 into anchore:main Feb 24, 2026
10 checks passed
Development

Successfully merging this pull request may close these issues.

Checksum is 0 for spdx files
