chore: centralize temp files and prefer streaming IO #4668

Merged
willmurphyscode merged 5 commits into main from chore-no-io-readall
Mar 18, 2026

willmurphyscode commented Mar 12, 2026

Description

Catalogers that create temp files ad-hoc can easily forget cleanup, leaking files on disk. Similarly, io.ReadAll is convenient but risks OOM on large or malicious inputs.

Introduce internal/tmpdir to manage all cataloger temp storage under a single root directory with automatic cleanup. Prefer streaming parsers (bufio.Scanner, json/yaml.NewDecoder, io.LimitReader) over buffering entire inputs into memory. Add ruleguard rules to enforce both practices going forward.

The goal here is to make it less likely that Syft suffers OOMs or leaks temp files, without making callers or clients do anything differently.

There are a couple of places where a streaming parser has been added to replace a previous regex over a whole file, but the intended behavior remains the same.

There are also a couple of limit readers: 500 MB when reading a zip from the Go proxy cache, and 50 MB when reading GGUF headers.

Type of change

  • Chore (improve the developer experience, fix a test flake, etc, without changing the visible behavior of Syft)
  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Issue references

willmurphyscode marked this pull request as ready for review March 17, 2026 12:36
Catalogers that create temp files ad-hoc can easily forget cleanup,
leaking files on disk. Similarly, io.ReadAll is convenient but risks
OOM on large or malicious inputs.

Introduce internal/tmpdir to manage all cataloger temp storage under
a single root directory with automatic cleanup. Prefer streaming
parsers (bufio.Scanner, json/yaml.NewDecoder, io.LimitReader) over
buffering entire inputs into memory. Add ruleguard rules to enforce
both practices going forward.

Signed-off-by: Will Murphy <willmurphyscode@users.noreply.github.com>
willmurphyscode merged commit e388511 into main Mar 18, 2026
10 checks passed
willmurphyscode deleted the chore-no-io-readall branch March 18, 2026 14:53
3 participants