
auditbeat: Add a cached file hasher for auditbeat#41952

Merged
haesbaert merged 7 commits into main from cached-hasher
Dec 11, 2024

Conversation

@haesbaert
Contributor

@haesbaert haesbaert commented Dec 9, 2024

Proposed commit message

This implements an LRU cache on top of the FileHasher from hasher.go; it will be used in the new backend for the system process module on Linux.

The cache is indexed by file path and stores the metadata (what we get from stat(2)/statx(2)) along with the hashes of each file.

When we want to hash a file, we stat() it, then do a cache lookup and compare against the stored metadata; if it differs we rehash, otherwise we use the cached values.

The cache ignores access time (atime); it is only interested in write modifications. If the machine doesn't support statx(2), it falls back to stat(2) but fills in the same unix.Statx_t.
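That comparison can be sketched in Go. The `statMeta` field set below is an assumption for illustration (the real code compares the statx metadata it actually stores), but it shows the key point: every field except atime participates, so plain reads never force a rehash.

```go
package main

import "fmt"

// statMeta mirrors a subset of statx(2) output worth comparing; the exact
// field set here is hypothetical, chosen only to illustrate the idea.
type statMeta struct {
	Ino, Size            uint64
	MtimeSec, CtimeSec   int64
	MtimeNsec, CtimeNsec uint32
	AtimeSec             int64 // present in statx output, but never compared
}

// unchanged reports whether a file can be considered unmodified: every
// field except atime must match, so reading a file never invalidates it.
func unchanged(old, cur statMeta) bool {
	return old.Ino == cur.Ino &&
		old.Size == cur.Size &&
		old.MtimeSec == cur.MtimeSec && old.MtimeNsec == cur.MtimeNsec &&
		old.CtimeSec == cur.CtimeSec && old.CtimeNsec == cur.CtimeNsec
}

func main() {
	a := statMeta{Ino: 7, Size: 42, MtimeSec: 100}
	b := a
	b.AtimeSec = 999             // someone read the file; content is the same
	fmt.Println(unchanged(a, b)) // true
}
```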

With this we end up with a stat() + lookup on the hot path, and a stat() + stat() + insert on the cold path.
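The lookup/insert flow above can be sketched as a toy LRU over the standard library's container/list; the `fileMeta`, `cachedHasher`, and digest-map names are hypothetical for this sketch (the real implementation wraps FileHasher and real statx data):

```go
package main

import (
	"container/list"
	"fmt"
)

// fileMeta is the subset of stat metadata compared on lookup (hypothetical
// field set for this sketch); atime is deliberately absent.
type fileMeta struct {
	Ino, Size uint64
	MtimeSec  int64
	MtimeNsec uint32
}

// node is what each LRU element carries: the path, the metadata observed
// when the file was hashed, and the resulting digests.
type node struct {
	path   string
	meta   fileMeta
	hashes map[string]string // e.g. "sha256" -> hex digest
}

// cachedHasher is a toy LRU keyed by file path.
type cachedHasher struct {
	cap   int
	order *list.List               // front = most recently used
	byKey map[string]*list.Element // path -> element holding *node
}

func newCachedHasher(capacity int) *cachedHasher {
	return &cachedHasher{cap: capacity, order: list.New(), byKey: map[string]*list.Element{}}
}

// lookup returns the cached hashes only if the stored metadata matches the
// metadata just obtained from stat(2)/statx(2); a mismatch evicts the entry.
func (c *cachedHasher) lookup(path string, cur fileMeta) (map[string]string, bool) {
	el, ok := c.byKey[path]
	if !ok {
		return nil, false // cold path: caller hashes the file, then inserts
	}
	n := el.Value.(*node)
	if n.meta != cur { // file was modified since we hashed it
		c.order.Remove(el)
		delete(c.byKey, path)
		return nil, false
	}
	c.order.MoveToFront(el) // hot path: just a stat + lookup
	return n.hashes, true
}

// insert stores freshly computed hashes, evicting the LRU victim if full.
func (c *cachedHasher) insert(path string, meta fileMeta, hashes map[string]string) {
	if el, ok := c.byKey[path]; ok {
		c.order.Remove(el)
		delete(c.byKey, path)
	}
	if c.order.Len() >= c.cap {
		victim := c.order.Back()
		delete(c.byKey, victim.Value.(*node).path)
		c.order.Remove(victim)
	}
	c.byKey[path] = c.order.PushFront(&node{path: path, meta: meta, hashes: hashes})
}

func main() {
	h := newCachedHasher(2)
	m := fileMeta{Ino: 1, Size: 10, MtimeSec: 100}
	h.insert("/usr/sbin/sshd", m, map[string]string{"sha256": "deadbeef"})
	hashes, ok := h.lookup("/usr/sbin/sshd", m)
	fmt.Println(ok, hashes["sha256"]) // true deadbeef
	m.MtimeSec = 101                  // a write happened: next lookup misses
	_, ok = h.lookup("/usr/sbin/sshd", m)
	fmt.Println(ok) // false
}
```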

The motivation for this is that the new backend ends up fetching "all processes", which in turn causes it to try to hash at every event, and the current/old hasher simply can't cope:

  1. Hashing each event is simply too expensive, in the 100us-50ms range with the default configuration, which puts us below 1,000/s.
  2. It has scan-rate throttling that, with the default configuration, easily ends up at 40ms per event (25/s).

With the cache things improve considerably, we stay below 5us (200k/s) in all cases:

```
MISSES
"miss (/usr/sbin/sshd) took 2.571359ms"
"miss (/usr/bin/containerd) took 52.099386ms"
"miss (/usr/sbin/gssproxy) took 160us"
"miss (/usr/sbin/atd) took 50.032us"
HITS
"hit (/usr/sbin/sshd) took 2.163us"
"hit (/usr/lib/systemd/systemd) took 3.024us"
"hit (/usr/lib/systemd/systemd) took 859ns"
"hit (/usr/sbin/sshd) took 805ns"
```

Checklist

  - [x] My code follows the style guidelines of this project
  - [x] I have commented my code, particularly in hard-to-understand areas
  - [ ] I have made corresponding changes to the documentation
  - [ ] I have made corresponding changes to the default configuration files
  - [x] I have added tests that prove my fix is effective or that my feature works
  - [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@botelastic bot added the needs_team label Dec 9, 2024
@haesbaert added the Team:Security-Linux Platform label Dec 9, 2024
@botelastic bot removed the needs_team label Dec 9, 2024
@mergify
Contributor

mergify bot commented Dec 9, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @haesbaert? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

@mergify
Contributor

mergify bot commented Dec 9, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify bot added the backport-8.x label Dec 9, 2024
@haesbaert haesbaert marked this pull request as ready for review December 9, 2024 12:02
@haesbaert haesbaert requested a review from a team as a code owner December 9, 2024 12:02
@elasticmachine
Contributor

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

Contributor

@nicholasberlin nicholasberlin left a comment


LGTM

@haesbaert haesbaert merged commit 8ec2e31 into main Dec 11, 2024
@haesbaert haesbaert deleted the cached-hasher branch December 11, 2024 16:04
mergify bot pushed a commit that referenced this pull request Dec 11, 2024

(cherry picked from commit 8ec2e31)
haesbaert added a commit that referenced this pull request Dec 11, 2024
Co-authored-by: Christiano Haesbaert <haesbaert@elastic.co>
michalpristas pushed a commit to michalpristas/beats that referenced this pull request Dec 13, 2024

Labels

backport-8.x · enhancement · Team:Security-Linux Platform

3 participants