[Auditbeat] System module: Add entity_id fields#10500
Conversation
|
Pinging @elastic/secops |
auditbeat/docs/fields.asciidoc
Outdated
There was a problem hiding this comment.
I'd say it would be worth providing more details for theentity_id (not only this one, but for all). Either say how it is computed or what property it guarantees (i.e. unique value per package installation on one host). The former will be more clear, but the latter would be more future proof (we might change how we compute it).
There was a problem hiding this comment.
How about: This ID uniquely identifies a package on a host. It is computed as a hash of the host ID, package name, and package version.?
And similar wording for the others.
There was a problem hiding this comment.
Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.
But I don't think we need to mention more implementation details than "it's a hash of ..."
There was a problem hiding this comment.
That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.
There was a problem hiding this comment.
I think your proposed wording is a step forward and is enough for now. Too many details would suggest that changing it is a breaking change.
There was a problem hiding this comment.
I'm adding a sentence to each of them saying It is computed as a SHA-256 hash of the host ID, {something else}, and {something else}..
I believe the description only shows up in the docs, so we're more flexible in changing it later if we want.
There was a problem hiding this comment.
Note to self: Should not include InstallTime for the same reasons as outlined in #10508 (comment).
Also, should different versions of a package be counted as two entities?
There was a problem hiding this comment.
Removed InstallTime. Leaving version as part of the entity_id, so two versions of a package will count as two. I think this makes sense because there are cases where two versions are actually installed - e.g. Python 2.6 and 3.x - and they should both be counted.
There was a problem hiding this comment.
Yeah, that seems reasonable. A re-installed package on the same version is really the same thing, IMO.
There was a problem hiding this comment.
Strings are not represented in an endian specific order so you should be able to write h.Write([]byte(hostID)) which will save a few reflection calls.
This happens a few places throughout the code, but I'm just leaving one comment.
There was a problem hiding this comment.
Thanks, I've hopefully caught all of them.
auditbeat/docs/fields.asciidoc
Outdated
There was a problem hiding this comment.
That definition should be fine for users. But if we want other developers to be able to send data with compatible entity IDs we'll need to be much more specific about data encoding.
webmat
left a comment
There was a problem hiding this comment.
Left a few comments, some of which are borderline blockers. A few of my suggestions can be addressed later, however.
auditbeat/docs/fields.asciidoc
Outdated
There was a problem hiding this comment.
Agreed that a bit more info should be provided. I think it's important to mention that these IDs are stable across multiple collections of information, and should change only if there's a significant change. Package version above is a good example, PID change if same process dies and restarts, & so on.
But I don't think we need to mention more implementation details than "it's a hash of ..."
There was a problem hiding this comment.
Would have loved to see an explicit test of a few cases of calculating the entity ID, and the expected result. For example:
- same PID, different collection timestamps => same entity ID (that's the whole point)
- different PID, exact same process start timestamp (unlikely but possible) => different entity ID
- same PID, different collection timestamps, different process start timestamps (stretching it, but possible) => different entity ID
I'm ok if you create an issue to add this as a separate step, given the time pressure to get this in, though.
There was a problem hiding this comment.
Same suggestion goes for each flavor of entity_id, too
There was a problem hiding this comment.
Agreed this needs some unit testing, but I'm fine with it coming in a follow-up PR.
There was a problem hiding this comment.
Well, the code is fairly straightforward (in each case one method with <10 lines), so we'd mostly be checking if Golang's hashing works correctly.
But I agree a test would still be nice, and I'll take it down as a follow-up.
auditbeat/docs/fields.asciidoc
Outdated
There was a problem hiding this comment.
Perhaps a warning that this isn't related to centralized user management. But would rather detect a user being deleted, then recreated with the same name/home/etc.
There was a problem hiding this comment.
Ok, let me change to ID uniquely identifying the user on a host.. That should make it clear this is not across hosts.
There was a problem hiding this comment.
Should we consider the same version being uninstalled and reinstalled as a different entity (installtime)?
If we want to be really strict I think we should. But I'm not sure how helpful that would be. May actually help detect package repo poisoning.
We can always improve that later, however. Perhaps a follow-up improvement after FF?
There was a problem hiding this comment.
Unfortunately, we only have a reliable InstallTime for RPM packages. dpkg does not seem to store it (though at some point we might parse it out of /var/log/dpkg.log), and it is unreliable for Homebrew (directory mod time which can easily change). So that's why I'm not using it to form entity IDs.
There was a problem hiding this comment.
Ok I'm good with this.
I think it's actually valid to have a more solid entity_id that considers the installtime for rpms than for dpkg & brew. The hashes of different families are never meant to be compared to one another.
But I agree this is for later, in any case.
There was a problem hiding this comment.
Have you compared with what goes into the community_id hash, for this one?
I don't think we should use community_id. I'm just suggesting as a sanity check on what gets considered to calculate our entity_id.
I think this is fine, though. I like the inclusion of the inode.
There was a problem hiding this comment.
From what I can see the community ID is using source and dest IP/port, and protocol. For us, we would want to be host specific though, since even if we have both sockets of a connection in the data, we would still want them to have different entity IDs (since they are different entities, owned by different processes and users). That said, we should probably add community ID to the socket fields, to allow correlation between flows and sockets. I think @andrewkroh has suggested it in the past.
There was a problem hiding this comment.
Having the community ID populated wherever possible will make correlation so much easier. So +1 to adding it for the socket dataset.
There was a problem hiding this comment.
This is extra goodness, not a breaking change, though. So up to @cwurm to include now vs wait for 7.1? I think both are fine, it's just a question of time management. I think it's pretty late to add community_id, unless code reuse is more straightforward than I assume :-)
There was a problem hiding this comment.
Yeah, I'll put it in the backlog.
There was a problem hiding this comment.
I would absolutely include user creation date here.
UID can be specified when creating a user. So delete user, re-create with same name/uid needs to be detected. I think user creation date should be enough. Actually do you have access to that?
There was a problem hiding this comment.
We don't have the user creation date, unfortunately. Same as for dpkg and its /var/lib/dpkg/status, Linux does not store the creation time in /etc/passwd. The only way I can see of getting it is through parsing /var/log/auth.log. Something for the future though.
There was a problem hiding this comment.
You calculate it for Windows too? 👍
There was a problem hiding this comment.
Sure, we have the data :)
|
Thanks @andrewkroh and @webmat. I've made a code change for hashing strings, and answered hopefully all comments. If there are no more code changes, I would rather merge this now, and address any changes to the docs in a follow-up PR. What do you think? |
|
I’m good with plan. ^ |
c486cd5 to
d6b2a49
Compare
d6b2a49 to
4ac4a66
Compare
Implements `{entity}.entity_id` as a SHA-256 hash as proposed in elastic#10463.
Closes elastic#10463.
(cherry picked from commit c047ef7)
Implements
{entity}.entity_idas a SHA-256 hash as proposed in #10463.The new fields and what goes in the hash:
system.audit.package.entity_idhost.id + name + versionprocess.entity_idhost.id + PID + StartTimesocket.entity_idhost.id + inode + LocalIP + RemoteIP + LocalPort + RemotePortuser.entity_idhost.id + UID + usernameNote:
socketis a net new top-level object, I just didn't see where else to put it. Open to suggestions.host.idis retrieved when the system module is created and stored so the individual datasets don't have to re-fetch it. It's exposed to all through a newSystemMetricSet.Closes #10463.