Conversation
Pinging @elastic/secops
Mentioning it mostly for awareness: if I'm not wrong, the Auditbeat `report.Event()` doesn't guarantee at-least-once delivery. If Elasticsearch is down, for example, it will retry 3 times and then drop the event. This means that we can potentially lose login events despite saving the state. I think we could solve this after the Filebeat refactoring, but I suggest documenting this as a limitation for now.
Hm, that's a shame. I guess this affects all metricsets, and we might lose process starts, socket connections, password changes, audit events, FIM events, etc.
The internal queue does not keep events until they can be sent (or it overflows)? That would seem desirable.
It does, and one can also use spooling to disk to make the queue larger. But Filebeat offers an extra guarantee by waiting for the ACK and only afterwards updating the state on disk. This is not currently possible in Auditbeat. That's the difference I wanted to highlight.
Right - maybe at some point `report.Event()` could return false if it couldn't write to the queue or otherwise can't guarantee eventually sending the event, so we can retry later.
We should think through what the experience is on non-Linux systems. I'm thinking two things:
Is the first point already happening? I couldn't test because of the
That should be what's happening (in login_other.go). I've added some extra
It shouldn't - there's an if/else around it in the
I opened #9368 to address that - it compiles for me after applying that. I've changed to using
tsg left a comment
LGTM. Thanks for addressing the comments.
andrewkroh left a comment
This will be a useful metricset. I left a few comments that need to be addressed now, and some others that are just suggestions. If you have any questions let me know.
What does this mean for big-endian architectures? Will this code work?
Hm, good point and I don't know. Do we support any big-endian systems?
I'm surprised Go doesn't have a binary.MachineEndian.
```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

var MachineEndian = getByteOrder()

func getByteOrder() binary.ByteOrder {
	var b [2]byte
	*((*uint16)(unsafe.Pointer(&b[0]))) = 1
	if b[0] == 1 {
		return binary.LittleEndian
	}
	return binary.BigEndian
}

func main() {
	var b [4]byte
	MachineEndian.PutUint32(b[:], 0x12345678)
	fmt.Printf("%x %x %x %x\n", b[0], b[1], b[2], b[3])
}
```
Just revisiting this. I'm thinking that it might not be likely that this would be used on a big-endian system that uses UTMP login records. In the interest of keeping the code simple I'd leave it as is. If there's a need to change things we always can.
@adriansr getByteOrder() looks like a useful function that Go itself should have, indeed. Or maybe we should have it in a shared location within Beats so code like this here can use it easily.
I think it's likely that someone will use it on big-endian. We've had several contributions to go-libaudit to make auditbeat work on big-endian.
Ok, I've added support for it. Thanks @adriansr for the code snippet!
This is probably fine for low volumes. But we have seen from the auditd module that this call is expensive. Some caching could be a useful future enhancement.
Same - long name, and many exported symbols that we might not need elsewhere.
LoginRecord is encoded and decoded by gob so it needs to be exported, unfortunately.
I think for now it's unlikely that UTMP files would be stored on separate file systems. I.e. it would mean different parts of /var/log are on different devices. I'm not even sure that's possible.
How big do we expect these files to be? Looks like this function will load all available logs into memory on first run.
So far I've assumed they're usually quite small, but that might not be true in all cases. The /etc/logrotate.conf on my Ubuntu VM rotates it monthly. On that VM that's getting a number of daily logins/logouts, they're <100K. On a real-world machine with a lot of logins, I assume it could be more, I wonder how many records it would take to be significant.
Nevertheless, I guess we could use a channel between Fetch and ReadNew and output records as events as they come.
This is more or less what Filebeat does as well: have one goroutine per file and forward event by event. That is, the module/input is more or less active all the time. I'm not sure how well this fits the 'Fetch' semantics in the Metricbeat framework, but a bounded channel can indeed help keep memory usage in check. Given the complexity this introduces, I wonder if that's really a problem.
I've changed to using channels now, and storing offsets of the files (instead of re-reading until the last read record). So hopefully that will make it scalable even if the files are very large.
Looks like we are ignoring all messages read so far if we end up here (e.g. if we read while copytruncate-style log rotation happens). Do we still want to publish the known records? Alternatively, we can just error and return here, and rely on Fetch to reset the file offset and read the new contents the next time it is executed.
Good point - maybe I was a bit overzealous in making sure we never miss an event here, and just returning is better.
However, I'm curious about copytruncate: As I understand it, the logfile is copied (e.g. cp /var/log/wtmp /var/log/wtmp.1) and then truncated. If that happens, I suspect the metricset as it is at the moment (with the file pattern being /var/log/wtmp* by default) would read all of /var/log/wtmp.1 (because it'll have a different inode) - even though it has read all of those records already. Does Filebeat handle this in some way?
But can wtmp be copy-truncated? I thought it's guaranteed to be renamed with moves. I wouldn't want to bring all the complexity of Filebeat's corner cases in here :)
Rename with move should be OK, but then we don't get into this code path.
Instead of having a tail-recursive call we could use a for loop, reset some state, and use continue to jump back to the beginning of the reader without dropping events. But then, it's an edge case that normally shouldn't occur.
The code is now using offsets and if there is some problem in jumping to the stored offset it will jump back to the beginning and re-read the file. This is maybe a bit overzealous, but I think at least for now I'd prefer it if events were sent twice rather than not at all. If there are any problems we can always change.
Force-pushed a165894 to fe229c7
adriansr left a comment
LGTM, suggesting minor refactor, feel free to do it on a separate PR.
```go
var byteOrder = getByteOrder()

func getByteOrder() binary.ByteOrder {
```
Turns out gosigar has a GetEndian() method for this.
Adds the login metricset to the Auditbeat system module as the last of the six initial metricsets. It only works on Linux, and detects not just user logins and logouts, but also system boots and shutdowns. It works by reading the /var/log/wtmp and /var/log/btmp files (and rotated files) present on Linux systems. In reading a file, it is similar to Filebeat, except that UTMP is a binary format, so reading happens using a binary Go reader. (cherry picked from commit 1566e66)
This adds the `login` metricset to the Auditbeat system module. It's the last of the six initial metricsets. It only works on Linux, and detects not just user logins and logouts, but also system boots and shutdowns.

It works by reading the `/var/log/wtmp` and `/var/log/btmp` files (and rotated files) present on Linux systems. In reading a file, it is similar to Filebeat, except that UTMP is a binary format, so reading happens using a binary Go reader. See utmp(5) for the format of that file.

The logic is roughly as follows:

- `login.utmp_file_pattern` and `login.btmp_file_pattern` contain the patterns matching the wtmp (good logins, as well as system shutdowns and boots) and btmp (bad/failed logins) files and rotated files (if desired). The defaults are `/var/log/wtmp*` and `/var/log/btmp*`. These are expanded using `filepath.Glob` and the files are sorted lexicographically in reverse order (i.e. `/var/log/wtmp.1` will come before `/var/log/wtmp`) so that we read older login records first - reading in order is required for matching login and logout records, see the next steps.
- On each `Fetch`, it checks for new entries: any new files are read from the beginning, while known files are read from a saved offset. To that purpose, the last offset per file is saved and persisted to disk in `beat.db`. A new file is one that has an unknown inode, but files are also read completely if their `newSize < oldSize` for some reason (that should make it work with any potential inode reuse - very unlikely, since this will never read a lot of files, but still possible).
- Records are parsed into login events (`LoginRecord` in the code). Boot and shutdown events are fairly straightforward; user login and logout events have to be matched using their `tty` - so there is a `loginSessions` map that stores logins to enrich the logouts, and is also persisted to disk.

Note: This dataset also introduces `event.origin`, containing the file the event came from, e.g. `/var/log/wtmp.1`. In other cases, it would be something like `procfs` or `netlink`. It's useful to know where information comes from, e.g. to know how reliable it is.