Syslog inputs parses RFC3164 events via TCP or UDP#6842
Syslog inputs parses RFC3164 events via TCP or UDP#6842kvch merged 1 commit intoelastic:masterfrom ph:feature/syslog-parser-with-udp-tcp
Conversation
filebeat/input/syslog/input.go
Outdated
There was a problem hiding this comment.
This won't work as desired.
This should never be used from any global contexts, otherwise you will receive a no-op Logger. This is because the logp package needs to be initialized first. Instead create new Logger instance that your object reuses.
https://godoc.org/github.com/elastic/beats/libbeat/logp#NewLogger
filebeat/input/syslog/input.go
Outdated
There was a problem hiding this comment.
You can write //go:generate ragel -Z -G2 parser.rl -o parser.go so that go generate will run the commands for you.
There was a problem hiding this comment.
Oh I've commented it for testing something, I will revert it back.
There was a problem hiding this comment.
Yup, not too hard to understand and bit clearer to understand than a big regexp.
High level is each declaration prio = "<" + prioprity ">" will create a state machine, you can compose them together and reuse them, ragel will take care of extrapolating all of them. This file could be used as is in any language supported by ragel: python, ruby and Java. So if this code is OK, it could be shared with the logstash-input-syslog instead of using grok.
There was a problem hiding this comment.
I really like this approach. Do you mind sharing why did you choose ragel? What other lexers did you consider? I am just curious.
There was a problem hiding this comment.
I haven't look too much in other lexer other than https://github.com/timtadh/lexmachine
I've worked with ragel before, I've made a SQL like DSL on top of a lucene document. They really focus on speed and this is a mature project.
Also, I like the fact that the lexer and the target language are not linked, Meaning we can target both the beats syslog input and the logstash input syslog. But ragel isn't the only tool available that can do that.
filebeat/inputsource/tcp/config.go
Outdated
There was a problem hiding this comment.
Can one input have tcp and udp at the same time? If not, we should probably have the syslog input defined twice here.
There was a problem hiding this comment.
it cannot for complexity reason :) I am OK to have it twice.
There was a problem hiding this comment.
I'm glad it can't. Perhaps we should have a check to make sure only one is used at the same time.
There was a problem hiding this comment.
Uniqueness of the protocol is already taken care with the Config.namespace's handling, more than one namespace configured accessing 'filebeat.inputs.1.protocol'
There was a problem hiding this comment.
Will take a note to check if we can change that error message.
There was a problem hiding this comment.
Should we bring message size for the 2 in line? Also the way it's configured, I like the humanize one.
There was a problem hiding this comment.
I will need to add support for humanize in udp, I will do it.
filebeat/input/syslog/parser_test.go
Outdated
There was a problem hiding this comment.
Nice, that are the tests I was looking for.
filebeat/input/syslog/input.go
Outdated
There was a problem hiding this comment.
Could we align the fields here already with ECS? This would be process.pid.
To make sure we don't conflict, all these fields should be prefixed with syslog. if they are not already used fields.
|
@ph Do you mind opening an issue for RFC5424 after this PR is merged? |
|
@kvch 👍 for another issue for rfc5424, some state machine of the parser can be reused as is, the |
filebeat/input/syslog/config.go
Outdated
There was a problem hiding this comment.
Please humanize this option, as it's done in defaultTCP.
There was a problem hiding this comment.
It would be nice to make bytes a first class citizen in ucfg.
There was a problem hiding this comment.
@andrewkroh I prefer to have it in the beats repository, this way we can more control deprecation if needed.
There was a problem hiding this comment.
Please add a space before = :)
filebeat/input/syslog/parser_test.go
Outdated
There was a problem hiding this comment.
When are test cases going to be added to this? I would like to see that we fail properly on unsupported messages to make sure we don't forward unpredictable or garbage events.
There was a problem hiding this comment.
I can delete that, I've added all tests for somewhat invalid syslog in the main test https://github.com/elastic/beats/pull/6842/files#diff-6ae69c7f1d1edd97ba8ebcdc7df2e855R18
On failure the catch all is to only populate the event.Message and for us to be a valid syslog event we must have timestamp and a message defined, everything else is just extra data.
|
Thanks for implementing BSD syslog protocol. It's great that you added TCP to the supported transports, regardless that it's not required by the legacy protocol. I am very happy we went for using lexers instead of unmaintainable monster regexes. Furthermore, with this PR we move forward with extending capabilities and use cases of Filebeat which is the "beatific" way. But it's clear that the current architecture is not flexible enough yet. I am a bit sad it can't receive syslog messages from arbitrary source (e.g file). Unfortunately, right now reading from sources and parsing the incoming bytes is too tightly coupled to achieve that. This coupling is the a major problem with the architecture of your PR. That's where all of my review notes stems from. I haven't shared those problems here, as it is out of scope of this PR. Also, this feature is valuable as is. We really need a big refactor to address this architectural problem. We should decouple transport layer from creating, parsing and transforming events. Otherwise we would never be able to read syslog messages from a local file. (Unless, we send through to a local port over TCP or UDP :D) I am glad to merge this PR as soon as review notes are resolved so we could focus on the refactor. FYI I am preparing a proposal of this. :) |
Yes, this is why I was making baby steps, moving the UDP logic and the TCP out of the inputs so they can become composable. I wanted to do the same with the log inputs, the syslog parsing is currently coupled with theses two protocols, but the syslog parsing is just a function call so it can be another building block.
This is nice I was working on a similar proposal, we should probably sync on that, when I've created that feature, I've seen the same problems but it would require a bit more work to achieve without doing some refactoring. My idea higher level was to be able mostly works with stream and leverage more of the processor interface. To be able to do:
|
|
@kvch If we want to make it work right now with the file input it could be a processor, nothing prevent us to do that without the need to refactor. The syslog input could just initialize the processor or use the library directly. (maybe we have some problems with the system module) |
|
@ph I would keep it as is, so we have a chance to release it in 6.3. |
|
@kvch @ruflin I've updated the PR with the reviews comments, I've followed ECS as much as I can, I think it make sense. I've updated the fields too, added a wip docs for it. The following is an example of the output if all fields are present: {
"@timestamp": "2018-10-12T02:14:15.000Z",
"@metadata": {
"beat": "filebeat",
"type": "doc",
"version": "7.0.0-alpha1",
"truncated": false
},
"message": "'su root' failed for lonvick on \/dev\/pts\/8 0",
"source": "127.0.0.1:56949",
"hostname": "wopr.mymachine.co",
"event": {
"severity": 5
},
"input": {
"type": "syslog"
},
"syslog": {
"priority": 13,
"severity_label": "Notice",
"facility_label": "user-level"
"facility": 1
},
"process": {
"pid": 2000,
"program": "postfix"
},
"prospector": {
"type": "syslog"
},
"beat": {
"name": "sashimi",
"hostname": "sashimi",
"version": "7.0.0-alpha1"
}
} |
There was a problem hiding this comment.
@dedemorton This input share all the options from the TCP/UDP inputs, what would be the best way to keep track of them here? Should we extract the options into external files that can be included in the syslog / tcp and udp.
Something like
tcp-options
udp-options
tcp input include tcp-options
udp input include udp options
syslog input include both tcp-options and udp-options
There was a problem hiding this comment.
Yup, I'd do something similar to what I've done here:
So maybe you would have inputs/input-common-tcp-options.asciidoc and input-common-udp-options.asciidoc. Make sure that you add a comment to the top of the file explaining how the options are shared (see input-common-file-options.asciidoc for an example).
Of course, you'll need to use attributes to resolve the section IDs for each option because the IDs need to be unique across the book. Something like:
[id="{beatname_lc}-input-{type}-option-name"]
|
@ph ECS implementation looks really good 🎉 . The only thing I wonder is the content of |
|
@ruflin So its kind-of-agent |
The Syslog inputs will use the UDP and TCP source lib, allowing the same socket behavior and the
same options as the two existing inputs. The parser is a state machine build with ragel[1] and allow
to parse FC3164[2] events with some less than perfect variants, if the received event is a complete
RFC3164 we will extract all of them, for us the minimum valid message MUST have the `date` and the
`message` defined.
Anything else we will log and drop them.
Fields:
* priority
* timestamp
* program
* pid
* message
* facility: extracted from the priority
* severity: extracted from the priority
* severity_label: mapped from the official list.
* facility_label: mapped from the official list[2]
Sample Configuration:
```yaml
#enabled: false
#protocol.tcp:
# The host and port to receive the new event
#host: "localhost:9000"
# Character used to split new message
#line_delimiter: "\n"
# Maximum size in bytes of the message received over TCP
#max_message_size: 20MiB
# The number of seconds of inactivity before a remote connection is closed.
#timeout: 300s
#protocol.udp:
# The host and port to receive the new event
#host: "localhost:9000"
# Maximum size of the message received over UDP
#max_message_size: 10240
```
Limitations:
* Doesn't support multiline events like darwin can do, we need to extract the multiline logic from the log input.
* Only support RFC3164, RFC5424 will require more work on the parser.
close #5862
[1]: http://www.colm.net/open-source/ragel/
[2]: https://tools.ietf.org/html/rfc3164
|
I believe there is a bug in syslog_rfc3164.rl . See referenced issue: |
The Syslog inputs will use the UDP and TCP source lib, allowing the same socket behavior and the
same options as the two existing inputs. The parser is a state machine build with ragel and allow to parse FC3164 events with some less than perfect variants, if the received event is a complete
FC3164 we will extract all of them, for us the minimum valid message MUST have the
dateand themessagedefined.Anything else we will log and drop them.
Fields:
Sample Configuration:
Limitations:
close #5862
Missing in this PR
Note to reviewers:
is done at the state machine level.
parser.go, this is a generated file fromparser.rlandsyslog_rfc3164.rl, it's a messy file using all the possibletricks in the book to make it really fast.
Speed
Performance looks OK to me, I don't have the number from my regular expression poc, but it's much better.