Implement new parser for parser_syslog by repeatedly · Pull Request #2599 · fluent/fluentd

repeatedly · 2019-09-03T05:54:45Z

Signed-off-by: Masahiro Nakagawa repeatedly@gmail.com

Which issue(s) this PR fixes:
Fixes #2585

What this PR does / why we need it:
Changing regexp is difficult because it may break existing user environment, so
add non-regexp based parser to support more syslog message.
This parser is also 2x faster.

require 'benchmark/ips'
require 'fluent/engine'
require_relative 'lib/fluent/plugin/parser_syslog'

parser = Fluent::Plugin::SyslogParser.new
parser.configure(Fluent::Config::Element.new('ROOT', '', {'keep_time_key' => true}, []))
log = 'Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'
parser2 = Fluent::Plugin::SyslogParser.new
parser2.configure(Fluent::Config::Element.new('ROOT', '', {'with_priority' => true, 'keep_time_key' => true}, []))
log2 = '<1>Feb 28 12:00:00 192.168.0.1 fluentd[11111]: [error] Syslog test'

Benchmark.ips do |x|
  x.report "regexp" do
    parser.parse_plain(log) { |t, r| }
  end
  x.report "string" do
    parser.parse_rfc3164(log) { |t, r| }
  end
  x.report "regexp with pri" do
    parser2.parse_plain(log2) { |t, r| }
  end
  x.report "string with pri" do
    parser2.parse_rfc3164(log2) { |t, r| }
  end
end

Warming up --------------------------------------
              regexp    20.246k i/100ms
              string    42.531k i/100ms
     regexp with pri    18.060k i/100ms
     string with pri    37.281k i/100ms
Calculating -------------------------------------
              regexp    218.482k (± 1.7%) i/s -      1.093M in   5.005516s
              string    488.549k (± 1.5%) i/s -      2.467M in   5.050341s
     regexp with pri    195.467k (± 2.1%) i/s -    993.300k in   5.083917s
     string with pri    408.422k (± 3.7%) i/s -      2.050M in   5.028522s

Docs Changes:
Add new parameters and log example.

Release Note:
Same as title.

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

ganmacs · 2019-09-04T03:38:02Z

+        if @with_priority
+          if text.start_with?(PRI_START_CHAR)
+            i = text.index('>'.freeze, 1)
+            pri = text.slice(1, i - 1).to_i


pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

https://tools.ietf.org/html/rfc3164#section-4.1.1

The PRI part MUST have three, four, or five characters

Need it check i is less than 5?

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

I don't think so because the purpose of this parser is for supporting more format, not strict parser.

What kind of format do we want to support?
If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

If this parser does not follow the rfc3164, it's probably better to use new message_format name (rfc3164_ext or like that).

The problem is existing products uses rfc3164 for it. rfc3164 describes the collection of existing message format and BSD spec but many existing tools doesn't follow rfc3164 strictly. This parser can support strict rfc3164 format, so it doesn't block user's usecase for me.

pri can be 0 if i < 2 or text.slice(1, i - 1) contains not number charactors. Is it acceptable?

Is it okay?

ganmacs · 2019-09-04T03:46:33Z

+        if time_end == SPLIT_CHAR
+          time_str = text.slice(start, diff)
+          start += 16 # time + ' '
+        elsif time_end == '.'.freeze


rfc3164 seems not to support subsecond time https://tools.ietf.org/html/rfc3164#section-4.1.2

Yes but some products send rfc3164 syslog message with subsecond time.
regexp version also supports subsecond time.

then, why don't you change the method name? this is not parse_rfc3164.

Comment it on above.

ganmacs · 2019-09-04T06:06:48Z

+                  end
+                  text.slice(i + 1, text.bytesize)
+                else
+                  text.slice(i - diff, text.bytesize)


In this case, ident(TAG in rfc) does not exist, does message follow the rfc?

I'm not sure but some product sends such message. And this is good for capturing all MSG broken syslog messages.

ganmacs · 2019-09-04T06:12:01Z

+        msg.chomp!
+        record['message'] = msg
+
+        time = @time_parser.parse(time_str.squeeze(SPLIT_CHAR))


time_str was sliced with size(15), right? why do we need to call String#squeeze?

I can't remember the actual product name but users hit the problem with the number of ' '.
regexp version also calls squeeze for avoiding this problem.

value.gsub(/ +/, "") is introduced by #68 ce1c410
But we don't need using String#squeeze to parse time string including redundant spaces.
Now we can remove String#squeeze here.

>> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 05 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> strptime = Strptime.new("%b %d %H:%M:%S") => #<Strptime:0x000055e8adc01110> >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 05 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 05 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900

Existing regex does not specify the size of chars. but here specifies the fixed size(15).
So if the character which needs to call squeeze is passed to this filter plugin, the parser will fail to parse it in any case, right?.

ganmacs · 2019-09-04T06:15:59Z

+        host = text.slice(start, diff)
+        start += (diff + 1)
+
+        i = text.index(SPLIT_CHAR, start)


If the text is ${TIMESTAMP} ${HOST} test:val, this parser won't work. and I think test:val is valid syslog message.
ident should be test and msg should be val

This is very difficult problem because rfc3164 doesn't define how ident and message are separated. For the experience, here is an actual examples:

HEADER ident msg # No problem HEADER ident: msg # No problem HEADER ident:msg # Existing products send this as only msg, not ident:msg.

So to support both cases, adding option is better.

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

okkez · 2019-09-05T02:04:02Z

+        end
+
+        # header part
+        time_diff = 15 # skip Mmm dd hh:mm:ss


How about time_width or time_size or time_str_size?

okkez · 2019-09-05T02:08:27Z

+          cursor += 16 # time + ' '
+        elsif time_end == '.'.freeze
+          # support subsecond time
+          i = text.index(SPLIT_CHAR, time_diff)


How about time_end_pos or something instead of i?

okkez · 2019-09-05T02:09:24Z

+          return
+        end
+
+        i = text.index(SPLIT_CHAR, cursor)


How about host_end_pos instead of i?

okkez · 2019-09-05T02:09:50Z

+          yield nil, nil
+          return
+        end
+        host_diff = i - cursor


How about host_width or host_size?

okkez · 2019-09-05T02:10:34Z

+        end
+        host_diff = i - cursor
+        host = text.slice(cursor, host_diff)
+        cursor += (host_diff + 1)


Redundant brackets?

okkez · 2019-09-05T02:13:14Z

+              else
+                if text[i - 1] == ':'.freeze
+                  if text[i - 2] == ']'.freeze
+                    j = text.index('['.freeze, cursor)


How about left_bracket_pos instead of j?

Why .freeze for ":", "[", "]"? Shoud we extract them to constants?

Yeah, we can extract it. This is used at only one point so I use literal instead.

okkez · 2019-09-05T02:16:18Z

+                if text[i - 1] == ':'.freeze
+                  if text[i - 2] == ']'.freeze
+                    j = text.index('['.freeze, cursor)
+                    record['ident'] = text.slice(cursor, j - cursor)


How about adding a local value ident_size = j - cursor?

okkez · 2019-09-05T02:18:11Z

+                  if text[i - 2] == ']'.freeze
+                    j = text.index('['.freeze, cursor)
+                    record['ident'] = text.slice(cursor, j - cursor)
+                    record['pid'] = text.slice(j + 1, i - j - 3) # remove '[' / ']:'


How about adding local values pid_start_pos = j + 1 and pid_size = i - j - 3?
j may be changed to left_bracket_pos.

okkez · 2019-09-05T02:18:39Z

+                    record['ident'] = text.slice(cursor, j - cursor)
+                    record['pid'] = text.slice(j + 1, i - j - 3) # remove '[' / ']:'
+                  else
+                    record['ident'] = text.slice(cursor, i - cursor - 1)


okkez · 2019-09-05T02:23:25Z

+        msg.chomp!
+        record['message'] = msg
+
+        time = @time_parser.parse(time_str.squeeze(SPLIT_CHAR))


value.gsub(/ +/, "") is introduced by #68 ce1c410
But we don't need using String#squeeze to parse time string including redundant spaces.
Now we can remove String#squeeze here.

>> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 5 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> Time.strptime("Feb 05 11:11:11", "%b %d %H:%M:%S") => 2019-02-05 11:11:11 +0900 >> strptime = Strptime.new("%b %d %H:%M:%S") => #<Strptime:0x000055e8adc01110> >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 05 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 05 11:11:11") => 2019-02-05 11:11:11 +0900 >> strptime.exec("Feb 5 11:11:11") => 2019-02-05 11:11:11 +0900

ganmacs · 2019-09-05T03:11:27Z

+      @parser.configure('parser_type' => 'string')
+      @parser.instance.parse("1990 Oct 22 10:52:01 TZ-6 scapegoat.dmz.example.org 10.1.2.32 sched[0]: That's All Folks!") { |time, record|
+        expected = {'host' => 'scapegoat.dmz.example.org', 'ident' => 'sched', 'pid' => '0', 'message' => "That's All Folks!"}
+        assert_not_equal(expected, record)


what the result of parsing it? I think assert_equal should be used.

Actual generated record is not important for this case.
This test shows parser can't parse non-standard HEADER/MSG syslog message correctly, so assert_not_equal seems no problem for me.

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

repeatedly · 2019-09-09T04:37:31Z

If no other concerns, I will merge this patch.

repeatedly self-assigned this Sep 3, 2019

repeatedly mentioned this pull request Sep 3, 2019

Fluentd input plugin syslog is truncating syslog entry #2585

Closed

repeatedly force-pushed the add-new-parser-to-syslog branch from 39ec522 to ede1ff1 Compare September 3, 2019 06:14

parser_syslog: Implement new parser for syslog rfc3164

68f3ec7

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

repeatedly force-pushed the add-new-parser-to-syslog branch from ede1ff1 to 68f3ec7 Compare September 3, 2019 06:21

repeatedly added the enhancement Feature request or improve operations label Sep 3, 2019

parser_json: Add comparison test for regexp and string parser

f8af924

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

repeatedly requested review from ganmacs and okkez September 3, 2019 11:55

repeatedly marked this pull request as ready for review September 3, 2019 11:55

ganmacs reviewed Sep 4, 2019

View reviewed changes

repeatedly added 3 commits September 4, 2019 17:09

parser_syslog: refactor code

302852e

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

parser_syslog: Fix only no whitespace content handling

94887d8

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

parser_syslog: Separate broken syslog tests

4154272

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

okkez requested changes Sep 5, 2019

View reviewed changes

ganmacs reviewed Sep 5, 2019

View reviewed changes

repeatedly added 3 commits September 5, 2019 13:46

parser_syslog: Remove squeeze from new parser

6fc5fb4

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

parser_syslog: Check empty priority case in new parser

2bc4294

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

parser_syslog: Apply review for better variable name

60c838f

Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com>

repeatedly merged commit 1e4e0c1 into master Sep 9, 2019

repeatedly deleted the add-new-parser-to-syslog branch September 9, 2019 10:54

repeatedly mentioned this pull request Jun 2, 2020

syslog_parser: Add string parser for rfc5424 #3015

Merged

Conversation

repeatedly commented Sep 3, 2019

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

repeatedly Sep 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

repeatedly commented Sep 9, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

repeatedly Sep 4, 2019 •

edited

Loading