Logstash information:
tested with versions:
8.8.1, 7.12.1 and 7.17.10
JVM (e.g. java -version):
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment Temurin-11.0.19+7 (build 11.0.19+7)
OpenJDK 64-Bit Server VM Temurin-11.0.19+7 (build 11.0.19+7, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux qadebuglog 5.10.0-23-amd64 #1 SMP Debian 5.10.179-1 (2023-05-12) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Pipeline
input {
  tcp {
    port => 33039
    codec => fluent
  }
}
filter {
  if [log] =~ "\A\{.+\}" {
    json {
      source => "log"
    }
    if "_jsonparsefailure" not in [tags] {
      mutate {
        remove_field => ["log"]
      }
    }
  }
  else {
    mutate {
      rename => ["log", "message"]
    }
  }
}
output {
file {
path => "/var/log/logstash/data.log"
}
}
If the "log" field contains an invalid UTF-8 sequence, Logstash stops the pipeline and shuts itself down (see the logs below).
The issue happens on this line:
if [log] =~ "\A\{.+\}" {
Yes, the filters could probably be written more elegantly, but invalid UTF-8 shouldn't "crash" Logstash itself.
Last year someone posted this issue at https://discuss.elastic.co/t/input-tcp-codec-fluent-invalid-byte-sequence-in-utf-8-in-regex/296290, but unfortunately there wasn't any response.
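As a possible workaround until this is handled in Logstash itself, the value could be cleaned before the regex conditional sees it. The following is only a minimal sketch in plain Ruby, not Logstash code: `scrub_invalid_utf8` and `JSON_LIKE` are hypothetical names introduced for illustration, and `"?"` is an arbitrary replacement character. It relies only on the standard `String#valid_encoding?` and `String#scrub` methods.

```ruby
# Hypothetical helper (not part of Logstash): returns a copy of `value`
# in which every invalid byte sequence is replaced, so that a later
# regex match no longer has to deal with a broken string.
def scrub_invalid_utf8(value, replacement = "?")
  return value unless value.is_a?(String)   # leave non-string fields alone
  return value if value.valid_encoding?     # nothing to repair
  value.scrub(replacement)
end

# Same pattern as the conditional in the pipeline above.
JSON_LIKE = /\A\{.+\}/

# Build a UTF-8 tagged string containing the byte 0xFF, which can never
# occur in valid UTF-8. This stands in for the broken "log" field.
broken = "{\"a\": \"\xFF\"}".b.force_encoding(Encoding::UTF_8)
clean  = scrub_invalid_utf8(broken)

puts broken.valid_encoding?
puts clean.valid_encoding?
puts clean.match?(JSON_LIKE)
```

I assume the same few lines could be placed in a `ruby` filter ahead of the conditional (reading the field with `event.get("log")` and writing it back with `event.set`), but I have not verified that this prevents the pipeline shutdown in the versions listed above.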
Steps to reproduce:
Deliver a JSON line like this via a fluentd instance:
{"log": "�"}
The invalid byte sequence is a 0x3c character.
Provide logs (if relevant):
[2023-06-13T15:55:50,315][ERROR][logstash.javapipeline ][main] Pipeline worker error, the pipeline will be stopped {:pipeline_id=>"main", :error=>"(ArgumentError) invalid byte sequence in UTF-8", :exception=>Java::OrgJrubyExceptions::ArgumentError, :backtrace=>["org.jruby.RubyRegexp.match?(org/jruby/RubyRegexp.java:1170)", "RUBY.start_workers(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:304)"], :thread=>"#<Thread:0x2c8de6c@/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:134 sleep>"}
[2023-06-13T15:55:52,330][WARN ][logstash.javapipeline ][main] Waiting for input plugin to close {:pipeline_id=>"main", :thread=>"#<Thread:0x2c8de6c@/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:134 run>"}
[2023-06-13T15:55:54,658][INFO ][logstash.javapipeline ][main] Pipeline terminated {"pipeline.id"=>"main"}
[2023-06-13T15:55:54,965][INFO ][logstash.pipelinesregistry] Removed pipeline from registry successfully {:pipeline_id=>:main}
[2023-06-13T15:55:54,972][INFO ][logstash.runner ] Logstash shut down.