
Logparser/tail input on Windows stops sending logs due to file being locked #6539

@roelandvh

Description


Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://192.168.50.10:8086"]
  database = "telegraf"

[[inputs.logparser]]
  files = ["C:/Program Files/Apache/logs/All-AccessLog.*"]
  from_beginning = false

  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} %{RESPONSE_TIME}"]
    measurement = "apache_access_log"

System info:

  • OS: Windows 10 (version 1903)
  • Telegraf: telegraf-1.12.3_windows_amd64

Steps to reproduce:

  1. Start a process that repeatedly requests pages from the Apache server, so that log lines keep being written to the access logs (I used watch -n 10 wget -q -r -P wget-out https://<url>).
  2. Start telegraf:
     telegraf.exe --console --config <path to above config>
  3. In the influx CLI, select the telegraf database and repeatedly execute this query, watching its results:
     select * from apache_access_log order by time desc limit 10;

Expected behavior:

New data points should keep appearing in apache_access_log, shortly after each logline appears in the Apache log.

Actual behavior:

Immediately after starting, fresh data appears in apache_access_log, but it stops arriving after a variable period of time (usually within a few minutes at most), even though new log lines keep appearing in the Apache log files.

Additional info:

I think I've reproduced the cause of this issue with the following simple Go program, which uses the same code for tailing log files as telegraf does:

package main

import (
	"log"
	"os"

	"github.com/influxdata/tail"
)

func main() {
	t, err := tail.TailFile("c:\\Program Files\\Apache\\logs\\All-AccessLog.2019-10-17", tail.Config{
		Follow:   true,
		ReOpen:   true,
		Location: &tail.SeekInfo{Offset: -200, Whence: os.SEEK_END}, // start 200 bytes from the end
		Logger:   log.New(os.Stdout, "[TailLog] ", log.LstdFlags),
		Poll:     true,
	})
	if err != nil {
		log.Fatal(err)
	}

	for line := range t.Lines {
		log.Println(line.Text)
	}
}

After some experimenting (printf-debugging, really) in the influxdata/tail code, I noticed that the tail would sooner or later die as the result of a "non-EOF error". The output of the above program was:

...
[lines from the file that is being tailed]
...
[TailLog] 2019/10/17 16:01:26 non-EOF error: %!(EXTRA *os.PathError=read c:\Program Files\Apache\logs\All-AccessLog.2019-10-17: The process cannot access the file because another process has locked a portion of the file.)

This happens at the following place in tail.go, in the tailFileSync func:

...
		line, err := tail.readLine()

		// Process `line` even if err is EOF.
		if err == nil {
...
		} else if err == io.EOF {
...
		} else {
			tail.Logger.Printf("non-EOF error: ", err) // Line added by me
			// non-EOF error
			tail.Killf("Error reading %s: %s", tail.Filename, err)
			return
		}

So it appears the issue is caused by Apache writing additional lines to the file at the same moment the telegraf logparser tries to read new lines: the writer's lock on part of the file makes the concurrent read fail, and the tail treats that as a fatal error.

(Initially I wanted to report this at https://github.com/influxdata/tail but that repository doesn't accept issues.)

Labels

area/tail, bug, help wanted, platform/windows, size/l, upstream
