Spooling to disk GA #6859

@urso

Description

Add spooling to disk to Beats. Spooling all events to disk is useful if the output is blocked or not fast enough to deal with bursts of events. With spooling to disk available, Metricbeat modules will not be blocked, and Filebeat gains a way of copying events out of very quickly rotating log files.
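For context, the feature is enabled via the `queue.spool` settings in the Beat's configuration. A sketch with illustrative values (see the queue documentation task below for the authoritative settings):

```yaml
queue.spool:
  file:
    path: "${path.data}/spool.dat"   # on-disk spool file
    size: 512MiB                     # max disk space; producers block when full
    page_size: 16KiB
  write:
    buffer_size: 10MiB
    flush.timeout: 5s                # flush (and ACK) at latest after this timeout
    flush.events: 1024               # or once this many events are buffered
```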

Requirements:

  • Consistency and ability to recover on failures/crash
  • Configurable limit on queue size (disk space usage); block producers if the queue is full.
  • Async Producer ACK signal
    • The Beats pipeline requires an async ACK of the last N events on flush. This gives Filebeat the chance to update the registry file once events have been flushed to the spool file.
  • Async Consumer ACKing
    • Events must only be removed from the queue after the async ACK signal from the output has been received. This allows resends across restarts if dequeued events have not been ACKed yet.
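A toy, in-memory sketch of the ACK contract described above (illustrative names, not the actual libbeat queue API): the producer ACK fires only after flush, and events leave the queue only on consumer ACK.

```go
package main

import "fmt"

type event struct{ data string }

// batch carries the producer's ACK callback, which must fire only
// after the events have been flushed to the spool file.
type batch struct {
	events []event
	ack    func(n int) // n events flushed
}

// spool is a toy in-memory stand-in for the on-disk queue.
type spool struct {
	pending []batch // published but not yet flushed
	queued  []event // flushed, durable, not yet consumer-ACKed
}

// publish buffers a batch; the producer ACK is deferred until flush.
func (s *spool) publish(b batch) { s.pending = append(s.pending, b) }

// flush "writes" pending batches, then signals producer ACKs,
// giving e.g. Filebeat the chance to update its registry.
func (s *spool) flush() {
	for _, b := range s.pending {
		s.queued = append(s.queued, b.events...)
		if b.ack != nil {
			b.ack(len(b.events))
		}
	}
	s.pending = nil
}

// consume hands out up to n events but does NOT remove them yet.
func (s *spool) consume(n int) []event {
	if n > len(s.queued) {
		n = len(s.queued)
	}
	return s.queued[:n]
}

// consumerACK removes events only after the output has ACKed them,
// so un-ACKed events survive a restart and can be resent.
func (s *spool) consumerACK(n int) { s.queued = s.queued[n:] }

func main() {
	s := &spool{}
	acked := 0
	s.publish(batch{
		events: []event{{"a"}, {"b"}},
		ack:    func(n int) { acked += n },
	})
	fmt.Println("producer ACKs before flush:", acked) // 0
	s.flush()
	fmt.Println("producer ACKs after flush:", acked) // 2

	out := s.consume(2)
	fmt.Println("consumed:", len(out), "still queued:", len(s.queued)) // 2, 2
	s.consumerACK(len(out))
	fmt.Println("queued after consumer ACK:", len(s.queued)) // 0
}
```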

Tasks:

  • Add spooling to disk feature: Introduce spooling to disk #6581
  • Add documentation: Add file spool to queue docs #6902
  • Fix exported queue monitoring metrics
  • Telemetry on configured queue type
  • Add new IO metrics
  • (optional) Support for a Write-Ahead-Log file to reduce the number of costly fsync operations: External Write Ahead Log with relaxed guarantees go-txfile#25
  • Testing:
    • libbeat end-to-end test for growing/shrinking existing spool files (see TestResizeFile in go-txfile)
    • Separate, regularly run stress testing (related: MacOS X Panic when running test for github.com/elastic/beats/libbeat/publisher/queue/spool #8490). See: libbeat/scripts/cmd/stress_pipeline and libbeat/publisher/pipeline/stress.
    • Check the spool file does not break if the disk does not have enough space to finish a write transaction:
      • Windows NTFS
      • MacOS
      • Ext3
      • Ext4
      • XFS
      • btrfs
    • Improve unit test coverage:
      • Failing IO operations
      • Full queue blocked -> unblock if events are ACKed
      • Shutdown with/without pending events
      • Check ACK signals are sent when buffers are flushed
      • ACK loop correctly combines ACK counts if a previous ACK IO operation failed
      • Flush timeout for producer/consumer part of spool
      • Test support/encoding of timestamps: Spool to disk not working with time.Time fields #10099 (consider a special encoding for timestamps so the Go type can be recovered when parsing)
  • Resilience improvements:
    • Correctly handle go-txfile errors to prevent potential deadlocks
    • (optional) Introduce a per-event checksum. Without a checksum, parsing might fail anyway.
    • Introduce per event page checksum. (Bump queue version + support for reading old/new queue)
    • Optional startup check that queue linking is not broken (all pages are reachable)
    • Try to repair by checking/reusing the second-to-last transaction state
    • Reclaim unreachable pages if no existing on-disk transaction can be recovered.
    • Top-level queue of queues -> allow co-existence of old and new event schemas on upgrades + reduce the amount of data lost if on-disk structures are broken.
  • Debugging support
    • CLI tool to report file internals/structure/metrics
    • CLI tool to print all events in queue to JSON
    • (optional) special Beat command/CLI tool to drain the spool file to ES/Logstash
  • Reported issues to be investigated:
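The per-event checksum proposed under the resilience improvements could be sketched as follows (illustrative record layout, not the actual go-txfile/spool format): prepend a CRC32-Castagnoli checksum to each serialized event and verify it on read, so corrupt events are detected up front instead of failing later during parsing.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeEvent prepends a CRC32 checksum to the serialized event.
// Layout: [4-byte little-endian checksum][payload].
func encodeEvent(payload []byte) []byte {
	buf := make([]byte, 4+len(payload))
	binary.LittleEndian.PutUint32(buf[:4], crc32.Checksum(payload, castagnoli))
	copy(buf[4:], payload)
	return buf
}

// decodeEvent verifies the checksum; a mismatch means the record is
// corrupt and should be skipped (or repaired) instead of parsed.
func decodeEvent(rec []byte) ([]byte, error) {
	if len(rec) < 4 {
		return nil, fmt.Errorf("record too short")
	}
	want := binary.LittleEndian.Uint32(rec[:4])
	payload := rec[4:]
	if crc32.Checksum(payload, castagnoli) != want {
		return nil, fmt.Errorf("checksum mismatch")
	}
	return payload, nil
}

func main() {
	rec := encodeEvent([]byte(`{"message":"hello"}`))
	if p, err := decodeEvent(rec); err == nil {
		fmt.Println("ok:", string(p))
	}
	rec[5] ^= 0xff // flip a payload bit to simulate on-disk corruption
	if _, err := decodeEvent(rec); err != nil {
		fmt.Println("detected:", err)
	}
}
```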
