This repository was archived by the owner on Sep 21, 2023. It is now read-only.

Investigate changing the pipeline order to improve performance #44

@cmacknz

Description

Let's consider the case where the shipper is configured to use a disk queue with the Elasticsearch output. Let's also assume we use the default protobuf encoding over gRPC. If we reuse the existing structure of the beats publishing pipeline, the data flow will look like:

```mermaid
flowchart LR
    A[Input] -->|Protobuf| B[Server]
    B --> C[Processors]
    C -->|CBOR| D[Disk Queue]
    D -->|JSON| E[Elasticsearch]
```

The diagram shows that the data must be serialized multiple times:

  1. To the protobuf wire format when the input sends events to the shipper using gRPC. This could optionally be replaced with JSON, but we would likely still need to deserialize it regardless.
  2. To CBOR when writing to the disk queue.
  3. To JSON when writing to Elasticsearch.

It seems extremely worthwhile to restructure the pipeline to reduce the number of times the data must be serialized:

```mermaid
flowchart LR
    A[Input] -->|Protobuf| B[Server]
    B -->|Protobuf| C[Disk Queue]
    C --> D[Processors]
    D -->|JSON| E[Elasticsearch]
```

In this case we would change the disk queue's serialization format to protobuf, deferring deserialization until after data has been read from the queue. This leaves us with a single transformation: from protobuf to the shipper's internal data format, and then to JSON (or whatever encoding the output requires).

If the memory queue were used instead of the disk queue, we could use the same strategy of storing serialized events in the memory queue and only decoding them when they are read from the queue. This would give us a way to deterministically calculate the number of bytes stored in the memory queue. Currently the memory queue size must be specified in events.

The output of this issue should be a proof of concept demonstrating that this reordering of the pipeline is possible and has the expected benefits. At minimum the work will need to include:

  1. Modify the gRPC server in the shipper so it stops deserializing messages, allowing them to be passed directly to the queue. The ideal option is to keep the existing RPC definitions but implement a no-op codec; see the gRPC encoding documentation. We may need to write a custom set of RPC handlers instead of generating them:

     ```go
     func _Producer_PublishEvents_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
     ```

     A fallback option is to use messages that simply wrap a bytes payload, with the required message type and serialization documented in the RPC call.
  2. Modify the disk queue to use protobuf serialization. At minimum this depends on Libbeat disk queue should support custom serialization #41 and possibly some of the work for [Meta][Feature] Implement encrypted disk queue #33 to use the new disk queue headers.
  3. Ensure we can still return errors back to clients (after deserialization or processing, for example). [Meta][Feature] Implement end to end acknowledgement #9 should provide a mechanism for this.
  4. Benchmark the performance of the modified pipeline and compare it to the original configuration. We do not have a set of repeatable performance tests yet, so we may choose to defer this work until we do.
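The no-op codec idea from step 1 can be sketched as below. To keep the example self-contained it defines the `Name`/`Marshal`/`Unmarshal` method set of gRPC's `encoding.Codec` interface locally instead of importing `google.golang.org/grpc/encoding`; in a real server the codec would be registered with `encoding.RegisterCodec`, and `rawFrame` is a hypothetical carrier type, not part of the shipper.

```go
package main

import "fmt"

// rawFrame carries a message's wire bytes without decoding them.
type rawFrame struct{ data []byte }

// passthroughCodec mirrors the method set of gRPC's encoding.Codec
// (Name/Marshal/Unmarshal). Instead of running protobuf, it hands the
// payload through untouched so the server can write it straight to the
// queue. This is a sketch of the idea, not a drop-in implementation.
type passthroughCodec struct{}

func (passthroughCodec) Name() string { return "raw" }

func (passthroughCodec) Marshal(v interface{}) ([]byte, error) {
	f, ok := v.(*rawFrame)
	if !ok {
		return nil, fmt.Errorf("passthrough codec: unexpected type %T", v)
	}
	return f.data, nil
}

func (passthroughCodec) Unmarshal(data []byte, v interface{}) error {
	f, ok := v.(*rawFrame)
	if !ok {
		return fmt.Errorf("passthrough codec: unexpected type %T", v)
	}
	f.data = append([]byte(nil), data...) // copy: the caller may reuse the buffer
	return nil
}

func main() {
	var c passthroughCodec
	var f rawFrame
	// Bytes arrive from the wire and leave for the queue unchanged.
	if err := c.Unmarshal([]byte{0x0a, 0x03, 'f', 'o', 'o'}, &f); err != nil {
		panic(err)
	}
	out, err := c.Marshal(&f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes passed through\n", len(out))
}
```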
