
Set up benchmarks in CI/CD #560

@norberttech

Description

As the project grows, it becomes easier to miss performance degradations, like the one that happened here: #558

We should think about creating a set of benchmarks for each adapter/core/lib and executing those benchmarks in CI/CD so we can at least manually check which PRs introduced bottlenecks.
The ideal solution would be to store those performance benchmarks as workflow artifacts after a merge to 1.x and compare them with benchmarks from newly opened PRs.
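
A minimal sketch of what that could look like as a GitHub Actions workflow, assuming benchmarks run on both 1.x pushes and PRs - the benchmark command and the choice of dawidd6/action-download-artifact for pulling the baseline from the 1.x run are placeholders, not a decision:

```yaml
# .github/workflows/benchmarks.yml - hypothetical sketch
name: Benchmarks

on:
  push:
    branches: ["1.x"]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - run: composer install --no-progress --no-interaction

      # Placeholder - whichever runner we pick (phpbench, a custom script, ...)
      - run: vendor/bin/phpbench run --report=aggregate | tee benchmark.txt

      # After a merge to 1.x, keep the results around as the baseline
      - if: github.ref == 'refs/heads/1.x'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-baseline
          path: benchmark.txt

      # On PRs, pull the latest baseline produced by the 1.x run for comparison
      - if: github.event_name == 'pull_request'
        uses: dawidd6/action-download-artifact@v3
        with:
          workflow: benchmarks.yml
          branch: 1.x
          name: benchmark-baseline
          path: baseline/
```

The actual diffing step is left out here - that could be anything from eyeballing the two reports in the job output to failing the build above some threshold.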


I was thinking about benchmarking each building block separately, for example (a rough sketch of one such benchmark follows the list):

  1. Extractors - we could come up with a dataset schema, save it in all supported file formats, and benchmark extraction alone, without doing any operations on the dataset.
  2. Transformers - since we reduced the number of transformers, keeping only the critical ones, we might want to start with at least the most frequently used, like the one that evaluates expressions. Here, I think we can take a similar approach, but instead of using extractors we can pass prepared Rows directly to the transformer and measure the performance of the transformations themselves.
  3. Expressions - just like with Transformers, but here we don't even need Rows; a single Row should be enough.
  4. Loaders - similarly to Transformers, prepare Rows and load them into the destination directly.
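
To make that a bit more concrete, here is a rough sketch of what one such building-block benchmark could look like with phpbench (assuming phpbench is the runner we pick; the ExpressionBench class name, the row counts, and the plain-array stand-in for prepared Rows are made up for illustration - the real thing would build actual Flow Rows and evaluate a real expression):

```php
<?php

declare(strict_types=1);

namespace Flow\ETL\Tests\Benchmark;

use PhpBench\Attributes as Bench;

// Hypothetical sketch - benchmarks evaluating a single "expression" against
// pre-built rows, with no extractor or loader involved.
final class ExpressionBench
{
    /** @var array<int, array{id: int, name: string}> */
    private array $rows = [];

    public function setUp(array $params) : void
    {
        // Stand-in for prepared Rows - the real benchmark would build
        // Rows/Entries through the Flow DSL instead of plain arrays.
        for ($i = 0; $i < $params['rows']; $i++) {
            $this->rows[] = ['id' => $i, 'name' => 'name_' . $i];
        }
    }

    public function provideRowCounts() : \Generator
    {
        yield '1k rows' => ['rows' => 1_000];
        yield '10k rows' => ['rows' => 10_000];
    }

    #[Bench\BeforeMethods('setUp')]
    #[Bench\ParamProviders('provideRowCounts')]
    #[Bench\Revs(10)]
    #[Bench\Iterations(5)]
    public function bench_expression_eval(array $params) : void
    {
        foreach ($this->rows as $row) {
            // Placeholder workload - swap in the real expression evaluation here.
            $row['id'] + 1;
        }
    }
}
```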

Those are very granular benchmarks that test each building block in isolation, providing clear insight into every element. However, on top of that, I would probably still try to benchmark entire Pipelines on a selected subset of the most frequently used extractors/loaders/transformers (we would need to develop a few scenarios here).
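
As a side note, if phpbench ends up being the runner, its baseline feature might already cover most of the "store after merge, compare on PR" flow - roughly like this (the tag name is arbitrary):

```sh
# on 1.x after merge: run and store the result under a tag
vendor/bin/phpbench run --report=aggregate --tag=branch_1x

# on a PR: run again and report against the stored tag
vendor/bin/phpbench run --report=aggregate --ref=branch_1x
```

As far as I remember, tagged runs are kept in phpbench's local storage directory, so in CI that directory would still have to travel between the 1.x and PR workflows as an artifact.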
