
Macrobenchmarks versus microbenchmarks #579

@jonkeane

Description


This has come up in a few places recently, and I think having a dedicated space to talk about these issues might help. I'm not sure there's much of a to-do here (yet, possibly ever), but this is a good place to have a discussion.

Conbench is designed to work with both micro- and macrobenchmarks. Internally we treat them almost identically (though our runners sometimes treat them quite differently).[^1]

So far our distinction is as follows (note that many of these points refer to the Arrow benchmarks, but they are applicable to other implementations, though of course each setup is unique):

| Micro benchmarks | Macro benchmarks |
| --- | --- |
| Measures a small (ideally tiny!) piece of code | Tends to be a larger chunk of code (or at least code that touches more layers of our stack) |
| Uses relatively small inputs | Uses relatively large inputs |
| Iterations are very quick (milli- or nanoseconds) | Iterations can be much longer (seconds, minutes) |
| Typically measures that code a massive number of times | Typically measures that code a smaller number of times |
| Generally reports aggregated data (though note that we do not currently pull all the aggregate details into Conbench yet!) | Reports each individual iteration |
| Is very good for targeted developer attention (e.g. "this benchmark got 50% slower, it's touching this specific code, figure out what happened") | Is very good for top-line user-facing numbers (e.g. "you can ingest this parquet file 2x faster on this common dataset than you could 2 years ago!") |
| Can and should be very tuned to the code being executed | Is (sometimes) purposefully designed to be a generic external task that can be compared across implementations (TPC benchmarks, other industry standards) |
| Tends to be in lower-level languages (e.g. C++) | Tends to be in higher-level languages (e.g. Python, R) |
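To make the measurement/reporting distinction in the table concrete, here is a minimal sketch of the two styles in Python. The function names and shapes are illustrative only (they are not the Conbench API): the micro style times a tiny operation a massive number of times and reports only aggregates, while the macro style runs a larger task a handful of times and keeps every individual iteration.

```python
import statistics
import time
import timeit

def run_micro(n_iterations=100_000):
    # Micro style: repeat a tiny piece of code many times per sample,
    # then report aggregated statistics rather than raw iterations.
    samples = timeit.repeat("sorted(range(100))", repeat=5, number=n_iterations)
    per_call = [s / n_iterations for s in samples]
    return {"mean": statistics.mean(per_call), "stdev": statistics.stdev(per_call)}

def run_macro(task, n_iterations=3):
    # Macro style: run a larger task a small number of times and
    # report each individual iteration's wall-clock time.
    timings = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings
```

The "generally reports aggregated data" row above corresponds to `run_micro` returning only summary statistics, while `run_macro` returns the full list of per-iteration timings.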

But @jgehrcke has mentioned a few times (very rightfully! one example) that there can be a tension in the macrobenchmarks between being helpful for driving development (targeted, not impacted by "uninteresting" things like disk speed, easily repeatable many times) and some of the properties of the macrobenchmarks above (most specifically, the external validity that makes macrobenchmarks good for user-facing questions of performance).

I wonder if we might benefit from thinking about these as more of a continuum, or possibly even defining a midpoint between them ("minibenchmarks", "micro-style macro benchmarks"; or call those "macro benchmarks" and what's in the table above "integration benchmarks" or "jumbo benchmarks"). There are most certainly reasons why we might want a targeted benchmark in Python that exercises one and only one (smallish) section of code, without confounders like disk I/O, to show exactly where something has slowed down and help drive development. Those don't fit neatly into macrobenchmarks as described in the table above: they have features of both. But having both of these styles is important (for different purposes, some of which are listed in the table above).

Footnotes

[^1]: Treating them all the same internally is not a hard requirement, but it is what we have done so far (and it has a lot of benefits in that we don't have to implement multiple things for the same concepts, or introduce indirection with inheritance, both programmatic and conceptual).
