Skip to content

Make statistics optional#977

Merged
ahrtr merged 1 commit intoetcd-io:mainfrom
nspcc-dev:optional-statistics
Jun 15, 2025
Merged

Make statistics optional#977
ahrtr merged 1 commit intoetcd-io:mainfrom
nspcc-dev:optional-statistics

Conversation

@roman-khimov
Copy link
Contributor

See #967 for motivation and test used.

@k8s-ci-robot
Copy link

Hi @roman-khimov. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ahrtr
Copy link
Member

ahrtr commented Jun 11, 2025

Your original PR have multiple change. I am not sure how significant this single change has on the performance. Do you have any data for comparison?

@roman-khimov
Copy link
Contributor Author

It's all in the commit message, abb5797

@roman-khimov
Copy link
Contributor Author

I can also add nspcc-dev@53975e3 here, it's not related to the original problem (reader lock contention), but still can be beneficial for writers.

@ahrtr
Copy link
Member

ahrtr commented Jun 11, 2025

abb5797

I am a little confused, in your test case, there is 30ms sleep in each readonly TXN, so any duration should be >= 30ms, but you got some XXus data.

@roman-khimov
Copy link
Contributor Author

It's not ms, it's us, in the code (56bb52b) that's time.Sleep(30 * time.Microsecond). So yes, we can't have times less than 30 microseconds and we don't have them. The test isn't perfect, it's just that I wasn't creative enough to write code doing come calculations in func(tx *bolt.Tx) error for similar time. But it's sufficient for the purpose and most importantly #967 did improve things for real application (but the most interesting patch from it is already somewhat reworked for a better version, it'll follow after this PR, see nspcc-dev@0a70458 if interested).

@ahrtr
Copy link
Member

ahrtr commented Jun 11, 2025

It's not ms, it's us, in the code (56bb52b) that's time.Sleep(30 * time.Microsecond)

Ah, thx for the clarification.

@ahrtr
Copy link
Member

ahrtr commented Jun 11, 2025

/ok-to-test

cc @Elbehery @fuweid @ivanvc @tjungblu

Comment on lines -39 to -43
// Put `stats` at the first field to ensure it's 64-bit aligned. Note that
// the first word in an allocated struct can be relied upon to be 64-bit
// aligned. Refer to https://pkg.go.dev/sync/atomic#pkg-note-BUG. Also
// refer to discussion in https://github.com/etcd-io/bbolt/issues/577.
stats Stats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to say this, but did you read the comment? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to say this, but did you read the comment? :)

It changes it to a pointer. So I think it should be fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you guys have a raspi or any other arm32 device sitting around to validate this? I'm not so worried about the stats pointer, but rather the sync.Once or sync.Pool could become problematic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just quickly ran the tests on your branch on my pi - so everything works as intended. Sorry I'm in the middle of a move, luckily I had it plugged in and could ssh into it. Not sure whether you have your QEMU setup still somewhere...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you guys have a raspi or any other arm32 device sitting around to validate this?

qemu-user-static might be good enough to verify this. #577 (comment)

I think we might need to setup a workflow to guarantee it won't be broken in future.

but rather the sync.Once or sync.Pool could become problematic.

The known issue https://pkg.go.dev/sync/atomic#pkg-note-BUG seems only specific to sync.atomic package. But I agree that we should verify it anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to setup a workflow to guarantee it won't be broken in future.

cc @ivanvc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to setup a workflow to guarantee it won't be broken in future.

On it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx @ivanvc pls refer to #665

@tjungblu
Copy link
Contributor

generally looks good to me and I would even vouch for changing the default to disable it and reap the benefit in etcd - but this needs a bit more refactor over there, we use the Stats() in a few places and they are not really nil safe 😄

@Elbehery
Copy link
Member

generally looks good to me and I would even vouch for changing the default to disable it and reap the benefit in etcd - but this needs a bit more refactor over there, we use the Stats() in a few places and they are not really nil safe 😄

thus was my question above 👍🏽

@roman-khimov
Copy link
Contributor Author

changing the default to disable it

Works for me if there is a general consensus on this. It's just that the intention was to keep the default behavior unchanged. Otherwise I think half a year after release with this change someone will come here and tell that Stats() is broken (even though as I've said in abb5797, I don't expect a lot of users for this API).

we use the Stats() in a few places and they are not really nil safe

Stats() interface is still the same.

@ahrtr
Copy link
Member

ahrtr commented Jun 12, 2025

Summary & comments:

  • Please do not change the default value of NoStatistics, because it's a breaking change. Even we insist on changing it, we should only do it in a major release (i.e v2.0).
    • I don't think we've decided whether to release v2.0 or v1.5 as the next major or minor release, as it depends on the amount of effort or refactoring we put into the project.
  • Please verify this PR using qemu-user-static per Make statistics optional #977 (comment), although @tjungblu already verified it separately (thx @tjungblu )

db.go Outdated
Comment on lines +1353 to +1355
// NoStatistics turns off statistics collection, Stats method will
// return empty structure in this case. This can be beneficial for
// performance in some cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NoStatistics turns off statistics collection, Stats method will
// return empty structure in this case. This can be beneficial for
// performance in some cases.
// NoStatistics turns off statistics collection, Stats method will
// return empty structure in this case. This can be beneficial for
// performance under high-concurrency read-only transactions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

I think most Bolt users never care about this data, so we're just wasting
time for nothing. This is also one of the exclusive locks that we have on
the View() path. While this patch doesn't change much on its own, because
the other lock is still here (subject to a different patch), once that lock
is removed the difference in concurrent View() test is pretty clear. With
NoStatistics=false:

workers samples min             avg             50%             80%             90%             max
1       10      123.905µs       969.042µs       1.062529ms      1.065585ms      1.071537ms      1.071537ms
10      100     34.636µs        178.176µs       89.7µs          110.439µs       943.753µs       1.055165ms
100     1000    31.79µs         280.166µs       51.358µs        526.992µs       1.034306ms      2.47819ms
1000    10000   30.608µs        818.098µs       86.464µs        935.799µs       2.681115ms      10.595186ms
10000   100000  30.569µs        3.060826ms      64.132µs        6.56151ms       11.199984ms     64.855384ms

NoStatistics=true:

workers samples min             avg             50%             80%             90%             max
1       10      68.049µs        962.039µs       1.060335ms      1.064633ms      1.066087ms      1.066087ms
10      100     34.846µs        315.346µs       90.943µs        862.499µs       1.00516ms       1.08366ms
100     1000    31.45µs         225.53µs        36.88µs         236.63µs        939.115µs       1.466286ms
1000    10000   30.539µs        207.383µs       43.643µs        110.841µs       408.146µs       5.689001ms
10000   100000  30.488µs        152.603µs       39.636µs        90.622µs        145.266µs       9.28235ms

The default behavior is kept for compatibility. In future the option can be
extended to avoid collecting transaction statistics as well.

Now that stats is a pointer we can also revert a part of 26f89a5
and make the structure cleaner.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
@roman-khimov roman-khimov force-pushed the optional-statistics branch from abb5797 to 87d0cf7 Compare June 12, 2025 10:42
@roman-khimov
Copy link
Contributor Author

verify this PR using qemu-user-static

Works fine as in #577 (comment)

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

thx for working on this.

tx.db.statlock.Unlock()
if tx.db.stats != nil {
tx.db.statlock.Lock()
tx.db.stats.FreePageN = freelistFreeN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if we can use atomic to update stats here. so no more lock and no more extra option for openning db.
anyway, changes look good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if we can use atomic to update stats here

I think using atomic functions is problematic, since it only guarantees atomic of one operation, but it doesn't guarantee atomic of multiple operations.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, roman-khimov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit fdf3825 into etcd-io:main Jun 15, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

7 participants