On demand heartbeats via --heartbeat_on_demand_duration, used by the tablet throttler#10198

Merged: shlomi-noach merged 17 commits into vitessio:main from planetscale:heartbeat-by-demand on May 9, 2022

Conversation

@shlomi-noach
Contributor

Description

Fixes #10196; see that issue for a general description of the problem and the suggested fix.

This PR introduces a new command line flag, --heartbeat_on_demand_duration. By default it's zero, which means it's disabled and has no effect (no change in behavior).

When positive, e.g. --heartbeat_on_demand_duration=5s, and assuming --heartbeat_enable is set, the tablet does not start generating heartbeats on startup. Instead, it generates heartbeats only on demand, under a lease that lasts for the configured duration.
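To illustrate the zero-versus-positive semantics, here is a minimal sketch using Go's standard `flag` package. Only the two flag names come from this PR; the helper `parseOnDemandDuration`, the flag set name, and the help strings are hypothetical and are not how vttablet registers its flags.

```go
package main

import (
	"flag"
	"time"
)

// parseOnDemandDuration parses a vttablet-style argument list and returns the
// configured on-demand duration and whether on-demand mode would be active.
// Hypothetical helper: only the flag names are taken from the PR.
func parseOnDemandDuration(args []string) (time.Duration, bool, error) {
	fs := flag.NewFlagSet("vttablet-sketch", flag.ContinueOnError)
	enable := fs.Bool("heartbeat_enable", false, "write heartbeats (illustrative help text)")
	onDemand := fs.Duration("heartbeat_on_demand_duration", 0,
		"lease length for on-demand heartbeats; 0 disables (illustrative help text)")
	if err := fs.Parse(args); err != nil {
		return 0, false, err
	}
	// On-demand mode applies only when heartbeats are enabled and the
	// duration is positive; zero keeps the pre-existing behavior.
	return *onDemand, *enable && *onDemand > 0, nil
}
```

Calling it with `[]string{"--heartbeat_enable", "--heartbeat_on_demand_duration=5s"}` yields a 5s duration with on-demand mode active; omitting the second flag leaves the duration at zero and on-demand mode off.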

In this PR the tablet throttler is currently the only entity which asks the heartbeats writer to generate on demand heartbeats. Right now, the tablet throttler makes such requests any time anyone runs a throttler check. Normally, no one does, and so the throttler does not make any request for heartbeats, and so no heartbeats are generated.

However, whenever a vreplication workflow begins, or whenever an Online DDL operation begins, the throttler is called (assuming --enable_lag_throttler is enabled, of course), and in turn the throttler asks the heartbeat writer to generate heartbeats.

Once the workflow or migration completes, and assuming no other workflows or migrations are running, the throttler stops asking for heartbeats, the lease expires, and the heartbeat writer ceases to write heartbeats.

The effect is that we keep the binary logs small, as we don't write heartbeats when those are not strictly needed.
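The wiring between the throttler and the heartbeat writer can be sketched roughly as follows. This is a toy model, not the code in this PR: the interface `heartbeatRequester`, the types `countingWriter` and `throttler`, and the lag fields are all made up for illustration. The only point it shows is that every throttler check first asks for heartbeats, so with no checks there are no requests.

```go
package main

// heartbeatRequester is a hypothetical interface for whatever component
// writes heartbeats; the PR's actual types may differ.
type heartbeatRequester interface {
	RequestHeartbeats()
}

// countingWriter stands in for the heartbeat writer and only counts requests.
type countingWriter struct{ requests int }

func (w *countingWriter) RequestHeartbeats() { w.requests++ }

// throttler is a toy lag throttler that holds a reference to the requester.
type throttler struct {
	heartbeats   heartbeatRequester
	lagSeconds   float64 // last measured replication lag (illustrative)
	thresholdSec float64 // lag above this value means "throttle"
}

// Check is what a workflow or migration would call before writing.
// Every check first asks for heartbeats, so lag can be measured while
// someone actually cares about it; it then reports whether lag is acceptable.
func (t *throttler) Check() bool {
	t.heartbeats.RequestHeartbeats()
	return t.lagSeconds <= t.thresholdSec
}
```

With no callers of `Check`, `requests` stays at zero, which corresponds to the idle case where no heartbeats reach the binary logs.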

As for tests, I've added --heartbeat_on_demand_duration=5s to all endtoend/onlineddl tests where --enable_lag_throttler is set; a couple of other endtoend tests are left without --heartbeat_on_demand_duration.

Related Issue(s)

#10196

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach added the labels Type: Enhancement (logical improvement, somewhere between a bug and feature), Component: Cluster management, and release notes on May 3, 2022
@github-actions
Contributor

github-actions bot commented May 3, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has the correct release notes label. release notes none should only be used for PRs that are so trivial that they need not be included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@rohit-nayak-ps (Member) previously approved these changes May 3, 2022

lgtm


The default value for `--heartbeat_on_demand_duration` is zero, which means the flag is not set and there is no change in behavior.

When `--heartbeat_on_demand_duration` has a positive value, then heartbeats are only injected on demand, per internal requests. For example, when `--heartbeat_on_demand_duration=5s`, the tablet starts without injecting heartbeats. An internal module, like the lag throttle, may request the heartbeat writer for heartbeats. Starting at that point in time, and for the duration of `5s` in our example, the tablet will write heartbeats. If no other requests come in during that duraiton, then the tables then ceases to write heartbeats. If more requests for heartbeats come while heartbeats are being written, then the tablet extends the heartbeat duration for the next `5s` following up each request. Thus, it stops writing heartbeats `5s` after the last request is received.
Member

duraiton=>duration
"the tables then ceases" => tablet

Contributor Author

fixed
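To make the lease-extension rule in the quoted text concrete (heartbeats stop one full duration after the last request), here is a small clock-injected sketch. The type `onDemandLease` and the helper `simulate` are hypothetical and exist only for this illustration; the real writer is not shown in this thread. The sketch also assumes --heartbeat_enable is set, so a zero duration means heartbeats are always written.

```go
package main

import (
	"sync"
	"time"
)

// onDemandLease is a hypothetical model of the lease bookkeeping.
type onDemandLease struct {
	mu       sync.Mutex
	duration time.Duration // value of --heartbeat_on_demand_duration
	expires  time.Time     // heartbeats are written while now is before this
}

// Request extends the lease to now+duration. It is a no-op when the flag is unset.
func (l *onDemandLease) Request(now time.Time) {
	if l.duration <= 0 {
		return
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	l.expires = now.Add(l.duration)
}

// Active reports whether heartbeats would be written at time now.
func (l *onDemandLease) Active(now time.Time) bool {
	if l.duration <= 0 {
		return true // assumption: flag unset means unchanged, always-on behavior
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	return now.Before(l.expires)
}

// simulate replays requests (ascending offsets in seconds) and reports, for
// each probe offset, whether the lease is active once all requests up to and
// including that offset have been applied.
func simulate(durationSec int, requests, probes []int) []bool {
	start := time.Unix(0, 0)
	at := func(sec int) time.Time { return start.Add(time.Duration(sec) * time.Second) }
	out := make([]bool, 0, len(probes))
	for _, p := range probes {
		l := &onDemandLease{duration: time.Duration(durationSec) * time.Second}
		for _, r := range requests {
			if r <= p {
				l.Request(at(r))
			}
		}
		out = append(out, l.Active(at(p)))
	}
	return out
}
```

For example, with a 5s duration and requests at t=10s and t=13s, `simulate(5, []int{10, 13}, []int{0, 12, 17, 18})` reports inactive at 0s, active at 12s and 17s, and inactive again at 18s, i.e. 5s after the last request.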


The throttler now checks in with the heartbeat writer to request heartbeats, any time it (the throttler) is asked for a check.

When `--heartbeat_on_demand_duration` is not set, there is now change in behavior.
Member

now=>no

Contributor Author

fixed

}
// In this function we're going to create a timer to activate heartbeats by-demand. Creating a timer has a cost.
// Now, this function can be spammed by clients (the lag throttler). We therefore only allow this function to
// actually run once per X seconds (1/4 of onDemandDuration as a reasonable oversampling value):
Member

❤️
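For readers curious about the rate limiting described in that code comment, here is one possible way to get the same effect. It is a sketch under assumptions, not the mechanism used in this PR (which is not visible in this thread): it compares timestamps with an injected clock instead of relying on timers, and `requestGate`, `newRequestGate`, and `countAllowed` are hypothetical names.

```go
package main

import (
	"sync"
	"time"
)

// requestGate lets at most one call through per interval, so the expensive
// part of a request (creating the lease timer) is not repeated on every call.
type requestGate struct {
	mu       sync.Mutex
	interval time.Duration
	next     time.Time // earliest instant at which the next call is let through
}

// newRequestGate uses a quarter of the on-demand duration as the interval,
// following the 1/4 oversampling value mentioned in the code comment.
func newRequestGate(onDemandDuration time.Duration) *requestGate {
	return &requestGate{interval: onDemandDuration / 4}
}

// Allow reports whether a request arriving at time now should do real work.
func (g *requestGate) Allow(now time.Time) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if now.Before(g.next) {
		return false // a recent call already renewed the lease
	}
	g.next = now.Add(g.interval)
	return true
}

// countAllowed replays calls at the given millisecond offsets and counts how
// many of them pass the gate.
func countAllowed(onDemandMs int, offsetsMs []int) int {
	g := newRequestGate(time.Duration(onDemandMs) * time.Millisecond)
	start := time.Unix(0, 0)
	n := 0
	for _, ms := range offsetsMs {
		if g.Allow(start.Add(time.Duration(ms) * time.Millisecond)) {
			n++
		}
	}
	return n
}
```

With a 5s duration the gate interval is 1.25s, so `countAllowed(5000, []int{0, 100, 200, 1249, 1250, 1300, 2500})` lets through the calls at 0ms, 1250ms, and 2500ms only. Under continuous demand the lease is still renewed at least once every quarter of its length, so in this sketch it never lapses while requests keep arriving.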

@rohit-nayak-ps dismissed their stale review May 3, 2022 10:16

@shlomi: do you need to add the heartbeat module to git?

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

I think there's a legitimate CI failure in tabletmanager_throttler after switching it to --heartbeat_on_demand_duration. Will probably not be able to fix this today, and so it's next week for me.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…emand

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

OK, solved the failing test - which actually validated the changes here. Added more testing to validate the changes.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

some more legit failures. Will only look at these next week.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…he change to create _vt.schema_migrations. This is a simple fix to ensure table exists before issuing any query from QueryExecutor

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

Good for a renewed review. Fixed tests. I believe I reduced some flakiness, or at least did not introduce new ones!

@rohit-nayak-ps (Member) left a comment

lgtm

Labels

Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)

Development

Successfully merging this pull request may close these issues.

Suggestion: Vttablet: on demand lag throttler heartbeats to reduce binary log volume

2 participants