
Heartbeat Lag with Aurora blocking cutover #1081

@myusuf3

Description


Hello everyone!

I am currently running a long-running migration on Aurora MySQL 5.7, on a copy of my production database. With Aurora-style replication (meaning not actual binlog replication) I have successfully run it on smaller tables, but when I run a migration on the scale of 20-24 hours I keep hitting a HeartbeatLag that always increases and never recovers once it starts to drift.

Here is a sample output from a recent run.

[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:137795900 to mysql-bin-changelog.011973
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:0 to mysql-bin-changelog.011973
Copy: 111620000/904947480 12.3%; Applied: 0; Backlog: 0/1000; Time: 3h11m30s(total), 3h11m30s(copy); streamer: mysql-bin-changelog.011973:41926866; Lag: 0.02s, HeartbeatLag: 871.02s, State: migrating; ETA: 22h41m3s
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:138079395 to mysql-bin-changelog.011974
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:0 to mysql-bin-changelog.011974
Copy: 111880000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m0s(total), 3h12m0s(copy); streamer: mysql-bin-changelog.011974:23055116; Lag: 0.02s, HeartbeatLag: 892.02s, State: migrating; ETA: 22h41m0s
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:139227492 to mysql-bin-changelog.011975
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:0 to mysql-bin-changelog.011975
Copy: 112140000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m30s(total), 3h12m30s(copy); streamer: mysql-bin-changelog.011975:2853151; Lag: 0.02s, HeartbeatLag: 911.92s, State: migrating; ETA: 22h40m56s
Copy: 112400000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m0s(total), 3h13m0s(copy); streamer: mysql-bin-changelog.011975:121871428; Lag: 0.02s, HeartbeatLag: 931.22s, State: migrating; ETA: 22h40m52s
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:148195656 to mysql-bin-changelog.011976
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:0 to mysql-bin-changelog.011976
Copy: 112650000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m30s(total), 3h13m30s(copy); streamer: mysql-bin-changelog.011976:92692665; Lag: 0.02s, HeartbeatLag: 951.42s, State: migrating; ETA: 22h40m56s
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:140138087 to mysql-bin-changelog.011977
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:0 to mysql-bin-changelog.011977
Copy: 112910000/904947480 12.5%; Applied: 0; Backlog: 0/1000; Time: 3h14m0s(total), 3h14m0s(copy); streamer: mysql-bin-changelog.011977:71588220; Lag: 0.02s, HeartbeatLag: 971.02s, State: migrating; ETA: 22h40m51s

I have previously taken this same migration all the way to the cut-over phase, and it just blocked there because HeartbeatLag was so high by the time the copy completed. It is really odd: once HeartbeatLag kicks in, it simply never returns to lower values, increasing from then on out.

I have also set the binlog retention to 72 hours.

call mysql.rds_set_configuration('binlog retention hours', 72);
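For anyone reproducing this, the retention value can be read back to confirm it took effect (`mysql.rds_show_configuration` is the standard RDS/Aurora counterpart to the set call above):

```sql
-- Should report binlog retention hours = 72 after the call above.
CALL mysql.rds_show_configuration;
```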

I even tried to forcibly trigger the cut-over, and gh-ost replied that nothing is blocking the cut-over.

I have followed all the recommended Aurora settings aside from running the migration on master directly.

  • Database backup retention is greater than 1 day, to enable binlogs.
  • The parameter aurora_enable_repl_bin_log_filtering is set to 0.
  • I have checked the clocks to make sure there was no drift.
  • Replication is row based.
  • I have also run this same command on smaller migrations without issue.
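For reference, this is roughly how I verified the row-based replication setting on the writer instance (a quick sanity check, not specific to Aurora):

```sql
-- gh-ost streams row events, so binlog_format should be ROW;
-- binlog_row_image FULL is the MySQL 5.7 default and what gh-ost expects.
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW GLOBAL VARIABLES LIKE 'binlog_row_image';
```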

Here is the command I ran.

 gh-ost \
--max-load=Threads_running=100 \
--critical-load=Threads_running=1000 \
--chunk-size=10000 \
--throttle-control-replicas="" \
--max-lag-millis=1500 \
--user="" \
--password="" \
--host=db \
--allow-on-master \
--database="" \
--table="" \
--verbose \
--alter="" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--exact-rowcount \
--concurrent-rowcount \
--default-retries=120 \
--panic-flag-file=/tmp/ghost.panic.flag \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
--initially-drop-ghost-table \
--execute
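As I understand it, gh-ost writes periodic heartbeat rows into its changelog table and HeartbeatLag is the age of the latest heartbeat it has read back off the binlog stream, so a growing value suggests either the writes or the reads are stalling. A hedged way to check whether the heartbeats are at least being written (the table name `_mytable_ghc` is a placeholder; substitute `_<your_table>_ghc`):

```sql
-- Latest heartbeat gh-ost has written; 'value' holds the timestamp.
-- If last_update keeps advancing, the writes are fine and the lag is
-- on the binlog-reading side.
SELECT id, hint, value, last_update
FROM _mytable_ghc
WHERE hint = 'heartbeat'
ORDER BY id DESC
LIMIT 1;
```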

Any guidance from other aurora users would be appreciated.
