[fix](cloud-mow) schema change should retry when encountering TXN_CONFLICT in cloud mode #46748
Conversation
run buildall
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TeamCity be ut coverage result:
run buildall
TPC-H: Total hot run time: 32783 ms
TPC-DS: Total hot run time: 188789 ms
ClickBench: Total hot run time: 31.59 s
TeamCity be ut coverage result:
be/src/cloud/cloud_meta_mgr.cpp
Outdated
    res->status().msg());
} else if (res->status().code() ==
           MetaServiceCode::KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES) {
    return Status::Error<ErrorCode::DELETE_BITMAP_LOCK_ERROR, false>(
Why return this error for every RPC? DELETE_BITMAP_LOCK_ERROR is only used for RPCs related to the delete bitmap lock.
Already removed; the case is now detected in a different way. @zhannngchen
run buildall
run buildall
TPC-H: Total hot run time: 32603 ms
TPC-DS: Total hot run time: 194536 ms
TeamCity be ut coverage result:
ClickBench: Total hot run time: 31.15 s
zhannngchen
left a comment
LGTM
PR approved by at least one committer and no changes requested.
dataroaring
left a comment
LGTM
…T in cloud mode (#46748)
For MoW (merge-on-write) tables, a schema change may encounter TXN_CONFLICT because two tablets try to modify the delete bitmap lock at the same time, which can cause the schema change to fail, so a retry should be added in FE.
### What problem does this PR solve?

A cloud heavy schema change (sc) job retries the whole alter task when it encounters the `KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES` error in `commit_tablet_job` (#46748). We should remove the stop token (#48399) in MS for the sc job if it fails in `commit_tablet_job`; otherwise later retries may fail to register the stop token (because the first stop token won't expire within `config::lease_compaction_interval_seconds * 4 = 80s`), and the schema change job will fail.

```
I20250318 15:40:15.851157 7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:40:31.346628 6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:40:31.346635 6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:40:31.350860 6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:40:31.350906 6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:40:31.350916 6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:40:31.382493 6496 segment_creator.cpp:308] tablet_id:1742283174829, flushing rowset_dir: , rowset_id:020000000000038f6644fd8945079d22209de0cad6c7e5b8, data size:73808, index size:3289
I20250318 15:40:31.385416 6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=3|alter_version=2
I20250318 15:40:31.387535 6496 cloud_storage_engine.cpp:894] successfully register compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
I20250318 15:40:31.388285 6496 cloud_schema_change_job.cpp:439] alter table for mow table, calculate delete bitmap of incremental rowsets without lock, version: 3-2 new_table_id: 1742283174829
I20250318 15:40:31.391326 6496 cloud_schema_change_job.cpp:460] alter table for mow table, calculate delete bitmap of incremental rowsets with lock, version: 3-2 new_tablet_id: 1742283174829
I20250318 15:40:31.392035 6496 cloud_storage_engine.cpp:915] successfully unregister compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
W20250318 15:40:39.947554 6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[DELETE_BITMAP_LOCK_ERROR]txn conflict when commit tablet job idx { table_id: 1742283165243 index_id: 1742283165244 partition_id: 1742283165242 tablet_id: 1742283170711 } schema_change { initiator: "172.20.56.12:9050" id: "1742283173457" new_tablet_idx { table_id: 1742283165243 index_id: 1742283173458 partition_id: 1742283165242 tablet_id: 1742283174829 } txn_ids: 610474243072 alter_version: 2 num_output_rowsets: 1 num_output_segments: 1 size_output_rowsets: 77097 num_output_rows: 611 output_versions: 2 output_cumulative_point: 2 delete_bitmap_lock_initiator: 6632031443518271970 index_size_output_rowsets: 3289 segment_size_output_rowsets: 73808 }
I20250318 15:40:46.204162 7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:41:07.487172 6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:41:07.487183 6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:41:07.489440 6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:41:07.489511 6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:41:07.489523 6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:41:07.490249 6496 cloud_schema_change_job.cpp:285] Rowset [2-2] has already existed in tablet 1742283174829
I20250318 15:41:07.490275 6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=2|alter_version=2
W20250318 15:41:07.490864 6496 cloud_compaction_stop_token.cpp:89] failed to register compaction stop token|job_id=a018587a-c12f-4926-9d7e-514ff9d88457|delete_bitmap_lock_initiator=1847151139249560285|tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
W20250318 15:41:07.490897 6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
```
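The stop-token problem described in this commit message can be illustrated with a small model. This is only a sketch under stated assumptions, not the MS implementation: `StopTokenRegistry`, `try_register`, and `remove` are hypothetical names, time is passed in as plain seconds, and the 80-second lifetime is taken from the `lease_compaction_interval_seconds * 4 = 80s` figure quoted above. It shows why a retry that arrives with a new initiator is rejected while the first token is still alive, and why removing the token after a failed commit unblocks the retry.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical in-memory stand-in for the stop-token bookkeeping in MS.
// Names and semantics are assumptions made for illustration only.
class StopTokenRegistry {
public:
    explicit StopTokenRegistry(int64_t expire_seconds) : expire_seconds_(expire_seconds) {
        assert(expire_seconds > 0);
    }

    // Registers a token for `tablet_id` at time `now` (seconds). Fails while
    // a different initiator still holds an unexpired token for that tablet.
    bool try_register(int64_t tablet_id, int64_t initiator, int64_t now) {
        auto it = tokens_.find(tablet_id);
        if (it != tokens_.end() && it->second.initiator != initiator &&
            now < it->second.expires_at) {
            return false;
        }
        tokens_[tablet_id] = Token{initiator, now + expire_seconds_};
        return true;
    }

    // Removes the token only if it belongs to `initiator`.
    void remove(int64_t tablet_id, int64_t initiator) {
        auto it = tokens_.find(tablet_id);
        if (it != tokens_.end() && it->second.initiator == initiator) {
            tokens_.erase(it);
        }
    }

private:
    struct Token {
        int64_t initiator;
        int64_t expires_at;
    };
    int64_t expire_seconds_;
    std::map<int64_t, Token> tokens_;
};
```

In the log above, the retry starts about 36 seconds after the first registration and carries a different `delete_bitmap_lock_initiator`, so in this model it would be rejected unless the failed attempt had called `remove` first.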
…he#49275)
### What problem does this PR solve?
A cloud heavy schema change (sc) job retries the whole alter task when it encounters the `KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES` error in `commit_tablet_job` (apache#46748). We should remove the stop token (apache#48399) in MS for the sc job if it fails in `commit_tablet_job`; otherwise later retries may fail to register the stop token (because the first stop token won't expire within `config::lease_compaction_interval_seconds * 4 = 80s`), and the schema change job will fail.
…in cloud mode (apache#50705)
### What problem does this PR solve?
The retry added in apache#46748 should also apply to rollup tasks.
For MoW (merge-on-write) tables, a schema change may encounter TXN_CONFLICT because two tablets try to modify the delete bitmap lock at the same time, which can cause the schema change to fail, so a retry should be added in FE.
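The retry idea in this description can be sketched as a bounded loop that re-runs a task only when it reports a transaction conflict. This is a minimal illustration of the control flow, not the FE code from this PR (FE is written in Java, and its actual retry logic is not shown on this page): `TaskResult`, `run_with_conflict_retry`, and the `max_attempts` parameter are all hypothetical names introduced here.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result codes; the real status types are not shown in this PR page.
enum class TaskResult { kOk, kTxnConflict, kOtherError };

// Runs `task` up to `max_attempts` times in total. Only a transaction
// conflict triggers another attempt; success or any other error is
// returned immediately. The number of attempts made is reported through
// `attempts_used` when that pointer is non-null.
inline TaskResult run_with_conflict_retry(const std::function<TaskResult()>& task,
                                          int max_attempts,
                                          int* attempts_used = nullptr) {
    assert(max_attempts >= 1);
    TaskResult result = TaskResult::kOtherError;
    int attempt = 0;
    while (attempt < max_attempts) {
        ++attempt;
        result = task();
        if (result != TaskResult::kTxnConflict) {
            break;  // success or a non-retryable error
        }
    }
    if (attempts_used != nullptr) {
        *attempts_used = attempt;
    }
    return result;
}
```

Limiting the retry to the conflict code keeps unrelated failures visible instead of masking them behind repeated attempts; a caller that exhausts `max_attempts` still receives `kTxnConflict` and can fail the job.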
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)