Skip to content

MPPTask::runImpl is not exception safe #2322

@JaySon-Huang

Description

@JaySon-Huang

Steps to reproduce: checkout this commit: JaySon-Huang@597dd86

I've added a failpoint exception_during_mpp_write_err_to_tunnel to mock this bug.

create table t (a int primary key clustered);
insert into test.t values (1),(2),(3),(4);
set tidb_enforce_mpp=1

-- 1. With no failpoint injected, running this SQL is OK, MPPTask can be released as expected
tidb> select * from (select * from t) a join (select * from t) b;
ch>  select * from dt_tables where tidb_database='test' and tidb_table='t' \G
...
tidb_database:                            test
tidb_table:                               t
...
storage_stable_oldest_snapshot_lifetime:  0 <-- the snapshot will be released as expected

-- 2. With failpoint "exception_during_mpp_non_root_task_run" enabled
ch> DBGInvoke enable_fail_point(exception_during_mpp_non_root_task_run);
-- MPPTask can be released as expected too
tidb> select * from (select * from t) a join (select * from t) b;
ch>  select * from dt_tables where tidb_database='test' and tidb_table='t' \G
...
tidb_database:                            test
tidb_table:                               t
...
storage_stable_oldest_snapshot_lifetime:  0 <-- the snapshot will be released as expected

-- 3. With failpoint "exception_during_mpp_non_root_task_run" and "exception_during_mpp_write_err_to_tunnel" enabled
ch> DBGInvoke enable_fail_point(exception_during_mpp_non_root_task_run);
ch> DBGInvoke enable_fail_point(exception_during_mpp_write_err_to_tunnel);
-- MPPTask can not be released, it will hold the resources of IStorage, and cause other trouble, such as:
--   TiFlash stucks after applying DDL operations
--   DeltaTree can not run GC on its delta data
tidb> select * from (select * from t) a join (select * from t) b;
ch>  select * from dt_tables where tidb_database='test' and tidb_table='t' \G
...
tidb_database:                            test
tidb_table:                               t
...
storage_stable_oldest_snapshot_lifetime:  45.19153881 <-- the snapshot will not be released
...

And we can see that ~MPPTask() is not called in case 3, there is no logging like "finish MPPTask", while case 1 and case 2 have.

JaySon-Huang@c4ba52f#diff-31a1170d1da0609fde7eb7068713665415014352587f2256fd15a29fe581b3caR302-R336
I guess these changes can resolve part of the problem, cause MPPTunnel::close won't throw exception like MPPTunnel::write when the tunnel get closed.
However, if other exceptions are thrown in MPPTunnel::close, then MPPTask still gets stuck and can not release resources of the Storage layer.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions