Skip to content

Conversation

@bobhan1
Copy link
Contributor

@bobhan1 bobhan1 commented Oct 12, 2023

Proposed changes

This PR fix the alignment process during publish phase when conflict occurs during concurrent partial updates: if we encounter a row with the same key and larger value in sequence column, it means that there exists another load which introduces a row with the same keys and larger sequence column value published successfully after the commit phase of the current load. We should act as follows:

  • If the columns we update include sequence column, we should delete the current row becase the partial update on the current row has been overwritten by the previous one with larger sequence column value.
  • Otherwise, we should combine the values of the missing columns in the previous row and the values of the including columns in the current row into a new row.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 12, 2023

run buildall

@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from 4c04b73 to cb55e9a Compare October 12, 2023 06:58
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 12, 2023

run buildall

dataroaring
dataroaring previously approved these changes Oct 12, 2023
Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 12, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from cb55e9a to b8def92 Compare October 12, 2023 11:17
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 12, 2023

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Oct 12, 2023
@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from b8def92 to 9f82ffc Compare October 13, 2023 11:19
@bobhan1 bobhan1 changed the title [Coverage](merge-on-write) add concurrent partial updates case for row store table [Fix](merge-on-write) Correct the alignment process when the table has sequence column and add cases Oct 13, 2023
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 13, 2023

run buildall

@bobhan1 bobhan1 requested a review from zhannngchen October 13, 2023 11:22
zhannngchen
zhannngchen previously approved these changes Oct 13, 2023
Copy link
Contributor

@zhannngchen zhannngchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 13, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from 9f82ffc to bbf535d Compare October 16, 2023 01:54
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from bbf535d to f3a1729 Compare October 16, 2023 03:09
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 16, 2023

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Oct 16, 2023
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.02% (8284/22376)
Line Coverage: 29.07% (66450/228614)
Region Coverage: 27.75% (34473/124230)
Branch Coverage: 24.38% (17521/71878)
Coverage Report: http://coverage.selectdb-in.cc/coverage/de1190a37b147592c2a5a260370549840943ce0d_de1190a37b147592c2a5a260370549840943ce0d/report/index.html

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.1 seconds
stream load tsv: 554 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 29.2 seconds inserted 10000000 Rows, about 342K ops/s
storage size: 17162117359 Bytes

@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 17, 2023

run p0

1 similar comment
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 17, 2023

run p0

@bobhan1 bobhan1 force-pushed the coverage-row-store-concurrent-partial-update branch from de1190a to 3246ed2 Compare October 17, 2023 09:12
@bobhan1
Copy link
Contributor Author

bobhan1 commented Oct 17, 2023

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.01% (8280/22374)
Line Coverage: 29.05% (66408/228561)
Region Coverage: 27.74% (34461/124213)
Branch Coverage: 24.36% (17508/71884)
Coverage Report: http://coverage.selectdb-in.cc/coverage/3246ed229fde9be1b599ca58cb8a7840f0ac7d90_3246ed229fde9be1b599ca58cb8a7840f0ac7d90/report/index.html

@doris-robot
Copy link

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.11 seconds
stream load tsv: 556 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 64 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17161999736 Bytes

Copy link
Contributor

@zhannngchen zhannngchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhannngchen
Copy link
Contributor

run p0

@zhannngchen zhannngchen merged commit 64aeeb9 into apache:master Oct 18, 2023
xiaokang pushed a commit that referenced this pull request Oct 23, 2023
…s when the table has sequence column and add cases #25346" (#25789)
dutyu pushed a commit to dutyu/doris that referenced this pull request Oct 28, 2023
…s sequence column and add cases (apache#25346)

This PR fix the alignment process during publish phase when conflict occurs during concurrent partial updates: if we encounter a row with the same key and larger value in sequence column, it means that there exists another load which introduces a row with the same keys and larger sequence column value published successfully after the commit phase of the current load. We should act as follows:

- If the columns we update include sequence column, we should delete the current row becase the partial update on the current row has been overwritten by the previous one with larger sequence column value.
- Otherwise, we should combine the values of the missing columns in the previous row and the values of the including columns in the current row into a new row.
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
…s sequence column and add cases (apache#25346)

This PR fix the alignment process during publish phase when conflict occurs during concurrent partial updates: if we encounter a row with the same key and larger value in sequence column, it means that there exists another load which introduces a row with the same keys and larger sequence column value published successfully after the commit phase of the current load. We should act as follows:

- If the columns we update include sequence column, we should delete the current row becase the partial update on the current row has been overwritten by the previous one with larger sequence column value.
- Otherwise, we should combine the values of the missing columns in the previous row and the values of the including columns in the current row into a new row.
dataroaring pushed a commit that referenced this pull request Jul 18, 2025
…update (#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert #25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 18, 2025
…update (apache#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert apache#25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 18, 2025
…update (apache#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert apache#25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 21, 2025
…update (apache#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert apache#25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 22, 2025
…update (apache#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert apache#25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
bobhan1 added a commit to bobhan1/doris that referenced this pull request Jul 23, 2025
…update (apache#41701)

1. Fix incorrect result when there are insert after delete for same key
in one load. In flexible partial update, if there is an insert after a
delete on the same key in one batch, the final result will be always
delete that row in table, which is incorrect. The reason is that when
doing aggregation for rows with same keys in memtable, we treat delete
sign column as an ordinary column. So if there is a row with delete sign
in that batch, the final aggregated row will always have a delete sign
and any insert on the same key after it will be lost.
In the above situation, we should keep those rows and don't merge them
when doing aggregation in memtable. When flush the memtable to segment,
we should first apply the semantic of delete if exists and then insert
the row after it.
2. fix flexible partial update don't use row store column to read data
3. Fix flexible partial update incorrect read index when resolving
conflict in publish phase
4. add some default value cases and fix on update current timestamp
5. fix publish phase alignment problem when table has sequence column
and the rowset to read has multi segments.
6. add some publish alignment case when table has sequence column
7. ~revert apache#25346, we use the
sequence value in flush phase as its final value and will not do
alignment in publish phase if the partial update load doesn't specified
sequence column.~
8. fix auto-increment column value is wrongly updated when it's not
specified.
9. Fix the problem that the value of sequence column may decrease.
10. Only read seq col and seq map col when the previous row is deleted
in partial update and flexible partial update when doing alignment.
11. Fix the value of auto-increment column is not consistent between
replicas.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.0.3-merged merge_conflict reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants