planner, statistics: build the global statistics for the partition table #22472
Reminiscent wants to merge 8 commits into pingcap:master
…o global-level stats"
|
Please follow PR Title Format:
Or if the count of mainly changed packages are more than 3, use
|
| } else { | ||
| e.tasks = append(e.tasks, b.buildAnalyzeIndexPushdown(task, v.Opts, autoAnalyze)) | ||
| } | ||
| e.tasks = append(e.tasks, b.buildAnalyzeIndexPushdown(task, v.Opts, autoAnalyze)) |
TODO: Need more investigation on when to use buildAnalyzeFastIndex.
| type AnalyzeTableID struct { | ||
| PersistID int64 | ||
| CollectIDs []int64 | ||
| PersistID int64 | ||
| // FatherID just used in the partition table. | ||
| // It represents the ID of the table to which the partition belongs. | ||
| FatherID int64 | ||
| } |
Version 1 (the 'static-only' mode): no global-level stats; each partition has its own stats.
Version 2: only global-level stats, no partition-level stats.
Version 3 (this PR): both partition-level and global-level stats; the global-level stats are merged from the partition-level stats.
Here is an example to make the meaning of these IDs clearer.
Suppose we have table t (ID = 1), and t has three partitions: p1 (ID = 2), p2 (ID = 3) and p3 (ID = 4).
In version 1, the analyze request has three tasks: AnalyzeTableID{PersistID = 2, CollectIDs = {2}}, AnalyzeTableID{PersistID = 3, CollectIDs = {3}}, AnalyzeTableID{PersistID = 4, CollectIDs = {4}}.
In version 2, the analyze request has only one task: AnalyzeTableID{PersistID = 1, CollectIDs = {2, 3, 4}}.
In version 3, the analyze request has three tasks: AnalyzeTableID{PersistID = 2, FatherID = 1}, AnalyzeTableID{PersistID = 3, FatherID = 1}, AnalyzeTableID{PersistID = 4, FatherID = 1}.
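The three task layouts in this example can be written out as a small Go sketch. It is only an illustration: the helper names `tasksV1`, `tasksV2` and `tasksV3` are hypothetical and not part of the PR, and the struct keeps `CollectIDs` and `FatherID` side by side so that all three versions fit in one program. The field comments reflect my reading of the example, not the PR's documentation.

```go
package main

import "fmt"

// AnalyzeTableID follows the struct in the diff. All three fields are kept
// together here purely for illustration.
type AnalyzeTableID struct {
	PersistID  int64   // assumed: ID under which the collected stats are saved
	CollectIDs []int64 // assumed: IDs whose data is scanned (versions 1 and 2)
	FatherID   int64   // ID of the table the partition belongs to (version 3)
}

// tasksV1 (hypothetical helper): one task per partition, saved per partition.
func tasksV1(partitionIDs []int64) []AnalyzeTableID {
	tasks := make([]AnalyzeTableID, 0, len(partitionIDs))
	for _, pid := range partitionIDs {
		tasks = append(tasks, AnalyzeTableID{PersistID: pid, CollectIDs: []int64{pid}})
	}
	return tasks
}

// tasksV2 (hypothetical helper): a single task that collects every partition
// and saves the result under the table ID.
func tasksV2(tableID int64, partitionIDs []int64) []AnalyzeTableID {
	ids := append([]int64(nil), partitionIDs...) // copy so callers keep ownership
	return []AnalyzeTableID{{PersistID: tableID, CollectIDs: ids}}
}

// tasksV3 (hypothetical helper): one task per partition that also records
// the ID of the owning table.
func tasksV3(tableID int64, partitionIDs []int64) []AnalyzeTableID {
	tasks := make([]AnalyzeTableID, 0, len(partitionIDs))
	for _, pid := range partitionIDs {
		tasks = append(tasks, AnalyzeTableID{PersistID: pid, FatherID: tableID})
	}
	return tasks
}

func main() {
	const tableID = int64(1)
	partitions := []int64{2, 3, 4}
	fmt.Printf("version 1: %+v\n", tasksV1(partitions))
	fmt.Printf("version 2: %+v\n", tasksV2(tableID, partitions))
	fmt.Printf("version 3: %+v\n", tasksV3(tableID, partitions))
}
```

In version 3 the per-partition tasks look like version 1, except that each one carries `FatherID`, which is what lets the results be grouped by table for the later merge.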
refine the code and deal with some TODO items
| "IndexReader_8 2.80 root index:IndexRangeScan_7", | ||
| "└─IndexRangeScan_7 2.80 cop[tikv] table:t3, partition:p1, index:k(v) range:[3,3], keep order:false", |
TODO: Need more investigation on whether it is possible to use partition-level stats when the where condition touches only a single partition.
|
The tests fail for now because the merge functions for the other statistics, such as the histogram and TopN, are not finished yet; we need to wait until they are all completed. |
| func (t *Table) getColumnStatsInfo(colID int64) (*Histogram, *CMSketch, *TopN) { | ||
| colStatsInfo := t.Columns[colID] | ||
| return colStatsInfo.Histogram.Copy(), colStatsInfo.CMSketch.Copy(), colStatsInfo.TopN.Copy() | ||
| } | ||
|
|
||
| func (t *Table) getIndexStatsInfo(idxID int64) (*Histogram, *CMSketch, *TopN) { | ||
| idxStatsInfo := t.Indices[idxID] | ||
| return idxStatsInfo.Histogram.Copy(), idxStatsInfo.CMSketch.Copy(), idxStatsInfo.TopN.Copy() | ||
| } | ||
|
|
||
| // GetStatsInfo returns their statistics according to the ID of the column or index, including histogram, CMSketch and TopN. | ||
| func (t *Table) GetStatsInfo(ID int64, isIndex int) (*Histogram, *CMSketch, *TopN) { | ||
| if isIndex == 0 { | ||
| return t.getColumnStatsInfo(ID) | ||
| } | ||
| return t.getIndexStatsInfo(ID) | ||
| } | ||
|
|
getColumnStatsInfo and getIndexStatsInfo are simple and not used by other functions directly, so how about merging these 3 functions into 1?
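A minimal sketch of what the merged function might look like, assuming simplified stand-in types: `Histogram`, `CMSketch`, `TopN`, `Column`, `Index` and `Table` below are stubs written for this example, not the real `statistics` package types. Returning nils for an unknown ID is also an assumption added here; the snippet in the diff does not show how a missing entry is handled.

```go
package main

import "fmt"

// Stand-in types for illustration only; the real types carry far more state.
type Histogram struct{ NDV int64 }
type CMSketch struct{ Count uint64 }
type TopN struct{ Values []uint64 }

// Each Copy is nil-safe so that a missing sketch does not panic.
func (h *Histogram) Copy() *Histogram {
	if h == nil {
		return nil
	}
	c := *h
	return &c
}

func (c *CMSketch) Copy() *CMSketch {
	if c == nil {
		return nil
	}
	cp := *c
	return &cp
}

func (t *TopN) Copy() *TopN {
	if t == nil {
		return nil
	}
	return &TopN{Values: append([]uint64(nil), t.Values...)}
}

// Column and Index are stubs for the per-column and per-index stats entries.
type Column struct {
	Histogram *Histogram
	CMSketch  *CMSketch
	TopN      *TopN
}

type Index struct {
	Histogram *Histogram
	CMSketch  *CMSketch
	TopN      *TopN
}

type Table struct {
	Columns map[int64]*Column
	Indices map[int64]*Index
}

// GetStatsInfo folds the two private helpers into one function. It returns
// copies of the histogram, CMSketch and TopN for the given column or index ID.
func (t *Table) GetStatsInfo(ID int64, isIndex int) (*Histogram, *CMSketch, *TopN) {
	if isIndex == 0 {
		col, ok := t.Columns[ID]
		if !ok {
			return nil, nil, nil // assumption: unknown IDs yield nils
		}
		return col.Histogram.Copy(), col.CMSketch.Copy(), col.TopN.Copy()
	}
	idx, ok := t.Indices[ID]
	if !ok {
		return nil, nil, nil // assumption: unknown IDs yield nils
	}
	return idx.Histogram.Copy(), idx.CMSketch.Copy(), idx.TopN.Copy()
}

func main() {
	tbl := &Table{
		Columns: map[int64]*Column{1: {Histogram: &Histogram{NDV: 10}, TopN: &TopN{Values: []uint64{7, 8}}}},
		Indices: map[int64]*Index{1: {Histogram: &Histogram{NDV: 5}}},
	}
	hg, cms, topN := tbl.GetStatsInfo(1, 0)
	fmt.Println(hg.NDV, cms == nil, topN.Values)
	idxHg, _, _ := tbl.GetStatsInfo(1, 1)
	fmt.Println(idxHg.NDV)
}
```

The two branches stay separate because column and index entries are distinct types, so the gain is mostly fewer one-line wrappers rather than less logic.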
| return | ||
| } | ||
| tableInfo := partitionTable.Meta() | ||
| partitionStats, err := h.tableStatsFromStorage(tableInfo, partitionID, false, nil) |
Should we load it from cache first?
| type GlobalStats struct { | ||
| num int | ||
| count int64 | ||
| hg []*statistics.Histogram | ||
| cms []*statistics.CMSketch | ||
| topN []*statistics.TopN | ||
| } |
How about exposing this struct's fields and removing its methods, which may make it clearer?
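To make the suggestion concrete, here is a rough sketch of the struct with exported fields and a caller that fills it directly. The stub types and the `mergeCounts` helper are hypothetical. The sketch assumes that `Count` is a row count and that the global row count is the sum of the partition row counts; merging histograms, CMSketches and TopN is deliberately left out.

```go
package main

import "fmt"

// Stub types for illustration only; they stand in for the statistics types.
type Histogram struct{}
type CMSketch struct{}
type TopN struct{}

// GlobalStats with exported fields, so callers read and write them directly
// instead of going through accessor methods.
type GlobalStats struct {
	Num   int
	Count int64
	Hg    []*Histogram
	Cms   []*CMSketch
	TopN  []*TopN
}

// mergeCounts is a hypothetical helper. It only accumulates row counts,
// assuming the global count is the sum of the per-partition counts.
func mergeCounts(partitionCounts []int64) GlobalStats {
	var g GlobalStats
	for _, c := range partitionCounts {
		g.Count += c
	}
	return g
}

func main() {
	g := mergeCounts([]int64{10, 20, 30})
	fmt.Println("global row count:", g.Count)
}
```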
| succ = false | ||
| break | ||
| } | ||
| err = statsHandle.SaveStatsToStorage(info.tableID, globalStatsCount, info.isIndex, hg, cms, topN, info.statsVersion, 1) |
Should we update the cache after saving?
|
To make it easier to review, we will split this PR into multiple sub-PRs. |
| reqBuilder := builder.SetHandleRangesForTables(e.ctx.GetSessionVars().StmtCtx, e.tableID.CollectIDs, e.handleCols != nil && !e.handleCols.IsInt(), ranges, nil) | ||
| reqBuilder := builder.SetHandleRangesForTables(e.ctx.GetSessionVars().StmtCtx, []int64{e.tableID.FatherID}, e.handleCols != nil && !e.handleCols.IsInt(), ranges, nil) |
Why do we use FatherID rather than PersistID here? I'm a bit confused. Thanks~
Please see this PR. It has some differences with this PR. Thanks~
What problem does this PR solve?
Issue Number: related: #18551
Problem Summary:
Build the global statistics for partition tables when we execute the `analyze` statement.
What is changed and how it works?
Proposal (in Chinese)
What's Changed (in the `Dynamic-Only` mode):
In the original implementation, we build the analyze task in the "collect multiple partitions and save as a table" mode. You can see PR#19846, PR#19899 and PR#20271 for more details.
In this PR, we first build the `analyze` task in the "collect a partition, save a partition" mode, and then merge the partition-level stats that belong to the same partition table to get the global-level stats.
We will use the changed variable `analyzeTableID` for more explanation. You can see this comment for more details.
How it Works (in the `Dynamic-Only` mode):
For the `analyze` task, we build one task for every partition and record the ID of the table to which the partition belongs.
Related changes
Check List
Tests
Side effects
Release note