Histograms are a special type of column statistics. Data is stored in a series of ordered buckets to describe the statistical distribution features of the column. The optimizer can estimate a more accurate number of rows based on a histogram.
By default, the optimizer considers that data is evenly distributed in a column and estimates the number of rows based on this feature. However, in real scenarios, data distribution is uneven in most tables. A histogram helps the optimizer estimate the number of rows more accurately.
In the OceanBase Database optimizer, column histogram statistics are stored in the ALL_TAB_HISTOGRAMS, DBA_TAB_HISTOGRAMS, and USER_TAB_HISTOGRAMS views and include the following information:
Applicability
At present, OceanBase Database Community Edition does not support the ALL_TAB_HISTOGRAMS and USER_TAB_HISTOGRAMS views.
Basic information of the histogram, including
tenant_id,table_id,partition_id, andcolumn_idStatistics type of the histogram, where the information level can be
GLOBAL,PARTITION, orSUBPARTITIONAmount of data accumulated per bucket in the histogram (including the sum of the current bucket and its previous buckets)
Maximum value in each bucket in the histogram
Frequency of the maximum value in each bucket in the histogram
Types of histograms
The optimizer of OceanBase Database supports three types of histograms: frequency histograms, TopK histograms, and hybrid histograms. In a frequency histogram, each distinct column value corresponds to a single bucket in the histogram. The number of buckets specified must be no less than the NDV value of the column. A TopK histogram is a variant of the frequency histogram. Based on the Lossy Counting algorithm, it estimates the overall data distribution by calculating the feature of some data at a ratio of no less than 1 - (1/bucket_size). A hybrid histogram is set based on a specified amount of data collected. Unlike frequency histograms and TopK histograms, in a hybrid histogram, a bucket may contain multiple distinct values. The collected data records are divided into segments whose number is the same as that of buckets. Data records in each segment are placed in the corresponding bucket, so that a minimum number of buckets are used to describe the distribution of data at a larger volume. The maximum value in each bucket is endpoint_value. endpoint_repeat_cnt records the frequency of endpoint_value.
Selection of histograms
The OceanBase Database optimizer selects the histogram type according to the values of NDV, bucket_size, and p.
The metrics are described as follows:
NDV: specifies the number of distinct values in a column.bucket_size: specifies the number of buckets for the histogram. Default value: 254.p: specifies the minimum percentage threshold expected for the TopK histogram. Calculation formula:(1 - (1/bucket_size)) * 100. If the default value 254 is used, the corresponding minimum percentage threshold is 99.6.