Determine the columnar storage format
OceanBase Database supports columnstore tables, hybrid rowstore-columnstore tables, and columnstore indexes to adapt to different scenarios. The syntax is as follows:
Create a columnstore table
CREATE TABLE tbl1_cg (col1 INT PRIMARY KEY, col2 VARCHAR(50)) WITH COLUMN GROUP(each column);Create a hybrid rowstore-columnstore table
CREATE TABLE tbl2_cg (col1 INT PRIMARY KEY, col2 VARCHAR(50)) WITH COLUMN GROUP(all columns, each column);The table supports both row-based storage and columnar storage (with one redundant data replica). The optimizer chooses whether to process data based on row-based storage or columnar storage based on costs.
Create a columnstore index
CREATE TABLE tbl3(id INT, name VARCHAR(18), date DATE, PRIMARY KEY (id), INDEX idx1_tbl3_cg (date) WITH COLUMN GROUP(each column));
The primary table uses row-based storage, and the index uses columnar storage.
Columnar storage is commonly used for analytical processing (AP). In queries involving a large amount of data, columnar storage is advantageous over row-based storage with less data scanned, filter pushdown, and the skip index attribute. In queries involving a small amount of data, such as single-row reads by primary key and small-range scans by index, row-based storage achieves fewer access I/Os and supports faster table access by index primary key. You can choose row-based storage or columnar storage based on whether your business involves typical TP or AP.
If your business involves both frequent simple queries and complex large-scale data analysis, you can use hybrid rowstore-columnstore tables or create columnstore indexes on rowstore tables. You can determine whether to use row-based storage or columnar storage based on the following criteria:
For an SQL query, you can estimate its I/Os in table access to determine whether row-based storage or columnar storage is more suitable. Access to a table in row-based storage mode requires at least one I/O operation, and that in columnar storage mode requires one I/O operation per column. In a large-range scan where the number of I/O operations required for row-based storage is greater than or equal to that required for columnar storage, row-based storage has no advantages over columnar storage, and you can use columnar storage instead. I/Os are measured by microblocks. One I/O operation involves 16 KB to 64 KB of data. Here, it is assumed that 64 KB of data is involved in each I/O operation. Assume that a table contains N_TAB_COL columns, the average column length is 8 bytes, and the SQL query is to access N_ACCESS_COL columns and N_ROW rows. In this case, we recommend that you use columnar storage under the following condition:
N_ROW > N_ACCESS_COL * 64K / (8 * N_TAB_COL)
namely,
N_ROW > 8192 * N_ACCESS_COL / N_TAB_COL
Here is an example:
CREATE TABLE t1 (id INT, id2 INT, c1 INT, c2 INT, c3 VARCHAR(256), c4 VARCHAR(256), c5 VARCHAR(256), PRIMARY KEY(id, id2));
SELECT c1, SUM(c2) FROM t1 WHERE id > 0 AND id < 1000 AND c3 > 'abc' GROUP BY c4;
In this example, N_TAB_COL is 7. The SQL query accesses the id, c1, c2, c3, and c4 columns, and therefore N_ACCESS_COL is 5. Assume that the length of each column is 8 bytes. We recommend that you use columnar storage when N_ROW is greater than 8192 × 5/7 within the id range of (0, 1000), namely when more than 6,000 rows are to be scanned.
Simply put, columnar storage is suitable for the following scenarios:
- More than 1,000 rows of a wide table with more than 64 columns are to be scanned.
- More than 8,000 rows of a non-wide table are to be scanned.
If row-based storage store or columnar storage is suitable for all major SQL queries in your business, you can directly choose row-based storage or columnar storage. If your business involves considerable quantities of both SQL queries that prefer row-based storage and SQL queries that prefer columnar storage, you can create a hybrid rowstore-columnstore table. If your business involves only a few SQL queries that prefer columnar storage and that concern a few columns, you can create a rowstore table with columnstore indexes.
Partitioning
OceanBase Database controls data distribution across servers by partition. A partition is the minimum unit of data distribution. Considering possible data skew between partitions and future scalability, you can set the number of partitions to twice the total number of CPU cores of all servers in a zone.
AP usually involves multi-dimensional analytical queries. If you want to distribute data across servers but cannot create partitions suitable for all queries by any dimension, you can select a column as the partitioning key for HASH or KEY partitioning. The column must meet the following requirements:
- The number of distinct values (NDV) in the column is much greater than the number of partitions.
- No data skew or only slight data skew occurs in the column.
- Preferably, the column contains integer or datetime data. Otherwise, the column can contain varchar or char data.
Usually, recent data is queried, with the latest data queried most frequently. For example, you can import business flow data by day and analyze the latest transactions. In this case, you can use a datetime column as the partitioning key for LIST or RANGE partitioning and then perform HASH partitioning. Here is an example:
CREATE TABLE t1 (c1 DATE, c2 INT, c3 INT)
PARTITION BY RANGE COLUMNS (c1)
SUBPARTITION BY HASH (c2) SUBPARTITIONS 32
(
PARTITION p0 VALUES LESS THAN ('2024-01-01'),
PARTITION p1 VALUES LESS THAN ('2024-02-01'),
PARTITION p2 VALUES LESS THAN ('2024-03-01'),
PARTITION p3 VALUES LESS THAN ('2024-04-01'),
PARTITION p4 VALUES LESS THAN ('2024-05-01'),
PARTITION p5 VALUES LESS THAN ('2024-06-01'),
PARTITION p6 VALUES LESS THAN ('2024-07-01'),
PARTITION p8 VALUES LESS THAN ('2024-08-01'),
PARTITION p9 VALUES LESS THAN ('2024-09-01')
)
WITH COLUMN GROUP(EACH COLUMN);
The preceding statement creates a columnstore table that is first RANGE-partitioned with the datetime column as the partitioning key and then HASH-partitioned.