Frequently asked questions about columnar storage
What are columnar storage databases?
OceanBase columnar storage is a data storage format that stores static baseline data on disk in columnar format while handling data modifications in memory in row format. This balances scan performance and transaction processing capabilities.
For analytical queries, columnar storage can significantly improve query performance. It is an essential feature for OceanBase to achieve HTAP capabilities. In traditional AP databases, columnar data is typically static and difficult to update in place. However, in OceanBase's LSM Tree architecture, SSTables are static and naturally suitable for columnar storage. MemTables are dynamic and still use row-based storage, which does not affect transaction processing. This allows us to balance the performance of TP and AP queries to some extent.
What is a columnstore index?
OceanBase Database also supports the concept of columnar indexes, which is different from "building an index on columnar data." A columnar index refers to an index table structured in columnar format.
For example, if we have a row-based table t6 and want to sum the values in column c3 with the best performance, we can create a columnar index on c3:
create table t6(
c1 TINYINT,
c2 SMALLINT,
c3 MEDIUMINT
);
create /*+ parallel(2) */ index idx1 on t6(c3) with column group (each column);
In addition, we support more index creation methods:
- Redundant row-based data in the index:
create index idx1 on t1(c2) storing(c1) with column group(all columns, each column);
alter table t1 add index idx1 (c2) storing(c1) with column group(all columns, each column);
- Index-only scans:
create index idx1 on t1(c2) storing(c1) with column group(each column);
alter table t1 add index idx1 (c2) storing(c1) with column group(each column);
The purpose of using the STORING clause in database indexes is to store additional non-index column data in the index. This can optimize query performance by avoiding table lookups and reducing the cost of index sorting. When a query only needs to access the columns stored in the index and does not require a table lookup, it can significantly improve query efficiency.
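For instance, a query that touches only the indexed and stored columns can be served entirely from the index. The following is a minimal sketch; the table and column names are illustrative:

```sql
-- Hypothetical table: c2 is the index key, c1 is redundantly stored in the index
create table orders_demo(pk int primary key, c1 int, c2 int);
create index idx_c2 on orders_demo(c2) storing(c1) with column group(each column);
-- Both output columns live in idx_c2, so no table lookup (index back) is needed:
select c1, c2 from orders_demo where c2 between 10 and 100;
```

You can confirm with EXPLAIN that the plan scans idx_c2 with is_index_back=false.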
What is the difference between a columnar table and a columnar replica?
A columnar table refers to a table where the schema of both the leader and follower partitions is in columnar format, and OLAP queries can be strongly consistent reads. A columnar replica, on the other hand, is a type of replica where the schema of the leader partition is in columnar format, while the schema of the follower partitions remains in row-based format. OLAP queries on a columnar replica can only be weakly consistent reads.
Common questions about columnar tables and columnar replicas
What is the difference between a columnar table and a columnar replica? What is their relationship?
They are not the same concept.
- Columnar table: Refers to the storage format of a table. In a tenant's F (Full) or R (Read) replica, baseline data is stored in columnar format (with optional row-based, columnar, or hybrid row-column storage modes). In a columnar table, both the leader and follower partitions are in columnar format, and OLAP queries can be strongly consistent reads.
- Columnar replica (C replica): Refers to a replica type deployed in an independent zone. In this zone, baseline data of user tables is stored in columnar format. C replicas do not participate in Paxos elections or voting. They asynchronously follow the logs of F/R replicas and locally replay them. They provide only read-only and weakly consistent read services.
Relationship: They are independent. You can create a columnar table in a tenant without any C replicas (the table exists only in F/R replicas in columnar format). Conversely, in a tenant with C replicas, the F/R replicas store baseline data in row-based format, while the C replica stores the same table in columnar format for weakly consistent reads.
Is it necessary to have a columnar replica to create a columnar table? Can I create a columnar table on a Full replica?
No, you can create a columnar table on a Full (F) replica without a columnar replica.
A columnar table is a table-level storage format configuration. You can specify it using the WITH COLUMN GROUP clause when creating a table or by setting the tenant-level configuration item default_table_store_format. As long as the tenant has F/R replicas, you can create a columnar table or a hybrid row-column table in any locality (e.g., F@zone1, F@zone2). The data will be stored in columnar (or hybrid) format in the F/R replicas. A columnar replica (C replica) is an optional, independently deployed replica type used for physical isolation of TP/AP workloads and weakly consistent reads. It is not required to create a columnar table.
Can I directly create a pure columnar table in OceanBase Database? Or must I base it on a row-based table?
You can directly create a pure columnar table without basing it on a row-based table.
You can specify WITH COLUMN GROUP (each column) when creating a table to get a pure columnar table (only columnar baseline, no row-based baseline). For example:
CREATE TABLE t (pk INT PRIMARY KEY, c1 INT, c2 INT) WITH COLUMN GROUP (each column);
In a partitioned table, the primary key column is additionally stored in row-based format for point queries, but the table is still considered a pure columnar table. You can also create a row-based table first and then convert it to a columnar table using ALTER TABLE ... ADD COLUMN GROUP (each column); and other DDL operations. For more information, see "How to create a columnar table?" in this topic.
How is data consistency ensured between columnar and row-based storage? What is the latency if it is asynchronous?
It depends on the scenario:
- Columnar table (on F/R replicas): Baseline data is stored in columnar format, and incremental data is stored in row-based format (MemTable/dump). During reads, the incremental row-based data is real-time merged with the columnar baseline data, ensuring strong consistency with no additional latency.
- Columnar replica (C replica): C replicas do not participate in Paxos. They asynchronously follow the logs of F/R replicas and replay them locally, providing weakly consistent reads. The documentation does not specify exact latency figures; the actual latency depends on network conditions, workload, and log catch-up progress. As an observer, a C replica follows the logs in near real time, generally achieving sub-second or second-level latency depending on the environment. If strong consistency is required for analysis, use a columnar table or a hybrid row-column table on F/R replicas instead of reading from a C replica.
How can I separate TP and AP traffic? What disaster recovery options are available if AP fails?
Traffic separation:
- Without C replicas: TP and AP traffic share the same F/R replicas and ODP. The optimizer automatically selects the row-based or columnar path. Physical isolation is not possible.
- With C replicas: TP traffic uses the primary ODP and accesses F/R replicas (strong reads/writes). AP traffic uses an independent ODP configured for weak reads (e.g., obproxy_read_consistency = 1) and routed only to the zone containing the C replica (e.g., proxy_primary_zone_name, route_target_replica_type = 'ColumnStore', init_sql = 'set @@ob_route_policy = COLUMN_STORE_ONLY', etc.). This achieves physical isolation between TP and AP workloads. For more information, see Use columnar replicas.
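As a sketch of the AP-side session setup, the weak-read settings mentioned above might be applied as follows (variable values are assumptions drawn from the routing options listed above; the exact ODP configuration depends on your deployment):

```sql
-- On the AP-dedicated ODP connection (e.g., via init_sql):
set @@ob_read_consistency = 'WEAK';         -- weak reads only
set @@ob_route_policy = COLUMN_STORE_ONLY;  -- route reads to the C replica
```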
Disaster recovery and limitations for AP (C replica) side:
- C replicas cannot become leaders. Physical recovery is not supported for C replicas (if a tenant's locality includes a C replica, the recovery will fail). If the primary database does not have a C replica, it is not recommended to deploy a C replica on the standby database.
- If the zone or ODP containing the C replica fails, AP queries can fall back to F/R replicas: point the AP client at the primary ODP, or adjust the routing so that analysis requests execute strongly consistent reads on F/R replicas (which will share resources with TP workloads).
- Starting from V4.3.5 BP1, you can deploy multiple C replicas (up to 3 recommended) across multiple zones. You can route AP reads to different C replicas through the ODP, achieving redundancy and high availability for AP reads.
How to create a columnstore table?
First, create a hybrid row-columnstore table.
- Non-partitioned table:
create table t1(
c1 TINYINT,
c2 SMALLINT,
c3 MEDIUMINT
) with column group (all columns, each column);
- Partitioned table:
create table t2(
pk int,
c1 int,
c2 int,
primary key (pk)
)
partition by hash(pk) partitions 4
with column group (all columns, each column);
When creating a hybrid row-columnstore table, the with column group (all columns, each column) syntax is always used. This syntax has the following meanings:
- all columns: All columns are aggregated into a single group, treated as a wide column, and stored row by row. This is essentially the same as the traditional rowstore.
- each column: Each column is stored separately in columnar format.
When both all columns and each column are specified, it means that after creating a columnstore table, the data is redundantly stored in both rowstore and columnstore formats. Each replica stores two copies of the baseline data. However, it's worth noting that regardless of the number of baseline data copies, the incremental data in the memtable and the dump still share the same baseline data.
Next, create a pure columnstore table.
- Non-partitioned table:
create table t3(
c1 TINYINT,
c2 SMALLINT,
c3 MEDIUMINT
) with column group (each column);
- Partitioned table:
create table t4(
pk1 int,
pk2 int,
c1 int,
c2 int,
primary key (pk1, pk2)
)
partition by hash(pk1) partitions 4
with column group (each column);
For the t4 table, a columnstore is created for each of the pk1, pk2, c1, and c2 columns, and a rowstore is created for the (pk1, pk2) combination.
How to make a tenant's default table creation format columnstore?
Set a tenant-level parameter:
alter system set default_table_store_format = "column";
Similarly, you can set the default table creation format to rowstore or both rowstore and columnstore:
alter system set default_table_store_format = "row";      -- Rowstore
alter system set default_table_store_format = "compound"; -- Both rowstore and columnstore
What are the recommended configurations for columnstore?
# Set the collation to utf8mb4_bin to boost performance by 15%
set global collation_connection = utf8mb4_bin;
set global collation_server = utf8mb4_bin;
set global ob_query_timeout = 10000000000;
set global ob_trx_timeout = 100000000000;
set global ob_sql_work_area_percentage = 30;
set global max_allowed_packet = 67108864;
# Recommended value is 10 times the number of CPU cores
set global parallel_servers_target=1000;
set global parallel_degree_policy = auto;
set global parallel_min_scan_time_threshold = 10;
# Limit the maximum DOP when parallel_degree_policy is set to auto
# A high DOP may cause performance issues. The recommended value is cpu_count * 2
set global parallel_degree_limit = 0;
alter system set compaction_low_thread_score = cpu_count;
alter system set compaction_mid_thread_score = cpu_count;
alter system set default_table_store_format = "column";
Note
In the code above, cpu_count represents the min_cpu specified when the tenant was created.
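As a worked example, for a tenant created with min_cpu = 10, the guidance above works out to roughly the following values (a sketch; tune to your workload):

```sql
set global parallel_servers_target = 100;           -- 10 cores * 10
set global parallel_degree_limit = 20;              -- 10 cores * 2, if auto DOP picks too high a DOP
alter system set compaction_low_thread_score = 10;  -- min_cpu
alter system set compaction_mid_thread_score = 10;  -- min_cpu
```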
How to determine if a query is using columnstore?
When a query uses rowstore, the explain output shows TABLE FULL SCAN. When a query uses columnstore, the explain output shows COLUMN TABLE FULL SCAN. For example, consider the following t5 table:
create table t5(
c1 TINYINT,
c2 SMALLINT,
c3 MEDIUMINT,
c4 INT,
c5 INT,
PRIMARY KEY(c1, c2)
) with column group(all columns, each column);
OceanBase(admin@test)>explain select c1,c2 from t5;
+------------------------------------------------------------------------+
| Query Plan |
+------------------------------------------------------------------------+
| =============================================== |
| |ID|OPERATOR |NAME|EST.ROWS|EST.TIME(us)| |
| ----------------------------------------------- |
| |0 |TABLE FULL SCAN|t5 |1 |3 | |
| =============================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t5.c1], [t5.c2]), filter(nil), rowset=16 |
| access([t5.c1], [t5.c2]), partitions(p0) |
| is_index_back=false, is_global_index=false, |
| range_key([t5.c1], [t5.c2]), range(MIN,MIN ; MAX,MAX)always true |
+------------------------------------------------------------------------+
11 rows in set (0.011 sec)
OceanBase(admin@test)>explain select c1 from t5;
+------------------------------------------------------------------------+
| Query Plan |
+------------------------------------------------------------------------+
| ====================================================== |
| |ID|OPERATOR |NAME|EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------------ |
| |0 |COLUMN TABLE FULL SCAN|t5 |1 |3 | |
| ====================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([t5.c1]), filter(nil), rowset=16 |
| access([t5.c1]), partitions(p0) |
| is_index_back=false, is_global_index=false, |
| range_key([t5.c1], [t5.c2]), range(MIN,MIN ; MAX,MAX)always true |
+------------------------------------------------------------------------+
11 rows in set (0.003 sec)
Can columns be added or dropped in a columnstore table?
Yes, columns can be added or dropped.
The size of a VARCHAR column can be increased or decreased.
Columnstore supports various offline DDL operations, similar to rowstore tables.
For more information about modifying columnstore tables, see Convert a table from rowstore to columnstore (MySQL mode) and Convert a table from rowstore to columnstore (Oracle mode).
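A minimal sketch of such DDL on a columnstore table (table and column names are illustrative):

```sql
create table ct_demo(pk int primary key, v varchar(64)) with column group (each column);
alter table ct_demo add column c2 int;             -- add a column
alter table ct_demo modify column v varchar(128);  -- enlarge a VARCHAR
alter table ct_demo drop column c2;                -- drop a column
```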
What are the characteristics of queries on columnstore tables?
In a hybrid row-columnstore table, range scans default to columnstore, while point get queries fall back to rowstore.
In a pure columnstore table, all queries use columnstore.
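For example, on the hybrid row-columnstore table t5 defined earlier in this topic (primary key (c1, c2)), you would expect behavior like the following. The plan operator for the point get is an assumption; verify with EXPLAIN in your environment:

```sql
-- Point get on the full primary key: falls back to the rowstore
explain select * from t5 where c1 = 1 and c2 = 2;
-- Range scan: defaults to the columnstore (COLUMN TABLE FULL SCAN)
explain select c3 from t5 where c3 > 100;
```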
Does columnstore support transactions, and are there any size limitations?
Yes. Columnstore tables support transactions without size limitations, and strong consistency is guaranteed.
Are there any special considerations for log synchronization, backup, and restore in columnstore tables?
No, columnstore tables behave the same as rowstore tables. Synchronization logs are in rowstore format.
Can a rowstore table be converted to a columnstore table using DDL?
Yes. You can add columnstore and drop rowstore. Here's an example:
create table t1( pk1 int, c2 int, primary key (pk1));
alter table t1 add column group(all columns, each column);
alter table t1 drop column group(all columns, each column);
alter table t1 add column group(each column);
alter table t1 drop column group(each column);
Note
After executing alter table t1 drop column group(all columns, each column);, you don't need to worry about which group holds the data. All columns are placed in a default group called DEFAULT COLUMN GROUP.
Can multiple columns be grouped together in columnstore?
In OceanBase Database V4.3.0, only two options are supported: either store each column separately or store all columns as a single rowstore group. Mixed grouping of columns is not supported.
Does OceanBase Database support updates on columnstore tables, and what is the structure of a MemTable?
In OceanBase Database, add, delete, and modify operations are performed in memory. The data is stored in the MemTable in rowstore format. The baseline data is read-only and stored in columnstore format on disk. When a column is queried, the rowstore data in the MemTable and the columnstore data on disk are merged in real time and returned to the user. This means that OceanBase Database supports strong-consistency reads on columnstore tables without data latency. The data written to the MemTable can be dumped, and the dumped data is still stored in rowstore format. After a major compaction, the rowstore data and baseline columnstore data are merged to form new baseline columnstore data.
Notice
For columnstore tables, if there are a large number of update operations and no major compaction is performed in a timely manner, query performance will be affected. Therefore, we recommend that you perform a major compaction after you import data in batches to achieve optimal query performance. A small number of update operations will not significantly affect performance.
Can I create an index on a column of a columnstore table?
Yes. OceanBase Database does not differentiate between creating an index on a columnstore table or a rowstore table. The index structures are the same.
Creating an index on one or more columns of a columnstore table allows you to construct a covering index to improve point query performance or sort specific columns to enhance sorting performance.
What is the maximum number of columns supported?
Currently, a columnstore table can support up to 4096 columns.
What are the considerations when using columnstore tables?
First: After you import data in batches, we recommend that you perform a major compaction to optimize read performance. Trigger the major compaction by executing the alter system major freeze; statement in the tenant. Then, execute the select STATUS from CDB_OB_MAJOR_COMPACTION where TENANT_ID = tenant ID; statement in the system tenant to check whether the major compaction is completed: when the STATUS field changes to IDLE, it is done.
Second: After a major compaction, we recommend that you collect statistics. To collect statistics, perform the following steps:
In a business tenant, collect statistics for all tables by using 16 threads in parallel:
CALL DBMS_STATS.GATHER_SCHEMA_STATS('db', granularity=>'auto', degree=>16);
You can query the GV$OB_OPT_STAT_GATHER_MONITOR view to view the statistics collection progress.
Third: You can use the full-logic direct load feature to import data in batches. Tables that use this feature do not need to perform a major compaction to achieve optimal columnstore scan performance. The obloader tool and the native LOAD DATA statement support full-logic direct load.
Fourth: For tables that are not large and wide, you can achieve performance comparable to that of columnstore tables without using them. This is because OceanBase Database's rowstore already uses a hybrid row-column storage architecture at the microblock level.
Fifth: The performance of cold runs and hot runs differs for tables with large amounts of data.
Sixth: The optimizer automatically selects whether to access columnstore data in rowstore mode or columnstore mode based on cost estimation.
Seventh: Major compactions of columnstore tables are slower than those of rowstore tables.
FAQ about data import and migration
What is bypass import and how do I perform it?
Bypass import is a method for accelerating data import and query. For large tables, we recommend using bypass import. Currently, the load data command and the insert into select statement support bypass import. For more information, see Overview of bypass import.
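A hedged sketch of both bypass-import entry points follows. The direct() and append hint parameters are assumptions based on OceanBase V4.x direct-load syntax; check the linked overview for the exact options:

```sql
-- LOAD DATA with direct load (hypothetical hint parameters):
load data /*+ direct(true, 0) parallel(4) */ infile '/tmp/t1.csv'
  into table t1 fields terminated by ',';
-- INSERT INTO ... SELECT with direct load:
insert /*+ append enable_parallel_dml parallel(4) */ into t1 select * from t0;
```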
Can I use Flink CDC to synchronize data from other databases to OceanBase Database?
Yes. For more information, see Use Flink CDC to synchronize data from a MySQL database to OceanBase Database.
Can I use Flink Connector to access OceanBase Database?
Yes. For more information, see https://github.com/oceanbase/flink-connector-oceanbase.
FAQ about performance optimization
What are other ways to further improve AP query performance?
Based on some practical experience, if there are no special sorting requirements, do not use the utf8mb4 character set when creating a table, but instead use the binary character set to improve performance. For example:
create table t5(c1 TINYINT, c2 VARCHAR(50)) CHARSET=binary with column group (each column);
Secondly, if the user or business can accept it, specify the utf8mb4_bin collation when creating tables in a MySQL tenant by adding: CHARSET = utf8mb4 COLLATE = utf8mb4_bin.
Additionally, increasing the IOPS of the unit can speed up the direct import.
What are the characteristics of the optimizer for columnar storage?
Compared to the optimizer for row-based storage, the optimizer for columnar storage has the following features:
- The optimizer can autonomously choose between row-based and columnar storage.
- The optimizer can control the choice between row-based and columnar storage using hints at the table level.
- The optimizer has adapted the plan cost calculation for columnar storage.
- The optimizer has added late materialization optimization for columnar storage.
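For example, the table-level choice can be steered with hints. The hint names below (USE_COLUMN_TABLE and NO_USE_COLUMN_TABLE) are assumptions based on V4.3 behavior; verify them against the hint documentation for your version:

```sql
-- Force the columnar path for t5:
select /*+ use_column_table(t5) */ count(c4) from t5;
-- Force the row-based path for t5:
select /*+ no_use_column_table(t5) */ * from t5 where c1 = 1;
```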
