A well-designed table is essential not only for ensuring data integrity and consistency but also for significantly improving query performance, storage efficiency, and system scalability.
This topic provides best practices for table design in hybrid transactional/analytical processing (HTAP) scenarios, including selecting the appropriate table storage format, designing table structures, defining primary keys, and using indexes effectively.
Key benefits and applicable scenarios
| Scenario | Optimization solution | Effect |
|---|---|---|
| Massive data ETL analysis | Columnstore table without a primary key + time range partitioning | Enhances batch import performance and reduces management overhead |
| HTAP workload | Hybrid row-column storage + primary key that includes the partitioning key | Balances transactional and analytical workloads while ensuring consistency |
| JSON multi-valued query | JSON multi-valued index | Speeds up operations like JSON_CONTAINS and improves filtering efficiency |
| Text fuzzy search | Full-text index | Quickly retrieves text content without requiring a full-table scan |
| Join performance optimization for multi-tables | Table group management (data evenly distributed) | Minimizes cross-node data migration and improves complex query efficiency |
| Time-series data processing | HASH partitioning (user ID) + subpartitioning by RANGE (time field) | Speeds up time-range queries through partition pruning and ensures balanced data distribution |
Storage format selection strategy
Comparison of row-based storage and columnar storage
| Storage type | Applicable scenario | Performance characteristic | Typical application scenario |
|---|---|---|---|
| Columnar storage | Online analytical processing (OLAP) workloads | High compression ratio/Optimized columnar scans | Data warehouses/Complex aggregation queries |
| Hybrid row-column storage | HTAP workloads | Balanced performance for transactions and analytics | Real-time reporting/Transaction-analytics systems |
| Row-based storage | Online transactional processing (OLTP) workloads | Low-latency point queries/High-concurrency writes | Order systems/User account management |
Configure the storage format
-- Query the current storage format configuration.
SHOW PARAMETERS LIKE '%store_format%';
-- Set the default storage format at the tenant level.
ALTER SYSTEM SET default_table_store_format = "column"; -- Columnar storage mode.
ALTER SYSTEM SET default_table_store_format = "compound"; -- Hybrid row-column storage mode.
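In addition to the tenant-level default, you can specify the storage format for an individual table with the WITH COLUMN GROUP clause. The following is a minimal sketch (table and column names are illustrative):
-- A pure columnstore table: data is stored in per-column groups.
CREATE TABLE events_col (
    event_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(1024)
) WITH COLUMN GROUP (each column);
-- A hybrid row-column table: a full rowstore copy plus per-column groups.
CREATE TABLE events_hybrid (
    event_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(1024)
) WITH COLUMN GROUP (all columns, each column);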
Schema design guidelines
Index-organized and heap-organized table design
Comparison
| Feature | Index-organized table | Heap-organized table |
|---|---|---|
| Data storage method | Data is stored in primary key order, tightly bound to the primary key. | Data is stored randomly, with indexes separate from the data. |
| Primary key index type | Clustered index (primary key points directly to the data location). | Non-clustered index (primary key serves as a secondary index). |
| Write performance | Lower (requires maintaining sorted order). | Higher (does not require sorting data). |
| Recommended scenarios | Ideal for frequent primary key-based data access. | Ideal for bulk imports and complex query analysis. |
Index-organized table
Characteristics:
- Primary key bound to data: Data is stored and sorted by the primary key, which acts as a clustered index. This eliminates additional storage overhead for secondary indexes.
- Efficient access: The primary key directly maps to the data’s storage location, enabling faster lookups during queries.
- Write performance: Insert operations require maintaining the primary key order, which can impact write performance, particularly in high-concurrency environments.
Example schema:
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL,
phone_num bigint NOT NULL,
city_name varchar(50) NOT NULL,
sex int NOT NULL,
id_number varchar(18) NOT NULL,
home_address varchar(255) NOT NULL,
office_address varchar(255) NOT NULL,
age int NOT NULL
);
Note
By default, the example above creates an index-organized table. OceanBase Database allows you to control the default table organization mode for newly created tables using the default_table_organization parameter. Setting this parameter to HEAP creates heap-organized tables by default, while INDEX creates index-organized tables. For more information, see default_table_organization.
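For example, the following sketch checks the current setting and switches the default to heap-organized tables (run it with sufficient privileges; the statements are illustrative):
-- Query the current default organization mode for new tables.
SHOW PARAMETERS LIKE 'default_table_organization';
-- Create heap-organized tables by default.
ALTER SYSTEM SET default_table_organization = 'HEAP';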
Heap-organized table
Heap-organized tables are available starting from OceanBase Database V4.3.5 BP1.
Characteristics:
- Unordered data storage: Data is inserted without being sorted by the primary key.
- Primary key as a secondary index: The primary key acts as a non-clustered index, keeping the data independent of the index.
- High write performance: Since data order is not enforced, write operations are faster and more efficient.
Example schema:
CREATE TABLE customer_heap (
user_id BIGINT NOT NULL PRIMARY KEY,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
...
) ORGANIZATION = HEAP;
Partitioning design
Scenario:
- Large data volumes: When the number of records in a table reaches hundreds of millions or more, operations on a single table can degrade query performance and even create resource bottlenecks in the database.
- Frequent queries on specific fields: For example, if the time field is frequently used as a query condition, partitioning by time can significantly improve the efficiency of time-based queries.
Choosing a partition type:
| Partition strategy | Applicable scenario | Data distribution characteristics | Typical configuration example |
|---|---|---|---|
| HASH | Uniform distribution of data across nodes | High-cardinality columns (such as user IDs) | PARTITION BY HASH(user_id) |
| RANGE | Time-series or numeric range data | Date or numeric ranges | PARTITION BY RANGE COLUMNS(create_date) |
| LIST | Discrete value-based storage | Regions or status codes | PARTITION BY LIST(region_code) |
| KEY | Multi-column composite key or uniform storage of non-integer fields | Integer values obtained by applying the hash algorithm to the partitioning key, followed by modulo operation | PARTITION BY KEY(region, create_date) |
Notice
KEY partitioning is supported only in the MySQL-compatible mode of OceanBase Database.
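To illustrate the RANGE strategy from the table above, the following sketch partitions a log table by month so that time-range queries scan only the relevant partitions (the table name and partition bounds are illustrative):
CREATE TABLE user_login_log (
    user_id BIGINT NOT NULL,
    login_date DATE NOT NULL,
    device_type VARCHAR(64)
)
PARTITION BY RANGE COLUMNS(login_date) (
    PARTITION p202401 VALUES LESS THAN ('2024-02-01'),
    PARTITION p202402 VALUES LESS THAN ('2024-03-01'),
    PARTITION p_max VALUES LESS THAN (MAXVALUE)
);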
In analytical processing (AP) scenarios, queries often involve multiple dimensions. If no single dimension can partition the data effectively for all queries, but partitioning is still necessary to distribute data across multiple machines and leverage the computing power of distributed nodes, the partition key for HASH partitioning can be chosen as follows to ensure an even data distribution:
- Ensure the number of distinct values (NDV) is more than the number of partitions.
- Ensure the data in this column is not heavily skewed, or is only slightly skewed (for example, transaction IDs, user IDs, or auto-increment columns).
- Preferably, use integer columns or time columns. Consider varchar/char columns if necessary.
- Choose columns that frequently appear in query conditions to enable partition pruning and improve query performance.
Example: Using HASH partitioning
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL
)
PARTITION BY HASH(user_id) PARTITIONS 128;
Primary key design
Scenario:
- Ensure uniqueness: Design a primary key when it is necessary to guarantee the uniqueness of each record in the table.
- Boost query efficiency: A primary key helps the query optimizer generate more efficient query plans. Include the fields that appear most frequently in query conditions in the primary key so that it can accelerate those queries effectively.
Here is an example:
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
PRIMARY KEY (user_id, age, login_time)
)
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
SUBPARTITION p_youth VALUES LESS THAN (25),
SUBPARTITION p_adult VALUES LESS THAN (40),
SUBPARTITION p_middle_aged VALUES LESS THAN (60),
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE)
);
Optimization solutions for HTAP workloads
Storage architecture selection
| Architecture type | Applicable scenarios | Performance characteristics | Resource consumption |
|---|---|---|---|
| Rowstore table + columnstore index | Transaction processing (TP) workloads with lightweight AP queries | High write performance, limited optimization for AP queries | Index storage increases by 30%-50% |
| Columnstore table + rowstore index | AP workloads with occasional TP point queries | Optimal analysis performance, TP queries require index assistance | Full data is stored twice |
| Hybrid row-column storage | Balanced TP and AP workloads | Strong consistency, with automatic routing | Storage overhead increases by 100% |
| Columnstore replica (2F1A1C) | Independent AP analysis | Read-write isolation with eventual consistency | Additional replica storage |
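As an illustration of the first architecture (rowstore table + columnstore index), the following sketch assumes a version that supports the WITH COLUMN GROUP clause on indexes; all names are illustrative:
-- Rowstore table serving TP writes and point queries.
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    amount DECIMAL(12, 2) NOT NULL,
    create_time DATETIME NOT NULL
);
-- Columnstore index that accelerates analytical scans on selected columns.
CREATE INDEX idx_orders_analytics ON orders (user_id, amount) WITH COLUMN GROUP (each column);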
Special index optimization practices
JSON multi-valued indexes
Scenario:
JSON multi-valued indexes are designed for array fields in JSON documents. They are ideal for scenarios where queries involve multiple values or attributes. Common use cases include:
Many-to-many relationship queries:
In scenarios with many-to-many relationships between two entities, a multi-valued index can significantly improve query performance. For example, actors may appear in multiple movies, and a movie may feature multiple actors. By storing all the actors in a JSON array for each movie, a JSON multi-valued index can optimize queries for movies featuring a specific actor.
Tag and category queries:
When entities have multiple tags or categories, a multi-valued index can enhance query performance. For example, a product may have multiple tags stored in a JSON array. Using a JSON multi-valued index allows for faster queries to find products with specific tags.
When to use: JSON multi-valued indexes are particularly effective for accelerating queries on JSON arrays when the WHERE clause includes any of the following predicates:
- MEMBER OF()
- JSON_CONTAINS()
- JSON_OVERLAPS()
For more information about JSON multi-valued indexes, see Multi-valued indexes.
Case study:
Scenario: Suppose we have a user information table that records user ID, name, age, and hobbies. The hobbies are stored as a JSON array, where a user can have multiple hobbies, which can be understood as tags for the user. The structure and sample data of the table used to store user information are as follows:
create table user_info(user_id bigint, name varchar(1024), age bigint, hobbies json);
insert into user_info values(1, "LiLei", 18, '["reading", "knitting", "hiking"]');
insert into user_info values(2, "HanMeimei", 17, '["reading", "Painting", "Swimming"]');
insert into user_info values(3, "XiaoMing", 19, '["hiking", "Camping", "Swimming"]');
In targeted advertising for products, we may need to accurately identify users based on their hobbies. For example, before promoting outdoor hiking equipment, we need to query which users have a hobby of "hiking." The corresponding query is as follows:
select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The query result is as follows:
+---------+----------+
| user_id | name |
+---------+----------+
| 1 | LiLei |
| 3 | XiaoMing |
+---------+----------+
Now let's execute the following statement to check the efficiency of the initial query:
-- When this query is executed for the first time, it may require a full table scan, resulting in low efficiency.
Explain select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The return result is as follows:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ==================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------- |
| |0 |TABLE FULL SCAN|user_info|2 |3 | |
| ==================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([user_info.user_id], [user_info.name]), filter([JSON_CONTAINS(JSON_EXTRACT(user_info.hobbies, '$[*]'), cast('[\"hiking\"]', JSON(536870911)))]), rowset=16 |
| access([user_info.hobbies], [user_info.user_id], [user_info.name]), partitions(p0) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([user_info.__pk_increment]), range(MIN ; MAX)always true |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set
From the query plan above, we can see that the entire query requires a full table scan (TABLE FULL SCAN), and each row must be filtered and compared against the JSON array. The filtering overhead of JSON itself is also significant. When the number of rows to be filtered reaches a certain threshold, it can severely impact query efficiency. At this point, creating a JSON multi-valued index on the hobbies column can significantly improve the query's performance.
Currently, the feature to add JSON multi-valued indexes to an existing table is disabled by default. It needs to be enabled under the sys tenant by turning on the switch for creating JSON multi-valued indexes.
ALTER SYSTEM SET _enable_add_fulltext_index_to_existing_table = true;
-- The query plan shows that the entire table needs to be scanned. To improve performance, create a JSON multi-valued index on the hobbies column. Because hobbies stores string values, cast the array elements to a character type. For a new table, the index can be declared directly in the CREATE TABLE statement:
CREATE TABLE user_info (
    user_id BIGINT,
    name VARCHAR(1024),
    age BIGINT,
    hobbies JSON,
    INDEX idx1 ((CAST(hobbies->'$[*]' AS CHAR(64) ARRAY)))
);
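If the user_info table already exists with data, as in this case study, you can instead add the index in place once the switch above is enabled. A sketch using the standard multi-valued index syntax:
ALTER TABLE user_info ADD INDEX idx1 ((CAST(hobbies->'$[*]' AS CHAR(64) ARRAY)));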
Execute the query again to see whether the query efficiency is improved:
-- After the index is created, execute the query again. The performance is significantly improved.
Explain select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The return result is as follows:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ========================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------------- |
| |0 |TABLE FULL SCAN|user_info(idx1)|1 |10 | |
| ========================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([user_info.user_id], [user_info.name]), filter([JSON_CONTAINS(JSON_EXTRACT(user_info.hobbies, '$[*]'), cast('[\"hiking\"]', JSON(536870911)))]) |
| access([user_info.__pk_increment], [user_info.hobbies], [user_info.user_id], [user_info.name]), partitions(p0) |
| is_index_back=true, is_global_index=false, filter_before_indexback[false], |
| range_key([user_info.SYS_NC_mvi_21], [user_info.__pk_increment], [user_info.__doc_id_1733716274684183]), range(hiking,MIN,MIN ; hiking,MAX,MAX) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set
Note that JSON multi-valued indexes consume additional storage space and may impact write performance. When you modify a JSON field that contains multi-valued indexes (for example, insert, update, or delete operations), the indexes are also updated, increasing the write overhead. Therefore, weigh the advantages and disadvantages of JSON multi-valued indexes and create them as needed.
Full-text index
Scenarios:
When performing fuzzy searches on large volumes of text data, scanning the entire table and evaluating each row often fails to meet performance requirements, especially when the text is long and the data volume is high. Additionally, certain complex query scenarios, such as approximate matching and relevance-based ranking, are difficult to support by simply rewriting SQL statements.
To handle these scenarios more efficiently, full-text indexes are introduced. By preprocessing text content and building keyword indexes, full-text indexes significantly improve the performance of full-text searches. They are applicable to a wide range of use cases, as highlighted below:
- Enterprise internal knowledge bases: Many large enterprises maintain internal knowledge bases to store project documents, meeting records, research reports, and other materials. Full-text indexes enable employees to quickly and accurately locate the information they need, boosting overall work efficiency.
- Online libraries and e-book platforms: For platforms offering a vast collection of books, full-text indexes are essential. Users can search by book titles, author names, or even specific excerpts from a book. The system uses full-text indexes to efficiently return results that match the search criteria.
- News portals and social media platforms: These platforms generate massive amounts of content daily, including articles, posts, and comments. Full-text indexes allow users to filter content streams based on topics, events, or specific names, helping them access the most relevant information.
- Legal document retrieval systems: The legal industry involves extensive document review, such as analyzing contracts, judgments, and legal provisions. A highly efficient full-text search engine can streamline workflows for lawyers, enabling them to quickly find precedents, citations, and relevant legal references.
- Medical and health information systems: In healthcare, doctors often need to review patients' historical medical records, the latest research papers, and other reference materials. Full-text indexes make it easier for healthcare professionals to access the information they need, supporting more accurate diagnostic decisions.
Full-text indexes are ideal for applications that involve managing and querying large volumes of unstructured text data. For more information about the full-text search capabilities of OceanBase Database, see Full-text index.
Example:
Scenario: We define a table to store document data and set up a full-text index for the documents. Using the full-text index, documents containing the desired keywords can be quickly matched and ranked in descending order of relevance.
CREATE TABLE Articles (
id INT AUTO_INCREMENT,
title VARCHAR(255),
content TEXT,
PRIMARY KEY (id),
FULLTEXT ft1 (content) WITH PARSER SPACE
);
INSERT INTO Articles (title, content) VALUES
('Introduction to OceanBase', 'OceanBase is an open-source relational database management system.'),
('Full-Text Search in Databases', 'Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly.'),
('Advantages of Using OceanBase', 'OceanBase offers several advantages such as high performance, reliability, and ease of use. ');
select * from Articles;
The query result is as follows:
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | content |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | Introduction to OceanBase | OceanBase is an open-source relational database management system. |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. |
| 3 | Advantages of Using OceanBase | OceanBase offers several advantages such as high performance, reliability, and ease of use. |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set
Execute the following query to find the matching documents:
-- Query the matching documents
select id, title, content, match(content) against('OceanBase database') score from Articles where match(content) against('OceanBase database');
The return result is as follows:
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title | content | score |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 1 | Introduction to OceanBase | OceanBase is an open-source relational database management system. | 0.5699481865284975 |
| 3 | Advantages of Using OceanBase | OceanBase offers several advantages such as high performance, reliability, and ease of use. | 0.240174672489083 |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. | 0.20072992700729927 |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
3 rows in set
You can also execute the following EXPLAIN statement to view the execution plan and analyze its performance:
Explain select id, title, content, match(content) against('OceanBase database') score from Articles where match(content) against('OceanBase database');
The execution result is as follows:
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| -------------------------------------------------------------- |
| |0 |SORT | |17 |145 | |
| |1 |└─TEXT RETRIEVAL SCAN|articles(ft1)|17 |138 | |
| ============================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([articles.id], [articles.title], [articles.content], [MATCH(articles.content) AGAINST('OceanBase database')]), filter(nil), rowset=256 |
| sort_keys([MATCH(articles.content) AGAINST('OceanBase database'), DESC]) |
| 1 - output([articles.id], [articles.content], [articles.title], [MATCH(articles.content) AGAINST('OceanBase database')]), filter(nil), rowset=256 |
| access([articles.id], [articles.content], [articles.title]), partitions(p0) |
| is_index_back=true, is_global_index=false, |
| calc_relevance=true, match_expr(MATCH(articles.content) AGAINST('OceanBase database')), |
| pushdown_match_filter(MATCH(articles.content) AGAINST('OceanBase database')) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
15 rows in set
Optimization in distributed environments
Table group management
In a distributed database, tables are often partitioned, which means the data is distributed across multiple servers. This distribution can result in cross-server communication during complex operations like JOIN queries, leading to performance overhead. To reduce this impact, table groups can be used to minimize cross-server access. For example, you can create a table group named tg1 with the SHARDING attribute set to ADAPTIVE and include two partitioned tables: customer and sales. When the join condition for operations on these partitioned tables includes the partitioning keys, a partition-wise join can be performed, significantly improving performance.
Here is an example:
CREATE TABLEGROUP tg1 SHARDING = 'ADAPTIVE';
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL)
TABLEGROUP = tg1
PARTITION BY HASH(user_id) PARTITIONS 128;
CREATE TABLE sales (
order_id INT,
user_id INT PRIMARY KEY,
item_id INT,
item_count INT)
TABLEGROUP = tg1
PARTITION BY HASH(user_id) PARTITIONS 128;
SELECT * FROM customer, sales WHERE customer.user_id = sales.user_id;
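To confirm that the join is executed as a partition-wise join, inspect the plan with EXPLAIN (the operator names in the comments below are indicative and may vary by version):
EXPLAIN SELECT * FROM customer, sales WHERE customer.user_id = sales.user_id;
-- In a partition-wise join, the join operator runs under a single
-- PX PARTITION ITERATOR, with no EXCHANGE (data redistribution) operators
-- between the two table scans.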
Subpartition management
For scenarios involving large volumes of business data where query patterns are well defined, subpartitions can be created to take advantage of partition pruning and accelerate queries. AP systems often query data from the most recent day or month, so queries are typically time-based. In such cases, we recommend using a time-related field or time function as the subpartition key and applying RANGE partitioning for efficient range queries. The following example uses an integer column (age) as the subpartition key for illustration; a time-based variant is shown after it.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key contains all partitioning keys (user_id and age).
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partitions: distributed by hash on user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: divided by age group
SUBPARTITION p_youth VALUES LESS THAN (25),
SUBPARTITION p_adult VALUES LESS THAN (40),
SUBPARTITION p_middle_aged VALUES LESS THAN (60),
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE)
);
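For the time-based queries described above, the same pattern can use a time column as the subpartition key instead. A sketch, assuming support for RANGE COLUMNS subpartitioning (the table name and partition bounds are illustrative):
CREATE TABLE user_login_hist (
    user_id BIGINT NOT NULL,
    login_time DATETIME NOT NULL,
    city_name VARCHAR(50) NOT NULL,
    -- The primary key contains both partitioning keys.
    PRIMARY KEY (user_id, login_time)
)
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE COLUMNS(login_time)
SUBPARTITION TEMPLATE (
    SUBPARTITION p2024h1 VALUES LESS THAN ('2024-07-01'),
    SUBPARTITION p2024h2 VALUES LESS THAN ('2025-01-01'),
    SUBPARTITION p_future VALUES LESS THAN (MAXVALUE)
);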
Considerations
Index trade-offs:
- JSON multi-valued indexes and full-text indexes can increase storage overhead. It is important to balance write frequency and query requirements when deciding to create these indexes.
Partition key selection:
- The primary key must include all partitioning keys.
- For hash- or key-partitioned tables, choose fields with high cardinality, minimal skew, and frequent queries as partition keys, such as user IDs or order numbers.
- For range-partitioned tables, use time columns or numeric columns to define ranges.
- Low-cardinality fields, such as province names or company names, are better suited as partition keys for list-partitioned tables.
Distributed design:
- Related tables should be placed in the same table group to ensure data co-location and minimize the network overhead of join operations.
Subpartitioning:
- In AP scenarios, subranges can be designed based on time or business dimension fields (for example, by quarter). While low-cardinality fields are acceptable, the partitioning logic must align with the query patterns.