A well-designed table is essential not only for ensuring data integrity and consistency but also for significantly improving query performance, storage efficiency, and system scalability.
This topic provides best practices for table design in hybrid transactional/analytical processing (HTAP) scenarios, including selecting the appropriate table storage format, designing table structures, defining primary keys, and using indexes effectively.
Key benefits and applicable scenarios
| Scenario | Optimization solution | Effect |
|---|---|---|
| Massive data ETL analysis | Columnstore table without a primary key + time range partitioning | Enhances batch import performance and reduces management overhead |
| HTAP workload | Hybrid row-column storage + primary key that includes the partitioning key | Balances transactional and analytical workloads while ensuring consistency |
| JSON multi-valued query | JSON multi-valued index | Speeds up operations like JSON_CONTAINS and improves filtering efficiency |
| Text fuzzy search | Full-text index | Quickly retrieves text content without requiring a full-table scan |
| Join performance optimization for multi-tables | Table group management (data evenly distributed) | Minimizes cross-node data migration and improves complex query efficiency |
| Time-series data processing | HASH partitioning (user ID) + subpartitioning by RANGE (time field) | Speeds up time-range queries through partition pruning and ensures balanced data distribution |
Storage format selection strategy
Comparison of row-based storage and columnar storage
| Storage type | Applicable scenario | Performance characteristic | Typical application scenario |
|---|---|---|---|
| Columnar storage | Online analytical processing (OLAP) workloads | High compression ratio/Optimized columnar scans | Data warehouses/Complex aggregation queries |
| Hybrid row-column storage | HTAP workloads | Balanced performance for transactions and analytics | Real-time reporting/Transaction-analytics systems |
| Row-based storage | Online transactional processing (OLTP) workloads | Low-latency point queries/High-concurrency writes | Order systems/User account management |
Configure the storage format
-- Query the current storage format configuration.
SHOW PARAMETERS LIKE '%store_format%';
-- Set the default storage format at the tenant level.
ALTER SYSTEM SET default_table_store_format = "column"; -- Columnar storage mode.
ALTER SYSTEM SET default_table_store_format = "compound"; -- Hybrid row-column storage mode.
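In addition to the tenant-level default, you can specify the storage format for an individual table with the WITH COLUMN GROUP clause. The following is a minimal sketch (table and column names are illustrative):
-- A pure columnstore table: data is stored in per-column groups.
CREATE TABLE events_col (
    event_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(1024)
) WITH COLUMN GROUP (each column);
-- A hybrid row-column table: a full rowstore copy plus per-column groups.
CREATE TABLE events_hybrid (
    event_id BIGINT NOT NULL,
    event_time DATETIME NOT NULL,
    payload VARCHAR(1024)
) WITH COLUMN GROUP (all columns, each column);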
Schema design guidelines
Index-organized and heap-organized table design
Comparison
| Feature | Index-organized table | Heap-organized table |
|---|---|---|
| Data storage method | Data is stored in primary key order, tightly bound to the primary key. | Data is stored randomly, with indexes separate from the data. |
| Primary key index type | Clustered index (primary key points directly to the data location). | Non-clustered index (primary key serves as a secondary index). |
| Write performance | Lower (requires maintaining sorted order). | Higher (does not require sorting data). |
| Recommended scenarios | Ideal for frequent primary key-based data access. | Ideal for bulk imports and complex query analysis. |
Index-organized table
Characteristics:
- Primary key bound to data: Data is stored and sorted by the primary key, which acts as a clustered index. This eliminates additional storage overhead for secondary indexes.
- Efficient access: The primary key directly maps to the data’s storage location, enabling faster lookups during queries.
- Write performance: Insert operations require maintaining the primary key order, which can impact write performance, particularly in high-concurrency environments.
Example schema:
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL,
phone_num bigint NOT NULL,
city_name varchar(50) NOT NULL,
sex int NOT NULL,
id_number varchar(18) NOT NULL,
home_address varchar(255) NOT NULL,
office_address varchar(255) NOT NULL,
age int NOT NULL
);
Note
By default, the example above creates an index-organized table. OceanBase Database allows you to control the default table organization mode for newly created tables using the default_table_organization parameter. Setting this parameter to HEAP creates heap-organized tables by default, while INDEX creates index-organized tables. For more information, see default_table_organization.
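For example, the following sketch checks the current setting and switches the default to heap-organized tables (run it with sufficient privileges; the statements are illustrative):
-- Query the current default organization mode for new tables.
SHOW PARAMETERS LIKE 'default_table_organization';
-- Create heap-organized tables by default.
ALTER SYSTEM SET default_table_organization = 'HEAP';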
Heap-organized table
Heap-organized tables are available starting from OceanBase Database V4.3.5 BP1.
Characteristics:
- Unordered data storage: Data is inserted without being sorted by the primary key.
- Primary key as a secondary index: The primary key acts as a non-clustered index, keeping the data independent of the index.
- High write performance: Since data order is not enforced, write operations are faster and more efficient.
Example schema:
CREATE TABLE customer_heap (
user_id BIGINT NOT NULL PRIMARY KEY,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
...
) ORGANIZATION = HEAP;
Partitioning design
Scenario:
- Large data volumes: When the number of records in a table reaches hundreds of millions or more, operations on a single table can degrade query performance and even create resource bottlenecks in the database.
- Frequent queries on specific fields: For example, if the time field is frequently used as a query condition, partitioning by time can significantly improve the efficiency of time-based queries.
Choosing a partition type:
| Partition strategy | Applicable scenario | Data distribution characteristics | Typical configuration example |
|---|---|---|---|
| HASH | Uniform distribution of data across nodes | High-cardinality columns (such as user IDs) | PARTITION BY HASH(user_id) |
| RANGE | Time-series or numeric range data | Date or numeric ranges | PARTITION BY RANGE COLUMNS(create_date) |
| LIST | Discrete value-based storage | Regions or status codes | PARTITION BY LIST(region_code) |
| KEY | Multi-column composite key or uniform storage of non-integer fields | Integer values obtained by applying the hash algorithm to the partitioning key, followed by modulo operation | PARTITION BY KEY(region, create_date) |
Notice
KEY partitioning is supported only in the MySQL-compatible mode of OceanBase Database.
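To illustrate the RANGE strategy from the table above, the following sketch partitions a log table by month so that time-range queries scan only the relevant partitions (the table name and partition bounds are illustrative):
CREATE TABLE user_login_log (
    user_id BIGINT NOT NULL,
    login_date DATE NOT NULL,
    device_type VARCHAR(64)
)
PARTITION BY RANGE COLUMNS(login_date) (
    PARTITION p202401 VALUES LESS THAN ('2024-02-01'),
    PARTITION p202402 VALUES LESS THAN ('2024-03-01'),
    PARTITION p_max VALUES LESS THAN (MAXVALUE)
);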
In analytical processing (AP) scenarios, queries often involve multiple dimensions. If no single dimension can partition the data effectively for all queries, but partitioning is still necessary to distribute data across multiple machines and leverage the computing power of distributed nodes, the partition key for HASH partitioning can be chosen as follows to ensure an even data distribution:
- Ensure the number of distinct values (NDV) is more than the number of partitions.
- Ensure the data in this column is not heavily skewed, or is only slightly skewed (for example, transaction IDs, user IDs, or auto-increment columns).
- Preferably, use integer columns or time columns. Consider varchar/char columns if necessary.
- Choose columns that frequently appear in query conditions to enable partition pruning and improve query performance.
Example: Using HASH partitioning
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL
)
PARTITION BY HASH(user_id) PARTITIONS 128;
Primary key design
Scenario:
- Ensure uniqueness: Design a primary key when it is necessary to guarantee the uniqueness of each record in the table.
- Boost query efficiency: A primary key helps the query optimizer generate more efficient query plans. Include the fields that appear most frequently in query conditions in the primary key so that it can accelerate those queries effectively.
Here is an example:
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
PRIMARY KEY (user_id, age, login_time)
)
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
SUBPARTITION p_youth VALUES LESS THAN (25),
SUBPARTITION p_adult VALUES LESS THAN (40),
SUBPARTITION p_middle_aged VALUES LESS THAN (60),
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE)
);
Optimization solutions for HTAP workloads
Storage architecture selection
| Architecture type | Applicable scenarios | Performance characteristics | Resource consumption |
|---|---|---|---|
| Rowstore table + columnstore index | Transaction processing (TP) workloads with lightweight AP queries | High write performance, limited optimization for AP queries | Index storage increases by 30%-50% |
| Columnstore table + rowstore index | AP workloads with occasional TP point queries | Optimal analysis performance, TP queries require index assistance | Full data is stored twice |
| Hybrid row-column storage | Balanced TP and AP workloads | Strong consistency, with automatic routing | Storage overhead increases by 100% |
| Columnstore replica (2F1A1C) | Independent AP analysis | Read-write isolation with eventual consistency | Additional replica storage |
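As an illustration of the first architecture (rowstore table + columnstore index), the following sketch assumes a version that supports the WITH COLUMN GROUP clause on indexes; all names are illustrative:
-- Rowstore table serving TP writes and point queries.
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    amount DECIMAL(12, 2) NOT NULL,
    create_time DATETIME NOT NULL
);
-- Columnstore index that accelerates analytical scans on selected columns.
CREATE INDEX idx_orders_analytics ON orders (user_id, amount) WITH COLUMN GROUP (each column);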
Special index optimization practices
JSON multi-valued indexes
Scenario:
JSON multi-valued indexes are designed for array fields in JSON documents. They are ideal for scenarios where queries involve multiple values or attributes. Common use cases include:
Many-to-many relationship queries:
In scenarios with many-to-many relationships between two entities, a multi-valued index can significantly improve query performance. For example, actors may appear in multiple movies, and a movie may feature multiple actors. By storing all the actors in a JSON array for each movie, a JSON multi-valued index can optimize queries for movies featuring a specific actor.
Tag and category queries:
When entities have multiple tags or categories, a multi-valued index can enhance query performance. For example, a product may have multiple tags stored in a JSON array. Using a JSON multi-valued index allows for faster queries to find products with specific tags.
When to use: JSON multi-valued indexes are particularly effective for accelerating queries on JSON arrays when the WHERE clause includes any of the following predicates:
- MEMBER OF()
- JSON_CONTAINS()
- JSON_OVERLAPS()
For more information about JSON multi-valued indexes, see Multi-valued indexes.
Case study:
Scenario: Suppose we have a user information table that records user ID, name, age, and hobbies. The hobbies are stored as a JSON array, where a user can have multiple hobbies, which can be understood as tags for the user. The structure and sample data of the table used to store user information are as follows:
create table user_info(user_id bigint, name varchar(1024), age bigint, hobbies json);
insert into user_info values(1, "LiLei", 18, '["reading", "knitting", "hiking"]');
insert into user_info values(2, "HanMeimei", 17, '["reading", "Painting", "Swimming"]');
insert into user_info values(3, "XiaoMing", 19, '["hiking", "Camping", "Swimming"]');
In targeted advertising for products, we may need to accurately identify users based on their hobbies. For example, before promoting outdoor hiking equipment, we need to query which users have a hobby of "hiking." The corresponding query is as follows:
select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The query result is as follows:
+---------+----------+
| user_id | name |
+---------+----------+
| 1 | LiLei |
| 3 | XiaoMing |
+---------+----------+
Now let's execute the following statement to check the efficiency of the initial query:
-- When this query is executed for the first time, it may require a full table scan, resulting in low efficiency.
Explain select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The return result is as follows:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ==================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------- |
| |0 |TABLE FULL SCAN|user_info|2 |3 | |
| ==================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([user_info.user_id], [user_info.name]), filter([JSON_CONTAINS(JSON_EXTRACT(user_info.hobbies, '$[*]'), cast('[\"hiking\"]', JSON(536870911)))]), rowset=16 |
| access([user_info.hobbies], [user_info.user_id], [user_info.name]), partitions(p0) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([user_info.__pk_increment]), range(MIN ; MAX)always true |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set
From the query plan above, we can see that the entire query requires a full table scan (TABLE FULL SCAN), and each row must be filtered and compared against the JSON array. The filtering overhead of JSON itself is also significant. When the number of rows to be filtered reaches a certain threshold, it can severely impact query efficiency. At this point, creating a JSON multi-valued index on the hobbies column can significantly improve the query's performance.
Currently, the feature to add JSON multi-valued indexes to an existing table is disabled by default. It needs to be enabled under the sys tenant by turning on the switch for creating JSON multi-valued indexes.
ALTER SYSTEM SET _enable_add_fulltext_index_to_existing_table = true;
-- The query plan shows that the entire table needs to be scanned. To improve performance, create a JSON multi-valued index on the hobbies column. Because hobbies stores string values, cast the array elements to a character type. For a new table, the index can be declared directly in the CREATE TABLE statement:
CREATE TABLE user_info (
    user_id BIGINT,
    name VARCHAR(1024),
    age BIGINT,
    hobbies JSON,
    INDEX idx1 ((CAST(hobbies->'$[*]' AS CHAR(64) ARRAY)))
);
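If the user_info table already exists with data, as in this case study, you can instead add the index in place once the switch above is enabled. A sketch using the standard multi-valued index syntax:
ALTER TABLE user_info ADD INDEX idx1 ((CAST(hobbies->'$[*]' AS CHAR(64) ARRAY)));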
Execute the query again to see whether the query efficiency is improved:
-- After the index is created, execute the query again. The performance is significantly improved.
Explain select user_id, name from user_info where JSON_CONTAINS(hobbies->'$[*]', CAST('["hiking"]' AS JSON));
The return result is as follows:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ========================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ---------------------------------------------------------- |
| |0 |TABLE FULL SCAN|user_info(idx1)|1 |10 | |
| ========================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([user_info.user_id], [user_info.name]), filter([JSON_CONTAINS(JSON_EXTRACT(user_info.hobbies, '$[*]'), cast('[\"hiking\"]', JSON(536870911)))]) |
| access([user_info.__pk_increment], [user_info.hobbies], [user_info.user_id], [user_info.name]), partitions(p0) |
| is_index_back=true, is_global_index=false, filter_before_indexback[false], |
| range_key([user_info.SYS_NC_mvi_21], [user_info.__pk_increment], [user_info.__doc_id_1733716274684183]), range(hiking,MIN,MIN ; hiking,MAX,MAX) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set
Note that JSON multi-valued indexes consume additional storage space and may impact write performance. When you modify a JSON field that contains multi-valued indexes (for example, insert, update, or delete operations), the indexes are also updated, increasing the write overhead. Therefore, weigh the advantages and disadvantages of JSON multi-valued indexes and create them as needed.
Full-text index
Scenarios:
When performing fuzzy searches on large volumes of text data, scanning the entire table and evaluating each row often fails to meet performance requirements, especially when the text is long and the data volume is high. Additionally, certain complex query scenarios, such as approximate matching and relevance-based ranking, are difficult to support by simply rewriting SQL statements.
To handle these scenarios more efficiently, full-text indexes are introduced. By preprocessing text content and building keyword indexes, full-text indexes significantly improve the performance of full-text searches. They are applicable to a wide range of use cases, as highlighted below:
- Enterprise internal knowledge bases: Many large enterprises maintain internal knowledge bases to store project documents, meeting records, research reports, and other materials. Full-text indexes enable employees to quickly and accurately locate the information they need, boosting overall work efficiency.
- Online libraries and e-book platforms: For platforms offering a vast collection of books, full-text indexes are essential. Users can search by book titles, author names, or even specific excerpts from a book. The system uses full-text indexes to efficiently return results that match the search criteria.
- News portals and social media platforms: These platforms generate massive amounts of content daily, including articles, posts, and comments. Full-text indexes allow users to filter content streams based on topics, events, or specific names, helping them access the most relevant information.
- Legal document retrieval systems: The legal industry involves extensive document review, such as analyzing contracts, judgments, and legal provisions. A highly efficient full-text search engine can streamline workflows for lawyers, enabling them to quickly find precedents, citations, and relevant legal references.
- Medical and health information systems: In healthcare, doctors often need to review patients' historical medical records, the latest research papers, and other reference materials. Full-text indexes make it easier for healthcare professionals to access the information they need, supporting more accurate diagnostic decisions.
Full-text indexes are ideal for applications that involve managing and querying large volumes of unstructured text data. For more information about the full-text search capabilities of OceanBase Database, see Full-text index.
Example:
Scenario: We define a table to store document data and set up a full-text index for the documents. Using the full-text index, documents containing the desired keywords can be quickly matched and ranked in descending order of relevance.
CREATE TABLE Articles (
id INT AUTO_INCREMENT,
title VARCHAR(255),
content TEXT,
PRIMARY KEY (id),
FULLTEXT ft1 (content) WITH PARSER SPACE
);
INSERT INTO Articles (title, content) VALUES
('Introduction to OceanBase', 'OceanBase is an open-source relational database management system.'),
('Full-Text Search in Databases', 'Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly.'),
('Advantages of Using OceanBase', 'OceanBase offers several advantages such as high performance, reliability, and ease of use. ');
select * from Articles;
The query result is as follows:
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | title | content |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1 | Introduction to OceanBase | OceanBase is an open-source relational database management system. |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. |
| 3 | Advantages of Using OceanBase | OceanBase offers several advantages such as high performance, reliability, and ease of use. |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set
Execute the following query to find the matching documents:
-- Query the matching documents
select id, title, content, match(content) against('OceanBase database') score from Articles where match(content) against('OceanBase database');
The return result is as follows:
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title | content | score |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 1 | Introduction to OceanBase | OceanBase is an open-source relational database management system. | 0.5699481865284975 |
| 3 | Advantages of Using OceanBase | OceanBase offers several advantages such as high performance, reliability, and ease of use. | 0.240174672489083 |
| 2 | Full-Text Search in Databases | Full-text search allows for searching within the text of documents stored in a database. It is particularly useful for finding specific information quickly. | 0.20072992700729927 |
+----+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
3 rows in set
You can also execute the following EXPLAIN statement to view the execution plan and analyze its performance:
Explain select id, title, content, match(content) against('OceanBase database') score from Articles where match(content) against('OceanBase database');
The execution result is as follows:
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| -------------------------------------------------------------- |
| |0 |SORT | |17 |145 | |
| |1 |└─TEXT RETRIEVAL SCAN|articles(ft1)|17 |138 | |
| ============================================================== |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([articles.id], [articles.title], [articles.content], [MATCH(articles.content) AGAINST('OceanBase database')]), filter(nil), rowset=256 |
| sort_keys([MATCH(articles.content) AGAINST('OceanBase database'), DESC]) |
| 1 - output([articles.id], [articles.content], [articles.title], [MATCH(articles.content) AGAINST('OceanBase database')]), filter(nil), rowset=256 |
| access([articles.id], [articles.content], [articles.title]), partitions(p0) |
| is_index_back=true, is_global_index=false, |
| calc_relevance=true, match_expr(MATCH(articles.content) AGAINST('OceanBase database')), |
| pushdown_match_filter(MATCH(articles.content) AGAINST('OceanBase database')) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
15 rows in set
Optimization in distributed environments
Table group management
In a distributed database, tables are often partitioned, which means the data is distributed across multiple servers. This distribution can result in cross-server communication during complex operations like JOIN queries, leading to performance overhead. To reduce this impact, table groups can be used to minimize cross-server access. For example, you can create a table group named tg1 with the SHARDING attribute set to ADAPTIVE and include two partitioned tables: customer and sales. When the join condition for operations on these partitioned tables includes the partitioning keys, a partition-wise join can be performed, significantly improving performance.
Here is an example:
CREATE TABLEGROUP tg1 SHARDING = 'ADAPTIVE';
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL)
TABLEGROUP = tg1
PARTITION BY HASH(user_id) PARTITIONS 128;
CREATE TABLE sales (
order_id INT,
user_id INT PRIMARY KEY,
item_id INT,
item_count INT)
TABLEGROUP = tg1
PARTITION BY HASH(user_id) PARTITIONS 128;
SELECT * FROM customer, sales WHERE customer.user_id = sales.user_id;
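To confirm that the join is executed as a partition-wise join, inspect the plan with EXPLAIN (the operator names in the comments below are indicative and may vary by version):
EXPLAIN SELECT * FROM customer, sales WHERE customer.user_id = sales.user_id;
-- In a partition-wise join, the join operator runs under a single
-- PX PARTITION ITERATOR, with no EXCHANGE (data redistribution) operators
-- between the two table scans.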
Subpartition management
For scenarios involving large volumes of business data where query patterns are well defined, subpartitions can be created to take advantage of partition pruning and accelerate queries. AP systems often query data from the most recent day or month, so queries are typically time-based. In such cases, we recommend using a time-related field or time function as the subpartition key and applying RANGE partitioning for efficient range queries. The following example uses an integer column (age) as the subpartition key for illustration; a time-based variant is shown after it.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key contains all partitioning keys (user_id and age).
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partitions: distributed by hash on user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: divided by age group
SUBPARTITION p_youth VALUES LESS THAN (25),
SUBPARTITION p_adult VALUES LESS THAN (40),
SUBPARTITION p_middle_aged VALUES LESS THAN (60),
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE)
);
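For the time-based queries described above, the same pattern can use a time column as the subpartition key instead. A sketch, assuming support for RANGE COLUMNS subpartitioning (the table name and partition bounds are illustrative):
CREATE TABLE user_login_hist (
    user_id BIGINT NOT NULL,
    login_time DATETIME NOT NULL,
    city_name VARCHAR(50) NOT NULL,
    -- The primary key contains both partitioning keys.
    PRIMARY KEY (user_id, login_time)
)
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE COLUMNS(login_time)
SUBPARTITION TEMPLATE (
    SUBPARTITION p2024h1 VALUES LESS THAN ('2024-07-01'),
    SUBPARTITION p2024h2 VALUES LESS THAN ('2025-01-01'),
    SUBPARTITION p_future VALUES LESS THAN (MAXVALUE)
);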
Considerations
Index trade-offs:
- JSON multi-valued indexes and full-text indexes can increase storage overhead. It is important to balance write frequency and query requirements when deciding to create these indexes.
Partition key selection:
- The primary key must include all partitioning keys.
- For hash- or key-partitioned tables, choose fields with high cardinality, minimal skew, and frequent queries as partition keys, such as user IDs or order numbers.
- For range-partitioned tables, use time columns or numeric columns to define ranges.
- Low-cardinality fields, such as province names or company names, are better suited as partition keys for list-partitioned tables.
Distributed design:
- Related tables should be placed in the same table group to ensure data co-location and minimize the network overhead of join operations.
Subpartitioning:
- In AP scenarios, subranges can be designed based on time or business dimension fields (for example, by quarter). While low-cardinality fields are acceptable, the partitioning logic must align with the query patterns.