In database system design and optimization, table design is a crucial task. A well-designed table not only ensures data integrity and consistency but also significantly improves query performance, storage efficiency, and system scalability. As data volume grows and query patterns become more complex, how to scientifically design tables, choose appropriate storage structures, and implement effective indexing strategies has become the core of database optimization.
This topic will cover best practices for data table design, including selecting table storage formats, designing columnar tables without primary keys, partitioning strategies, primary key design, and indexing in HTAP scenarios.
Choose the storage format of a table
OceanBase Database supports storing table data in columnar format or in a redundant format that combines rowstore and columnar storage. You can choose an appropriate storage format based on your business scenarios.
- Columnar storage: Use columnar storage for OLAP scenarios.
- Redundant storage: Use redundant storage for HTAP scenarios that contain both OLAP and OLTP scenarios. In this case, you can ensure the performance of point queries in addition to the performance of analysis queries.
You can set the default storage format of tables in a tenant by using the default_table_store_format parameter.
-- Set the default storage format of tables in the tenant to columnar storage.
ALTER SYSTEM SET default_table_store_format = "column";
-- Set the default storage format of tables in the tenant to redundant storage.
ALTER SYSTEM SET default_table_store_format = "compound";
Design the primary key
The primary key is a core part of the table design in a relational database. In OceanBase Database, the primary key ensures the uniqueness of data. In addition, data is stored in the order of the primary key, which allows you to optimize the performance of range queries. Different primary key design schemes are applicable to different business scenarios.
No primary key
Single-table columnar storage without a primary key
In a data warehouse or online analytical processing (OLAP) scenario, the amount of business data in the business data table is very large, and most queries involve only some columns. If the business has no special requirements, such as ensuring the uniqueness of data, avoiding duplicate writes, or performing transactional updates or inserts, creating a columnar storage table without a primary key is more efficient:
- Data import: In an AP scenario, data does not need to be sorted by the primary key. Therefore, the import performance of a columnar storage table without a primary key is higher than that of a columnar storage table with a primary key.
- Reduce the cost of primary key management.
show parameters like "%store_format%";
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL, -- Assume that the customer name is at most 100 characters long.
phone_num bigint NOT NULL, -- Store the phone number as a bigint.
city_name varchar(50) NOT NULL, -- Assume that the city name is at most 50 characters long.
sex int NOT NULL, -- Store the gender as an int. For example, 0 indicates female, and 1 indicates male.
id_number varchar(18) NOT NULL, -- Assume that the ID number is at most 18 characters long.
home_address varchar(255) NOT NULL, -- Assume that the home address is at most 255 characters long.
office_address varchar(255) NOT NULL, -- Assume that the office address is at most 255 characters long.
age int NOT NULL -- Store the age as an int.
);
With primary key
In an AP scenario, a heap table is used by default. The features of a heap table are as follows:
Data storage: Data is stored in the order of insertion without physical sorting.
Primary key:
- Ensures the uniqueness of data. The primary key must contain a field that can uniquely identify a record (such as user_id).
- Supports partition key embedding. The primary key must contain all partition keys (such as user_id and age) to ensure even data distribution.
- Accelerates queries in conjunction with indexes or partitioning strategies (such as
ORDER BY).
Example: Design a heap-organized table with a primary key
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key contains all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partition: distributed by hash on user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partition: partition by age group
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age ≥ 60
);
Add a primary key
- Ensure the uniqueness of data: If the data in the table must ensure the uniqueness of each record, you must design a primary key.
- Improve query efficiency: A primary key helps the optimizer generate more efficient query plans. If a query depends on the primary key, we recommend that you include the query fields in the primary key. This way, you can use the primary key to improve query performance.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key contains all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partition: distributed by hash on user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partition: partition by age group
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age ≥ 60
);
Optimize queries in HTAP scenarios
OceanBase Database provides various methods to support HTAP (Hybrid Transactional/Analytical Processing) scenarios, meeting both transactional (TP) and analytical (AP) business requirements. OceanBase Database supports rowstore, columnstore, and hybrid rowstore/columnstore storage formats for data tables, and also allows you to specify the storage format for index tables. You can leverage the advantages of rowstore and columnstore to handle different scenarios.
For more information about how to create a columnstore index, see Create a columnstore index in MySQL mode and Create a columnstore index in Oracle mode.
Rowstore base table + columnstore index
Applicable scenario: The table is a large wide table, and the business is primarily transactional with some analytical queries. Optimization strategy: Create a columnstore index to improve the performance of analytical queries.
- Advantages: Only some columns are redundantly stored in columnstore format.
- Disadvantages: You need to identify suitable columns based on the query statements to create columnstore indexes. Also, maintaining index table data during data writing may slightly reduce write performance.
Columnstore base table + rowstore index
Applicable scenario: The business is primarily analytical, but efficient simple queries (such as point queries) are required. Optimization strategy: Create a covering index to avoid table lookups.
- Advantages: Only some columns are redundantly stored. Currently, you cannot convert a columnstore table to a rowstore table or a hybrid rowstore/columnstore table online. You can create a rowstore index to optimize queries.
- Disadvantages: You need to identify suitable columns based on the query statements to create rowstore indexes. Also, maintaining index table data during data writing may slightly reduce write performance.
Hybrid rowstore/columnstore: Rowstore base table + columnstore base table
Applicable scenario: Both transactional and analytical queries are required, and the query methods are diverse. Optimization strategy: Redundantly store two base tables. The system generates an appropriate execution plan based on the query characteristics. By default, range scans use columnstore mode, and point queries use rowstore mode.
- Advantages: You do not need to analyze the characteristics of analytical queries. The results of analytical queries are strongly consistent.
- Disadvantages: It occupies more disk space, and the execution plans may be inaccurate.
Columnstore replica: 2F1A1C
Applicable scenario: Typical HTAP business. You do not want analytical queries to affect transactional queries, but you can accept weakly consistent reads for analytical queries. Optimization strategy: Use the isolation characteristics of columnstore replicas to execute queries for AP and TP businesses without affecting each other.
- Advantages:
- You do not need to store redundant data in multiple replicas, saving disk space.
- AP and TP businesses are isolated from each other.
- Data write performance is not affected.
- Disadvantages:
- Columnstore replicas support weakly consistent reads.
- You need an independent access entry.