In the design and optimization of database systems, data table design is a crucial task. A well-designed table not only ensures data integrity and consistency but also significantly improves query performance, storage efficiency, and system scalability. As data volumes grow and query patterns become more complex, how to scientifically design data tables, choose appropriate storage structures, and select suitable indexing strategies has become the core issue in database optimization.
This topic will cover best practices for data table design, including selecting the storage format, designing tables without primary keys, partitioning strategies, primary key design, and indexing in HTAP scenarios.
Choose the storage format for a table
OceanBase Database supports columnar and hybrid storage for tables. You can choose the appropriate storage format based on your business requirements.
- Columnar storage: Use columnar storage for OLAP scenarios.
- Hybrid storage: Use hybrid storage for HTAP scenarios that require both OLAP and OLTP capabilities.
You can set the default storage format for tables in a tenant by using the default_table_store_format parameter.
-- Set the default storage format for tables in the tenant to columnar storage.
ALTER SYSTEM SET default_table_store_format = "column";
-- Set the default storage format for tables in the tenant to hybrid storage.
ALTER SYSTEM SET default_table_store_format = "compound";
Design primary keys
Primary key design is a core aspect of relational database table design. In OceanBase Database, primary keys ensure data uniqueness and, since data is sorted by primary keys, they can optimize performance for range queries. Different primary key designs are suitable for different business scenarios.
No primary key scenario
Choose a columnstore table without a primary key
In data warehouse or online analytical processing (OLAP) scenarios, business data tables often contain a large volume of data, and most queries involve only specific columns. If there are no special requirements, such as ensuring data uniqueness, avoiding duplicate writes, or performing transactional updates or inserts, creating a columnstore table without a primary key is more efficient:
- Data import: In AP scenarios, data does not need to be sorted by the primary key, resulting in faster import performance compared to tables with primary keys.
- Reduced management costs for primary keys.
show parameters like "%store_format%";
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL, -- Assuming customer names are up to 100 characters long
phone_num bigint NOT NULL, -- Storing phone numbers as BIGINT
city_name varchar(50) NOT NULL, -- Assuming city names are up to 50 characters long
sex int NOT NULL, -- Storing gender as INT (e.g., 0 for female, 1 for male)
id_number varchar(18) NOT NULL, -- Assuming ID numbers are up to 18 characters long
home_address varchar(255) NOT NULL, -- Assuming home addresses are up to 255 characters long
office_address varchar(255) NOT NULL, -- Assuming office addresses are up to 255 characters long
age int NOT NULL -- Storing age as INT
);
Scenarios with a primary key
In AP scenarios, the default table type is heap tables, characterized by:
Data storage: Data is stored in the order of insertion without physical sorting.
Role of primary key:
- Ensures data uniqueness by including a field that uniquely identifies each record (e.g., user_id).
- Supports partition key embedding, where the primary key must include all partition keys (e.g., user_id and age) to ensure even data distribution.
- Requires the use of indexes or partitioning strategies to accelerate queries (e.g.,
ORDER BY).
Example: Designing a heap-organized table with a primary key
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Main partitioning: Hash partitioned by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: Partitioned by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age 60 and above
);
Scenarios with a primary key
- Ensure data uniqueness: When the uniqueness of each record in the table needs to be guaranteed, a primary key must be designed.
- Improve query efficiency: A primary key helps the optimizer generate more efficient query plans, especially when data queries rely on the primary key. It is recommended to include the fields that are always used in queries as primary key fields to leverage the primary key for performance improvement.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Main partitioning: Hash partitioned by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: Partitioned by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age 60 and above
);
Optimize queries in HTAP scenarios
OceanBase Database provides various methods to support hybrid transactional/analytical processing (HTAP) scenarios and meets both transactional (TP) and analytical (AP) business requirements. OceanBase Database supports rowstore, columnstore, and hybrid rowstore/columnstore redundancy for data tables and allows you to specify the storage format for index tables. You can combine the advantages of rowstore and columnstore to process scenarios.
For more information about how to create a columnstore index, see Create a columnstore index in MySQL mode and Create a columnstore index in Oracle mode.
Rowstore base table + columnstore index
Scenarios: The transactional business is the main business, and the analytical query requirements are relatively simple. The table is a large wide table. Optimization strategy: Create a columnstore index to improve the performance of analytical queries.
- Advantages: Only some columns are redundantly stored, and the storage format is columnstore.
- Disadvantages: You need to identify the fields to be indexed based on the query statements, and maintain the index table data during data writing. The write performance is relatively poor.
Columnstore base table + rowstore index
Scenarios: The analytical business is the main business, and the transactional business is relatively simple. You need to support efficient simple queries, such as point queries. Optimization strategy: Create a covering index to avoid table scans.
- Advantages: Only some columns are redundantly stored. Currently, you cannot convert a columnstore table to a rowstore table or a hybrid rowstore/columnstore table online. You can create a rowstore index to optimize queries.
- Disadvantages: You need to identify the fields to be indexed based on the query statements, and maintain the index table data during data writing. The write performance is relatively poor.
Rowstore and columnstore redundancy: rowstore base table + columnstore base table
Scenarios: You need to meet both transactional and analytical queries and support various query methods. Optimization strategy: Redundantly store the base table data in rowstore and columnstore. Let the system generate an appropriate execution plan based on the query characteristics. By default, the system performs range scans in columnstore mode and point queries in rowstore mode.
- Advantages: You do not need to analyze the query characteristics of analytical queries. The results of analytical queries are strongly consistent.
- Disadvantages: The system consumes more disk space. The execution plan may be inaccurate.
Columnstore replica: 2F1A1C
Scenarios: This is a typical HTAP scenario. You do not want analytical queries to affect transactional queries, but you can accept the weak consistency of analytical queries. Optimization strategy: You can use the isolation feature of columnstore replicas to execute analytical and transactional queries independently.
Advantages:
- You do not need to store redundant data in multiple replicas, saving disk space.
- Analytical and transactional queries are isolated.
- The data write performance is not affected.
Disadvantages:
- Columnstore replicas support weak consistency reads.
- You need to provide an independent access entry.
