In the design and optimization of a database system, data table design is a crucial task. A well-designed data table not only ensures data integrity and consistency but also significantly improves query performance, storage efficiency, and system scalability. As data volume grows and query patterns become more complex, how to scientifically design data tables, choose appropriate storage structures, and select suitable indexing strategies has become the core issue in database optimization.
This topic will introduce best practices for data table design by covering aspects such as choosing the storage format for a table, designing a table without a primary key, partitioning, primary key design, and using indexes in HTAP scenarios.
Select a table storage format
OceanBase Database supports columnar and hybrid storage formats for tables. You can choose a suitable storage format based on your business scenarios.
- Columnar storage: This storage format is suitable for OLAP scenarios.
- Hybrid storage: This storage format is suitable for HTAP scenarios. In this case, you can perform OLAP and OLTP operations in the same database instance. You can ensure the performance of analytical queries and point queries.
You can set the default storage format for tables in a tenant by using the default_table_store_format parameter.
-- Set the default table storage format to columnar.
ALTER SYSTEM SET default_table_store_format = "column";
-- Set the default table storage format to hybrid.
ALTER SYSTEM SET default_table_store_format = "compound";
Design the primary key
The primary key is a core part of the design of a relational database table. In OceanBase Database, the primary key ensures data uniqueness and optimizes the performance of range queries because data is sorted by the primary key. Different primary key designs are suitable for different business scenarios.
Scenarios without a primary key
Choose a columnar table without a primary key
In a data warehouse or online analytical processing (OLAP) scenario, the business data table contains a large amount of data, and most queries only involve certain columns. If the business does not have special requirements, such as ensuring data uniqueness, avoiding duplicate writes, or performing transactional updates or inserts, creating a columnar table without a primary key is more efficient:
- Data import: In an application processing (AP) scenario, data does not need to be sorted by the primary key. Therefore, the performance of importing large amounts of data is faster compared to a table with a primary key.
- Reduced management cost of the primary key.
show parameters like "%store_format%";
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL, -- Assume that the customer name is at most 100 characters long.
phone_num bigint NOT NULL, -- Store the phone number as a bigint.
city_name varchar(50) NOT NULL, -- Assume that the city name is at most 50 characters long.
sex int NOT NULL, -- Store the gender as an int (for example, 0 for female and 1 for male).
id_number varchar(18) NOT NULL, -- Assume that the ID number is at most 18 characters long.
home_address varchar(255) NOT NULL, -- Assume that the home address is at most 255 characters long.
office_address varchar(255) NOT NULL, -- Assume that the office address is at most 255 characters long.
age int NOT NULL -- Store the age as an int.
);
Scenarios with a primary key
In an AP scenario, the default table type is a heap table, which has the following characteristics:
Data storage: Data is stored in the order of insertion without physical sorting.
Primary key:
- Ensures data uniqueness by including a field that uniquely identifies a record (such as user_id).
- Supports partition key embedding. The primary key must include all partition keys (such as user_id and age) to ensure even data distribution.
- Requires the use of indexes or partitioning strategies to accelerate queries (such as
ORDER BY).
Example: Design a heap-organized table with a primary key.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Partitioning: Hash partitioned by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: partitioned by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age ≥ 60
);
Scenarios with a primary key
- Ensure data uniqueness: When data in the table must be unique, a primary key must be designed.
- Improve query efficiency: A primary key helps the optimizer generate more efficient query plans. When data queries depend on the primary key, it is recommended to include the query fields in the primary key to leverage the primary key for improved query performance.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Partitioning: Hash partitioned by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: partitioned by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age ≥ 60
);
Optimize queries in HTAP scenarios
OceanBase Database provides various methods to support HTAP (Hybrid Transactional/Analytical Processing) scenarios. It can meet both transactional (TP) and analytical (AP) business requirements. OceanBase Database supports rowstore, columnstore, and hybrid rowstore-columnstore storage formats for data tables. It also allows you to specify the storage format for index tables. You can leverage the advantages of rowstore and columnstore to handle different scenarios.
For more information about how to create a columnstore index, see Create a columnstore index in MySQL mode and Create a columnstore index in Oracle mode.
Rowstore base table + columnstore index
Applicable scenario: The table is a wide table, and the business is mainly transactional. However, some analytical queries are required. Optimization strategy: Create a columnstore index to improve the performance of analytical queries.
- Advantages: Only part of the columns are redundantly stored in columnstore format.
- Disadvantages: You need to identify suitable fields based on the query statements to create columnstore indexes. You also need to maintain the index table data during data writes, which may result in lower write performance.
Columnstore base table + rowstore index
Applicable scenario: The business is mainly analytical. However, efficient simple queries (such as point queries) are required. Optimization strategy: Create a covering index to avoid table scans.
- Advantages: Only part of the columns are redundantly stored. You cannot convert a columnstore table to a rowstore table or a hybrid rowstore-columnstore table online. You can add a rowstore index to optimize queries.
- Disadvantages: You need to identify suitable fields based on the query statements to create rowstore indexes. You also need to maintain the index table data during data writes, which may result in lower write performance.
Rowstore-columnstore redundancy: Rowstore base table + columnstore base table
Applicable scenario: Both transactional and analytical queries are required, and the query methods are diverse. Optimization strategy: Redundantly store two base tables. The system generates an appropriate execution plan based on the query characteristics. By default, the system performs range scans in columnstore mode. For point queries, the system falls back to rowstore mode.
- Advantages: You do not need to analyze the query characteristics of analytical queries. The results of analytical queries are strongly consistent.
- Disadvantages: The system consumes more disk space. The execution plans may be inaccurate.
Columnstore replica: 2F1A1C
Applicable scenario: The business is typical HTAP, and you do not want analytical queries to affect transactional queries. However, you can accept weakly consistent reads for analytical queries. Optimization strategy: Use the isolation feature of columnstore replicas to execute queries for AP and TP businesses without affecting each other.
- Advantages:
- You do not need to store redundant data in multiple replicas, which saves disk space.
- AP and TP businesses are isolated.
- Data write performance is not affected.
- Disadvantages:
- Columnstore replicas support only weakly consistent reads.
- You need to provide independent access endpoints.
