In the design and optimization of database systems, data table design is a crucial task. A well-designed table not only ensures data integrity and consistency but also significantly enhances query performance, storage efficiency, and system scalability. As data volumes grow and query patterns become more complex, how to scientifically design data tables, choose appropriate storage structures, and implement effective indexing strategies has become a core issue in database optimization.
This topic will cover best practices for data table design, including the selection of storage formats, columnar storage for tables without a primary key, partitioning design, primary key design, and indexing strategies in HTAP scenarios.
Choose a table storage format
OceanBase Database supports columnar and hybrid storage for tables. You can choose a suitable storage format based on your business scenario.
- Columnar storage: If your business is purely OLAP, choose columnar storage.
- Hybrid storage: If your business is HTAP, which includes both OLAP and OLTP, choose hybrid storage to ensure both query performance and point query performance.
You can set the default table storage format at the tenant level by using the default_table_store_format parameter.
-- Set the default table storage format to columnar.
ALTER SYSTEM SET default_table_store_format = "column";
-- Set the default table storage format to hybrid.
ALTER SYSTEM SET default_table_store_format = "compound";
Design primary keys
Primary key design is a core aspect of relational database table design in OceanBase Database. Primary keys ensure data uniqueness and, since data is sorted by primary keys, they can optimize the performance of range queries. Different primary key design schemes are suitable for different business scenarios.
Scenarios without primary keys
Choose a single-table columnar storage table without a primary key
In data warehouse or online analytical processing (OLAP) scenarios, business data tables often contain a large volume of data, and most queries involve only certain columns. If there are no special requirements for data uniqueness, avoiding duplicate writes, or the need for transactional updates or inserts, creating a columnar storage table without a primary key is more efficient:
- Data import: In AP scenarios, data does not need to be sorted by the primary key, resulting in faster import performance compared to tables with primary keys.
- Reduced management costs of primary keys.
show parameters like "%store_format%";
CREATE TABLE customer (
user_id bigint NOT NULL,
login_time timestamp NOT NULL,
customer_name varchar(100) NOT NULL, -- Assuming the customer name is up to 100 characters
phone_num bigint NOT NULL, -- Storing phone numbers as BIGINT
city_name varchar(50) NOT NULL, -- Assuming the city name is up to 50 characters
sex int NOT NULL, -- Storing gender as INT (e.g., 0 for female, 1 for male)
id_number varchar(18) NOT NULL, -- Assuming the ID number is up to 18 characters
home_address varchar(255) NOT NULL, -- Assuming the home address is up to 255 characters
office_address varchar(255) NOT NULL, -- Assuming the office address is up to 255 characters
age int NOT NULL -- Storing age as INT
);
Scenarios with primary keys
In AP scenarios, the default table type is heap tables, which have the following characteristics:
Data storage: Data is stored in the order of insertion, without physical sorting.
Role of primary keys:
- Ensures data uniqueness by including a field that uniquely identifies each record (e.g., user_id).
- Supports partition key embedding, where the primary key must include all partition keys (e.g., user_id and age) to ensure even data distribution.
- Requires the use of indexes or partitioning strategies to accelerate queries (e.g.,
ORDER BY).
Example: Designing a heap-organized table with a primary key
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partitioning: Hash distributed by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: Divided by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age 60 and above
);
Scenarios with primary keys
- Ensure data uniqueness: When data uniqueness must be guaranteed for each record in the table, a primary key must be designed.
- Improve query efficiency: Primary keys help the optimizer generate more efficient query plans, especially when data queries rely on primary keys. It is recommended to include fields that are always carried in queries as primary key fields to leverage primary keys for performance improvements.
CREATE TABLE customer (
user_id BIGINT NOT NULL,
login_time TIMESTAMP NOT NULL,
customer_name VARCHAR(100) NOT NULL,
phone_num BIGINT NOT NULL,
city_name VARCHAR(50) NOT NULL,
sex INT NOT NULL,
id_number VARCHAR(18) NOT NULL,
home_address VARCHAR(255) NOT NULL,
office_address VARCHAR(255) NOT NULL,
age INT NOT NULL,
-- The primary key includes all partition keys (user_id and age)
PRIMARY KEY (user_id, age, login_time)
)
-- Primary partitioning: Hash distributed by user_id
PARTITION BY HASH(user_id)
PARTITIONS 128
SUBPARTITION BY RANGE(age)
SUBPARTITION TEMPLATE (
-- Example partitions: Divided by age groups
SUBPARTITION p_youth VALUES LESS THAN (25), -- Age < 25
SUBPARTITION p_adult VALUES LESS THAN (40), -- 25 ≤ age < 40
SUBPARTITION p_middle_aged VALUES LESS THAN (60),-- 40 ≤ age < 60
SUBPARTITION p_senior VALUES LESS THAN (MAXVALUE) -- Age 60 and above
);
Optimize queries in HTAP scenarios
OceanBase Database provides various methods to support HTAP (hybrid transactional and analytical processing) scenarios, meeting both transactional (TP) and analytical (AP) business needs. OceanBase Database supports rowstore, columnstore, and hybrid rowstore-columnstore redundancy for data tables, and also allows you to specify the storage format for index tables. You can leverage the advantages of both rowstore and columnstore to handle corresponding scenarios.
For more information about how to create a columnstore index, see Create a columnstore index in MySQL mode and Create a columnstore index in Oracle mode.
Rowstore base table + columnstore index
Applicable scenario: The table is a large wide table, and the business is primarily transactional with some analytical queries. Optimization strategy: Create a columnstore index to improve the performance of analytical queries.
- Advantages: Only part of the columns are redundantly stored, and the storage format is columnstore.
- Disadvantages: You need to identify the appropriate fields based on the query statements to create a columnstore index. Additionally, maintaining the index table data during data writes may slightly impact write performance.
Columnstore base table + rowstore index
Applicable scenario: The business is primarily analytical, but it requires efficient simple queries (such as point queries). Optimization strategy: Create a covering index to avoid table lookups.
- Advantages: Only part of the columns are redundantly stored. Currently, you cannot convert a columnstore table to a rowstore table or a hybrid rowstore-columnstore table online. You can add a rowstore index to optimize queries.
- Disadvantages: You need to identify the appropriate fields based on the query statements to create a rowstore index. Additionally, maintaining the index table data during data writes may slightly impact write performance.
Rowstore-columnstore redundancy: Rowstore base table + columnstore base table
Applicable scenario: The business requires both transactional and analytical queries, and the query methods are diverse. Optimization strategy: Redundantly store two base tables. The system generates an appropriate execution plan based on the query characteristics. By default, the system performs range scans in columnstore mode and point queries in rowstore mode.
- Advantages: You do not need to analyze the query characteristics of analytical queries. The results of analytical queries are strongly consistent.
- Disadvantages: It consumes more disk space. The execution plans may be inaccurate.
Columnstore replica: 2F1A1C
Applicable scenario: The business is a typical HTAP business. You do not want analytical queries to affect transactional queries, but you can accept weakly consistent reads for analytical queries. Optimization strategy: Use the isolation feature of columnstore replicas to execute queries for AP and TP businesses without affecting each other.
- Advantages:
- You do not need to store redundant data in multiple replicas, saving disk space.
- AP and TP businesses are isolated from each other.
- Data write performance is not affected.
- Disadvantages:
- Columnstore replicas support only weakly consistent reads.
- You need to provide an independent access entry.
