Data model design|V4.2.1|OceanBase Database| docs|Distributed Database

Data model design

Last Updated：2024-08-13 01:52:31 Updated

This topic describes how to create and manage database objects in OBKV-HBase.

Notice

Before you perform operations related to database objects, make sure that you have created a MySQL tenant.

Database

An OBKV cluster is actually an OceanBase cluster for a MySQL tenant. A database is a collection of database objects used for privilege management and namespace-based resource isolation.

Before you create a data table, you need to create a database in the MySQL tenant. For more information about how to create and manage databases, see the following topics:

Table group

A table group is a logical concept. It represents a collection of tables. By default, data is randomly distributed to the tables in a table group. By defining a table group, you can control the physical closeness among a group of tables.
When implementing the HBase model, OBKV-HBase maps a table in HBase as a table group in OceanBase Database. Assume that HBase has a table named htable1. You need to create a table group named htable1 in OceanBase Database.

Notice

All tables in the same table group must use the same partitioning rules.

Before you create a data table, you must create a corresponding table group.

CREATE TABLEGROUP htable1;

For more information about table groups, see the following topics:

Table

OBKV-HBase maps a column family in HBase as a normal table in OceanBase Database. Before you use HBase, you must create a table group and a table in the database first.

The following code shows a simple table creation statement. Assume that HBase has a table named htable1 with a column family named family1. You need to create a table group named htable1 and a normal table named htable1$family1 in OceanBase Database, and bind the table with the table group. A normal table is named in the format of TableGroupName$FamilyName.

Notice

At present, OBKV-HBase does not support multiple column families. One table group can be bound with only one table. However, you can perform get and scan operations independent of the column family on the client.

CREATE TABLEGROUP htable1;

create table htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V varbinary(1048576) NOT NULL,
  primary key(K, Q, T))
TABLEGROUP =  htable1
partition by key(K) partitions 97;

Though OBKV-HBase supports the schemaless feature of HBase, any model implemented based on OceanBase Database involves schemas. The preceding table creation statement is the physical storage model of OBKV-HBase.

where:

htable1 specifies the name of the HBase table. You can use a custom name.
family1 specifies the name of the column family. You can use a custom name.

Note

The table name and column family name are joined by using a dollar sign ($) as the table name in OceanBase Database.
The K column stores the rowkeys of the HBase table.
The Q column stores column qualifiers.
The T column stores timestamps, which are the number of milliseconds since 1970-01-01 UTC.
The V column stores values of the varbinary type, whose maximum length is 1 MB. If the length is insufficient, the longblob type can be used.
The K, Q, and T columns comprise a composite primary key to identify a cell in the HBase model.

Notice

The column names are fixed to K, Q, T, and V, and cannot be modified.

In the relational table created by using the preceding statement in OceanBase Database, data from multiple columns of a single row in the HBase table is stored in adjacent rows. Each of the rows actually stores a cell, which is a <row, column family, column qualifier, timestamp, value> tuple, in the HBase table.

LOB

HBase is a KV database, in which user data is serialized into binary values. As a result, the amount of data in HBase can be very large. OceanBase Database does not support large objects (LOBs) that are larger than 512 MB.

As we all know, the DML performance of LOBs is low and LOBs should be avoided as much as possible. Normally, we recommend that you use the varbinary type for the V column, whose length is 1 MB at most, when you define the schema. If you need to use LOBs larger than 1 MB in actual business scenarios, you can specify the longblob type for the V column.

create table htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V longblob NOT NULL,
  primary key(K, Q, T))
TABLEGROUP =  htable1
partition by key(K) partitions 97;

Index

After a table is created, you can create indexes on one or more columns of the table to accelerate statement execution. Well-designed indexes can reduce physical or logical I/O operations.

OceanBase Database supports global indexes, local indexes, unique indexes, and non-unique indexes.

Notice

At present, OBKV-HBase does not support secondary indexes.

For more information, see the following topics:

Partition

The preceding table creation procedure does not involve partitioning. The created table is a non-partitioned table and applies only to scenarios with a very small amount of data. Generally, HBase business tables contain large amounts of data. Therefore, you need to create a partitioned table in OceanBase Database. At present, the HBase model of OceanBase Database does not support automatic region splitting as in Apache HBase. Therefore, you need to select a partitioning method when you create a table.

OceanBase Database provides multiple partitioning strategies that specify how data is distributed across partitions. The basic partitioning strategies supported by OceanBase Database are RANGE, LIST, and HASH.

A single-level partitioning strategy uses only one data distribution method. For example, during single-level partitioning, you can use LIST partitioning or RANGE partitioning, but you cannot use them at the same time.

In composite partitioning, a table is partitioned by using a data distribution method and then subpartitioned by using another data distribution method.

For more information, see Partitioning types.

Notice

All tables in the same table group must use the same partitioning rules.

You can determine which partitioning method to use based on the following sections.

KEY partitioning

If your business system does not scan the scope, or if you only need the Get API, you can use KEY partitioning. The number of partitions must be odd, preferably a prime number. Recommended values are 97, 193, and 389. KEY partitioning features even data distribution, without data skew or hotspots. Here is an example:

CREATE TABLE htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V varbinary(1048576) NOT NULL,
  PRIMARY KEY(K, Q, T))
  TABLEGROUP =  htable1
PARTITION BY KEY(K) PARTITIONS 97;

Virtual columns and KEY partitioning

If you must use scan in your business system but the scan scope is a fixed-length prefix (prefix scan) of the rowkey columns, you can define a virtual column on a rowkey column of the table and apply KEY partitioning for this virtual column. This way, only one partition is scanned each time. Unless in the case of prefix data skew, this partitioning method can avoid data hotspots. Here is an example:

CREATE TABLE htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V varbinary(1048576),
  K_PREFIX varbinary(1024) GENERATED ALWAYS AS (substring(K, 1, 4)),
  PRIMARY KEY(K, Q, T))
  TABLEGROUP =  htable1
PARTITION BY KEY(K_PREFIX) PARTITIONS 97;

The substring(K, 1, 4) function takes the first four bytes of K (rowkey of the HBase table) as a substring. Here, the number 4 must be replaced with the prefix length based on the characteristics of the business rowkey.

Note

At present, virtual columns can be built only by using substrings.

RANGE partitioning

If you need both prefix scan and the Get API, you must use RANGE partitioning. We recommend that you first estimate or sample the distribution of rowkeys and then determine the range definitions. If the estimate is unreasonable, RANGE partitioning can easily cause data skew and hotspots. Therefore, you must avoid RANGE partitioning. Here is an example:

CREATE TABLE htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V varbinary(1048576),
  PRIMARY KEY(K, Q, T)
)
TABLEGROUP =  htable1
PARTITION BY RANGE columns (K)
(
  PARTITION p0 VALUES LESS THAN ('a'),
  PARTITION p1 VALUES LESS THAN ('w'),
  PARTITION p2 VALUES LESS THAN MAXVALUE);

The value range of the rowkeys is divided into three segments by "a" and "w".

Manage partitions

For more information about partitions, see the following topics:

Primary key design

Design a KEY partitioning key

In OBKV-HBase, a properly designed partitioning key can effectively improve the query performance. As we all know, the performance will be compromised to ensure the atomicity, consistency, isolation, and durability (ACID) of distributed transactions and the snapshot consistency in distributed reads in a distributed system. This is why many open-source NoSQL databases do not support cross-row transactions and do not provide global snapshot consistency reads.

For performance purposes, OBKV-HBase does not support distributed transactions or global snapshot consistency reads. However, OBKV-HBase supports node-level transactions and snapshot consistency reads. Open-source HBase supports only single-row transactions. You can design a partitioning key to limit business requests to a single partition. This can achieve the performance and consistency of a standalone cluster while ensuring scalability.

The partitioning key is the prefix of the primary key. For example, in a user message system:

Each user is identified by a 64-bit user ID, namely, userid.
Each message is identified by a 64-bit message ID, namely, msgid.

The primary key is userid | msgid and the partitioning key is defined as follows:

CREATE TABLE htable1$family1 (
  K varbinary(1024),
  Q varbinary(256),
  T bigint,
  V varbinary(1048576),
  K_PREFIX varbinary(1024) GENERATED ALWAYS AS (substring(K, 1, 64)),
  PRIMARY KEY(K, Q, T))
  TABLEGROUP =  htable1
  PARTITION BY KEY(K_PREFIX) PARTITIONS 97;

As shown in the preceding definition, the substring function is used to extract a fixed-length partitioning key from the part of the primary key in the K column. All the scan, get, put, and batch operations of the same user will access the same partition. OBKV-HBase supports cross-row transactions and consistency reads within a single partition.

Considerations on the number of KEY partitions

Why should we pay attention to the number of partitions

Adjusting the number of partitions is a DDL operation, which takes a cost to execute.
An excessively small number is inappropriate. The balancing mechanism of a database ensures that the difference in partition quantities between two nodes does not exceed 1. Assume that the database has three nodes and seven partitions. The partition distribution on the three nodes will be <3,2,2>, which means a partition imbalance degree of (3-2)/3 = 33%. Provided that data skew does not exist, the number of partitions is almost directly proportional to the stored data amount and accessed data amount. Therefore, partition imbalance will lead to unbalanced resource usage among the nodes.
An excessively large number is also inappropriate. A partition is similar to a miniDB. In other words, a database consists of many miniDBs. Each miniDB independently manages its storage space and data access. A large number of miniDBs will lead to insufficient sharing of the storage space. A scan operation may span multiple miniDBs, causing performance loss. Batch business operations cannot be directly pushed down to the storage for performance optimization.

Impact of the number of partitions on performance

Generally, the partition size does not obviously affect performance.
Partition-level tasks, such as major compactions, migration, and backup, are mainly affected by the partition size. If partitions are too large, the execution time of a single task will be long and more storage space is required.

How to determine the number of partitions

We recommend that the amount of data in a single partition of a single replica do not exceed 100 GB. When you determine the number of partitions, we recommend that you select a prime number, such as 23, 59, 97, 193, 389, or 997. For example, for a cluster with three replicas that occupy 24 TB of storage space, 8 TB is occupied by each replica. It can be estimated that at least 80 partitions are required in this cluster. Therefore, you can select the number 97.
Adjusting the number of partitions is a DDL operation with a certain overhead. When you design the number of partitions, we recommend that you perform partition planning in advance to avoid frequent adjustment of the number. Assume that the current storage space is 24 TB and it is expected that the storage space will be scaled out to 50 TB in the coming year. You can design the number of partitions based on the storage space of 50 TB. Therefore, we recommend that you specify 193 partitions in this case.
The number of partitions cannot be excessively large. We recommend that you specify a number less than or equal to 997.

How to determine data skew among partitions

At present, the MurmurHash algorithm is used for KEY partitioning. The algorithm hashes the partitioning key value of a data record and then delivers the data record to a corresponding partition. In extreme conditions, for example, if 10 million data records out of 100 million data records have the same partitioning key value, severe data skew occurs. If only thousands of data records have the same partitioning key value in the scenario with 100 million data records in total, data skew does not occur. This is because the hashed partitioning key values are rather random when the sample is large enough. From the global perspective, the partitions contain similar amounts of data.

Considerations on the design of the primary key

Scattering of the primary key

The scattering of the primary key is essential in a distributed system. If you can evenly distribute requests to the servers in a cluster, you can make full use of the computing capabilities of the distributed system.

If RANGE partitioning is used, we recommend that you do not use an auto-increment column or a time column as the prefix of the primary key. For example, if the primary key is designed as timestamp | msgid in a message system, almost all write requests will be sent to one server.
We recommend that you do not use a prefix with a large data base as the prefix of the primary key. For example, if the primary key is designed as msgtype | msgid in a message system, almost all read and write requests will be sent to one server due to a few enumeration values, which means a large amount of data under each value, of msgtype.

Design of the primary key

We recommend that the primary key do not exceed 1 KB in length. An excessively long primary key will cause redundant storage and downgrade the access performance. Typically, the primary key cannot be of a LOB type.
If RANGE partitioning is used and the prefix of the primary key is an auto-increment column or a time column, data access hotspots will occur. In this case, you can hash the auto-increment column and then use the hashed value as the new prefix of the primary key. For example, you can change a non-optimal primary key timestamp | msgid to hashvalueoftime | timestamp | msgid. In KEY partitioning, the partitioning key values are hashed to scatter data to partitions. Therefore, the hash value of the prefix does not need to be extracted.

References

For more information about database objects in OceanBase Database, see Overview.
For more information about database object operations in OceanBase Database, see topics under Database object management.
For more information about the design specifications and limitations of OceanBase Database, see topics under Database design specifications and practices.