This topic outlines key considerations and best practices for developing with OBKV-Table. Before you begin, make sure you understand your business use cases and the product positioning of OBKV-Table.
Note
OBKV-Table supports only Java clients.
Recommendations
Batch operations
If the data in a batch is not logically related, avoid spanning multiple partitions with a single batch. The OBKV-Table client provides a partitioning interface. When using batch operations, calculate the partition ID first, then group and submit batches for each partition separately. This approach enables single-node transactions with minimal performance overhead and simplifies retry logic.
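As a rough sketch of this pattern in Java, grouping records by partition before submitting per-partition batches might look like the following. The `partitionIdOf` helper here is a hypothetical stand-in for the client's partitioning interface, using a plain hash for illustration only:

```java
import java.util.*;
import java.util.stream.*;

public class BatchByPartition {
    // Hypothetical stand-in for the client's partitioning interface:
    // the real OBKV-Table client maps a row key to its partition ID;
    // here we approximate it with a simple hash for illustration.
    static int partitionIdOf(String rowKey, int partitionCount) {
        return Math.floorMod(rowKey.hashCode(), partitionCount);
    }

    // Group row keys by partition ID so each batch touches a single partition.
    static Map<Integer, List<String>> groupByPartition(List<String> rowKeys, int partitionCount) {
        return rowKeys.stream()
                .collect(Collectors.groupingBy(k -> partitionIdOf(k, partitionCount)));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("order-1", "order-2", "order-3", "order-4");
        Map<Integer, List<String>> batches = groupByPartition(keys, 8);
        // Submit one batch per partition here (the actual batch call is omitted).
        batches.forEach((pid, batch) ->
                System.out.println("partition " + pid + " -> " + batch));
    }
}
```

Each resulting batch stays within one partition, so the server can execute it as a single-node transaction, and a failed batch can be retried independently of the others.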
Indexes
Suggestions on using indexes:
- Avoid global indexes unless necessary. Maintaining global indexes during insert, update, or delete operations on the primary table incurs considerable distributed transaction overhead.
- Although local indexes are less costly to maintain than global indexes, keep their number under control. More indexes mean higher maintenance costs.
Partitioning strategy
Here we use key partitioning as an example.
Partitioning key design
The primary goal when designing partitioning keys in OBKV-Table is to ensure that requests for your business scenario do not cross partitions. Some guidelines for choosing partitioning keys:
The partitioning key is typically the primary key or a prefix of the primary key.
- If the partitioning key is the primary key: For example, in an order scenario, you might set orderid as the primary key. If queries are always exact lookups by orderid, use orderid as the partitioning key for efficient access to all or part of the information for a specific order.
- If the partitioning key is a primary key prefix: For example, in an order scenario, you might use a composite primary key of userid and orderid. If queries commonly scan by userid to retrieve all or some orders for a user, use userid as the partitioning key.
Queries in your business scenarios should always specify the partitioning key.
Partition count design
Why the number of partitions matters
- Changing the number of partitions requires modifying the data model (the table schema or DDL) and redistributing data, which is costly.
- Avoid having too few partitions: The database's balancing algorithm ensures that the difference in partition counts between any two nodes does not exceed one. For example, with 3 nodes and 7 partitions, the distribution could be <3,2,2>, resulting in a partition imbalance of (3-2)/3 = 33%. Assuming no data skew, the number of partitions is nearly proportional to both storage and access volume, and partition imbalance will also lead to uneven resource usage across nodes. Therefore, too few partitions is not recommended.
- Avoid having too many partitions: You can think of each partition as a mini-database, with multiple mini-databases within a single database. Each mini-database manages its own storage and data access independently. This structure means that excessive mini-databases can prevent optimal sharing of storage space, and scan operations may span multiple databases, causing some performance loss. Batch operations may also be harder to optimize for maximum performance at the storage layer. So, having too many partitions is also not recommended.
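The imbalance arithmetic from the 3-node, 7-partition example above can be sketched as a small calculation, assuming the balancer spreads partitions as evenly as possible so that per-node counts differ by at most one:

```java
public class PartitionImbalance {
    // With P partitions spread as evenly as possible over N nodes,
    // some nodes hold ceil(P/N) partitions and the rest hold floor(P/N).
    // Imbalance is (max - min) / max, as in the 3-node, 7-partition example.
    static double imbalance(int partitions, int nodes) {
        int max = (partitions + nodes - 1) / nodes; // ceil(P/N)
        int min = partitions / nodes;               // floor(P/N)
        if (partitions % nodes == 0) return 0.0;    // perfectly even
        return (double) (max - min) / max;
    }

    public static void main(String[] args) {
        System.out.printf("7 partitions on 3 nodes: %.0f%%%n", imbalance(7, 3) * 100);
        System.out.printf("97 partitions on 3 nodes: %.0f%%%n", imbalance(97, 3) * 100);
    }
}
```

With 7 partitions on 3 nodes the imbalance is 33%, while with 97 partitions it drops to about 3%, which illustrates why a very small partition count leads to uneven resource usage.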
How partition count affects performance
Generally, the number of partitions does not significantly impact performance. However, for tasks such as major compaction, migration, or backup—which operate at the partition level—partition count does matter. Having too few partitions makes each partition very large, which can increase task execution time and require more resources.
How to choose partition count
- For capacity planning, we recommend keeping the data size per partition per replica below 100 GB. When selecting a partition count, choose a prime number such as 23, 59, 97, 193, 389, or 997. For example, with three replicas and 24 TB of total storage, each replica holds 8 TB of data; at 100 GB per partition, you need at least 80 partitions. Based on the recommended values, choose 97 partitions.
- Changing the partition count requires DDL changes, which incurs some overhead. Plan your partitioning strategy in advance to avoid frequent changes. For example, if your current storage is 24 TB but you expect it to grow to 50 TB within a year, design for 50 TB and choose 193 partitions.
- Do not use too many partitions; generally, do not exceed 997.
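The capacity-planning arithmetic above can be sketched as a small helper. It assumes decimal units (1 TB = 1000 GB) to match the figures in the examples, and uses the recommended prime list from this section:

```java
import java.util.Arrays;

public class PartitionCount {
    // Recommended prime partition counts from this section.
    static final int[] RECOMMENDED = {23, 59, 97, 193, 389, 997};

    // Minimum partitions so the data per partition per replica stays at or
    // below maxGb. Uses decimal units (1 TB = 1000 GB) as in the doc examples.
    static int minPartitions(double totalTb, int replicas, double maxGb) {
        double perReplicaGb = totalTb * 1000 / replicas;
        return (int) Math.ceil(perReplicaGb / maxGb);
    }

    // Smallest recommended prime at or above the minimum, capped at 997.
    static int choose(int minParts) {
        return Arrays.stream(RECOMMENDED).filter(p -> p >= minParts).findFirst().orElse(997);
    }

    public static void main(String[] args) {
        // 24 TB total, 3 replicas -> 8 TB per replica -> at least 80 partitions.
        System.out.println(choose(minPartitions(24, 3, 100)));
        // Planning for growth to 50 TB -> at least 167 partitions.
        System.out.println(choose(minPartitions(50, 3, 100)));
    }
}
```

Running this reproduces the examples above: 24 TB maps to 97 partitions, and planning ahead for 50 TB maps to 193.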
How to detect data skew across partitions
Key partitioning currently relies on the MurmurHash algorithm, which hashes the partitioning key to assign each row to a partition. Except in extreme cases, such as 100 million rows where 10 million share the same partitioning key, data skew is typically not an issue. For example, with 100 million rows and only a few thousand rows per partitioning key, the randomness of the hash function distributes data fairly evenly across partitions.
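To spot-check skew, you can hash a sample of real partitioning-key values into the planned number of partitions and compare the largest bucket with the average; a ratio near 1.0 indicates an even distribution. This sketch uses String.hashCode as a stand-in for the MurmurHash used by key partitioning:

```java
import java.util.*;

public class SkewCheck {
    // Rough skew check: hash sampled partitioning-key values into P buckets
    // and compare the largest bucket with the average bucket size. The real
    // partitioning uses MurmurHash; String.hashCode is a stand-in here.
    static double maxToAvgRatio(List<String> sampledKeys, int partitions) {
        long[] counts = new long[partitions];
        for (String k : sampledKeys) {
            counts[Math.floorMod(k.hashCode(), partitions)]++;
        }
        long max = Arrays.stream(counts).max().orElse(0);
        double avg = (double) sampledKeys.size() / partitions;
        return max / avg;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) keys.add("user-" + i);
        // A ratio close to 1.0 means the keys spread evenly across partitions.
        System.out.printf("max/avg = %.2f%n", maxToAvgRatio(keys, 97));
    }
}
```

A ratio well above 1 (for example, several times the average) suggests many rows share a few partitioning-key values, which is the extreme case described above.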