Best practices for semi-structured storage performance-OceanBase

This topic describes the best practices for semi-structured storage in OceanBase Database, focusing on how to reduce the storage costs of JSON data using semi-structured encoding technology. By configuring and using this technology appropriately, users can significantly reduce storage space usage while ensuring data integrity.

Background information

Currently, semi-structured data such as JSON has the following two main issues:

Low storage efficiency:
- JSON data is stored as binary strings, which prevents the use of encoding algorithms such as integer compression to reduce storage costs.
- It contains a large amount of redundant metadata, such as repeatedly storing key strings in each JSON record.
Limited query performance:
- When querying specific fields, the entire data must be read, and partial reading is not possible.
- Query performance is limited by the overhead of reading the entire data.

To address these issues, OceanBase has introduced semi-structured encoding storage technology. This technology stores JSON data in a structured manner, significantly improving storage efficiency and providing efficient field-level query capabilities.

How it works

OceanBase's semi-structured encoding storage technology stores JSON data by splitting it into multiple sub-columns, with each sub-column encoded separately. This approach increases encoding compression rates and reduces overall storage space requirements.

An example is as follows:

Original JSON data:

{"id": 11111001, "name": "white", "score": 85.2}
{"id": 11111002, "name": "brown", "score": 93.5}
{"id": 11111003, "name": "mike", "score": 67.8}
{"id": 11111004, "name": "trump", "score": 72.0}

Storage format after splitting:

| id       | name  | score |
| -------- | ----- | ----- |
| 11111001 | white | 85.2  |
| 11111002 | brown | 93.5  |
| 11111003 | mike  | 67.8  |
| 11111004 | trump | 72.0  |

Using this method, the system can apply efficient, type-specific encoding schemes to each column, significantly increasing the compression ratio.

How to use

For instructions on how to use semi-structured encoding storage and how to optimize query performance with conditional filters, see Use semi-structured encoding.

Applicable scenarios

Recommended scenarios

Storage space-sensitive applications: Suitable for environments where storage space is limited or storage costs are a concern.
Scenarios with frequent conditional filtering on JSON fields: For example, configuration management systems, product catalog systems, user profiling systems, and event tracking systems. Semi-structured encoding can significantly improve query performance in these cases.
Scenarios where full JSON content is rarely read after data is written: Suitable for use cases that mainly involve data writing and partial field queries, such as log collection systems, monitoring data storage, and event tracking systems. Semi-structured encoding can optimize partial field queries. For example:
```
{"c1": "v1", "c2": "v2"}
{"c1": "v1", "c3": ["v121", "v22"]}
{"c1": "v3", "c2": "v5"}
{"c1": "v4", "c4": ["v121", "v22"]}
```
When processing this type of data, since the c1 field appears frequently, the system can identify it as a high-frequency field and automatically extract it as a separate column for efficient encoding and storage. This approach helps significantly reduce disk usage and improve query performance.
OBKV-HBase time series scenarios: For time series scenarios, OceanBase provides an optimized OBKV-HBase time series data model based on JSON semi-structured storage. This effectively addresses the issue of repeated storage of K and T fields in the native model, optimizing both write performance and disk usage. See below for details.

Not recommended scenarios

Scenarios with frequent full reads of JSON data: Since the data is split into multiple sub-columns, reading JSON data requires merging these columns back into the original JSON format. Compared to unencoded JSON, this process can significantly impact read performance. Therefore, if frequent full reads of JSON data are required, it is not recommended to enable this feature.
Scenarios with completely heterogeneous JSON data: Because JSON data often lacks a unified structure, encoding such unstructured data makes it difficult to achieve effective compression. In these cases, enabling the feature is not recommended.

Limitations

Due to some unsuitable scenarios, semi-structured encoding is not triggered in all cases. The following situations will prevent it from being enabled:

The storage space after splitting is larger than before: If there is no compression benefit, encoding is unnecessary.
Unable to extract a common schema: If the JSON data is highly heterogeneous, the encoding effect will be poor, so encoding is not triggered.
Presence of outrow stored data: To use this feature, about 95% of the data should be stored in-row.

An example of a public schema that cannot be extracted is as follows:

# Fields are completely different
{"c1": "v1", "c2": "v2"}
{"x1": "v1", "x2": "v2"}

Performance comparison

TPC-H benchmark results

We conducted comprehensive tests on the performance of semi-structured encoding storage, evaluating both storage space and conditional query performance. The tests used the standard TPC-H dataset and compared the results with those of various mainstream database systems to validate the practical effectiveness of semi-structured encoding.

Storage type	Storage space (MB)	Conditional query performance (s)
Relational table	387.086	0.094
Semi-structured JSON	438.405	0.223
JSON Binary	764.124	24.445
MongoDB	1.32	-
Lindorm	639	-
HBase	1.23	-
MySQL	2.10	-

Key findings:

Semi-structured storage reduces space usage by 42% compared to standard JSON storage.
Conditional filtering query performance improves significantly (by up to 100x).
Import and major compaction performance slightly decreases (by about 12-15%).

OBKV-HBase time series model

Notice

This feature is supported starting from V4.3.5 BP2.

Model optimization

For time series scenarios, OceanBase provides an optimized OBKV-HBase time series data model based on JSON semi-structured storage. This effectively addresses the issue of redundant storage of K and T fields in the native model.

Data conversion example:

Original data:

# 
{"K": "name1", "T": 1732206353081, "cf1:QualifierA": "v1"}
{"K": "name1", "T": 1732206353081, "cf1:QualifierB": "v2"}
{"K": "name2", "T": 1732206353082, "cf1:QualifierC": "v3"}
{"K": "name2", "T": 1732206353082, "cf1:QualifierD": "v4"}

Optimized OBKV-HBase storage format:

| K       | Q          | T             | V    |
| ------- | ---------- | ------------- | ---- |
| name1   | QualifierA | 1732206353081 | v1   |
| name1   | QualifierB | 1732206353081 | v2   |
| name2   | QualifierC | 1732206353082 | v3   |
| name2   | QualifierD | 1732206353082 | v4   |

In time series scenarios, the K (row key) and T (timestamp) fields are frequently repeated in the original data, especially when a single put operation writes multiple cells. In this case, the K and T fields are identical, leading to significant disk space waste. Additionally, if each V value contains many repeated qualifiers (column qualifiers), it can further increase storage overhead.

Semi-structured encoding allows us to:

Extract common schemas.
Optimize qualifier storage.
Reduce redundant data storage.

Based on these optimizations, we adjusted the OBKV-HBase storage format. Using the example data, the regenerated data structure is as follows:

-- S represents the current server time when the cell is inserted.
| K       | T             | S             | V                                  |
| ------- | ------------- | ------------- | ---------------------------------- |
| name1   | 1732206353081 | 1732206353081 | {"QualifierA":v1, "QualifierB":v2} |
| name2   | 1732206353082 | 1732206353082 | {"QualifierC":v3, "QualifierD":v4} |

Performance comparison

We tested the storage efficiency of the OBKV-HBase time series model:

Storage type	Disk usage
Semi-structured time series model	466.24 MB
General time series model	654.78 MB

Future plans

Semi-structured encoding storage technology offers broad applicability and scalability. In the future, OceanBase may extend its use to other multi-model types such as GIS, arrays, and vectors. These extensions will be based on the existing semi-structured encoding technology, further enhancing OceanBase’s capabilities in multi-model data processing and providing users with more comprehensive data storage and query solutions.

Appendix

Storage space query methods

This section describes general methods for querying storage space.

System tenant

-- Obtain the tablet_id.
SELECT tablet_id FROM oceanbase.CDB_OB_TABLE_LOCATIONS
WHERE tenant_id=xxxx AND table_name='xxxxx';

-- Query the storage space.
SELECT size/1024.0/1024.0 FROM oceanbase.GV$OB_SSTABLES
WHERE tenant_id=xxxx AND tablet_id=xxxxx;

User tenant

-- Obtain the tablet_id.
SELECT tablet_id FROM oceanbase.DBA_OB_TABLE_LOCATIONS
WHERE table_name='xxxxx';

-- Query the storage space.
SELECT size/1024.0/1024.0 FROM oceanbase.GV$OB_SSTABLES
WHERE tablet_id=xxxxx;

Terminology

Semi-structured encoding: a storage technology that splits JSON data into sub-columns and encodes them separately.
White-box filtering: a conditional filtering operation performed directly on encoded data.
Outrow storage: a mechanism that stores large field data separately.
Inrow storage: a mechanism that stores data along with the main record.

References

Best practices for large object storage performance