OceanBase logo

OceanBase

A unified distributed database ready for your transactional, analytical, and AI workloads.

DEPLOY YOUR WAY

OceanBase Cloud

The best way to deploy and scale OceanBase

OceanBase Enterprise

Run and manage OceanBase on your infra

TRY OPEN SOURCE

OceanBase Community Edition

The free, open-source distributed database

OceanBase seekdb

Open source AI native search database

Customer Stories

Real-world success stories from enterprises across diverse industries.

View All
BY USE CASES

Mission-Critical Transactions

Global & Multicloud Application

Elastic Scaling for Peak Traffic

Real-time Analytics

Active Geo-redundancy

Database Consolidation

Resources

Comprehensive knowledge hub for OceanBase.

Blog

Live Demos

Training & Certification

Documentation

Official technical guides, tutorials, API references, and manuals for all OceanBase products.

View All
PRODUCTS

OceanBase Cloud

OceanBase Database

Tools

Connectors and Middleware

QUICK START

OceanBase Cloud

OceanBase Database

BEST PRACTICES

Practical guides for utilizing OceanBase more effectively and conveniently

Company

Learn more about OceanBase – our company, partnerships, and trust and security initiatives.

About OceanBase

Partner

Trust Center

Contact Us

International - English
中国站 - 简体中文
日本 - 日本語
Sign In
Start on Cloud

A unified distributed database ready for your transactional, analytical, and AI workloads.

DEPLOY YOUR WAY

OceanBase Cloud

The best way to deploy and scale OceanBase

OceanBase Enterprise

Run and manage OceanBase on your infra

TRY OPEN SOURCE

OceanBase Community Edition

The free, open-source distributed database

OceanBase seekdb

Open source AI native search database

Customer Stories

Real-world success stories from enterprises across diverse industries.

View All
BY USE CASES

Mission-Critical Transactions

Global & Multicloud Application

Elastic Scaling for Peak Traffic

Real-time Analytics

Active Geo-redundancy

Database Consolidation

Comprehensive knowledge hub for OceanBase.

Blog

Live Demos

Training & Certification

Documentation

Official technical guides, tutorials, API references, and manuals for all OceanBase products.

View All
PRODUCTS
OceanBase CloudOceanBase Database
ToolsConnectors and Middleware
QUICK START
OceanBase CloudOceanBase Database
BEST PRACTICES

Practical guides for utilizing OceanBase more effectively and conveniently

Learn more about OceanBase – our company, partnerships, and trust and security initiatives.

About OceanBase

Partner

Trust Center

Contact Us

Start on Cloud
编组
All Products
    • Databases
    • iconOceanBase Database
    • iconOceanBase Cloud
    • iconOceanBase Tugraph
    • iconInteractive Tutorials
    • iconOceanBase Best Practices
    • Tools
    • iconOceanBase Cloud Platform
    • iconOceanBase Migration Service
    • iconOceanBase Developer Center
    • iconOceanBase Migration Assessment
    • iconOceanBase Admin Tool
    • iconOceanBase Loader and Dumper
    • iconOceanBase Deployer
    • iconKubernetes operator for OceanBase
    • iconOceanBase Diagnostic Tool
    • iconOceanBase Binlog Service
    • Connectors and Middleware
    • iconOceanBase Database Proxy
    • iconEmbedded SQL in C for OceanBase
    • iconOceanBase Call Interface
    • iconOceanBase Connector/C
    • iconOceanBase Connector/J
    • iconOceanBase Connector/ODBC
    • iconOceanBase Connector/NET
icon

OceanBase Best Practices

All Versions

  • Deploy
    • Configuration guide for read-write splitting in AP scenarios
    • Best practices for read-write splitting
  • Migrate
    • Data transfer solutions in OceanBase Database
    • Overview on data migration
    • Best practices for importing data files to OceanBase Database
    • Best practice for migrating data from other databases to OceanBase Database
    • Massive data migration strategy
    • Best practices for migrating data from MyCat to OceanBase Database
    • Best practices for migrating PostgreSQL to OceanBase MySQL-compatible mode
  • Route
    • ODP routing best practices
  • Table Design
    • Best practices for table design and index optimization
    • Best practices for creating indexes on large tables
    • Best practices for database development
  • Develop
    • Best practices for connecting Java applications to OceanBase Database
    • Best practices for integrating Spark Catalog with OceanBase Database
    • Best practices for achieving optimal performance in batch DML using JDBC and OBServer
    • Best practices for bulk data cleanup in OceanBase Database
    • Best practices for PDML processing in OceanBase Database
    • Best practices for hot tables in OceanBase Database
    • Best practices for auto-increment columns and sequences in OceanBase Database
  • Manage
    • Best practices for resource throttling
    • Best practices for data load balancing
    • Best practices for security certification
    • Best practices for access control
    • Best practices for data encryption
  • Diagnose
    • Best practices for log interpretation in common scenarios
    • Best practices for end-to-end tracing
    • Best practices for using obdiag to collect performance information
    • Best practices for using obdiag to collect diagnostic information of parallel and slow SQL statements
    • Best practices for troubleshooting OceanBase Database performance issues
  • Performance Tuning
    • Best practices for handling slow queries
    • Best practices for collecting statistics to generate an efficient execution plan
    • Best practices for updating hotspot rows
    • Best practices for large object storage performance
    • Best practices for semi-structured storage performance
    • Best practices for OceanBase materialized views
  • Cloud Database
    • Best practices for achieving high availability through cross-cloud active-active deployment
    • High availability through primary and standby databases across clouds
    • High host CPU usage
    • Best practices for read/write splitting in OceanBase Cloud

Download PDF

Configuration guide for read-write splitting in AP scenarios Best practices for read-write splitting Data transfer solutions in OceanBase Database Overview on data migration Best practices for importing data files to OceanBase Database Best practice for migrating data from other databases to OceanBase Database Massive data migration strategy Best practices for migrating data from MyCat to OceanBase Database Best practices for migrating PostgreSQL to OceanBase MySQL-compatible mode ODP routing best practices Best practices for table design and index optimization Best practices for creating indexes on large tables Best practices for database development Best practices for connecting Java applications to OceanBase Database Best practices for integrating Spark Catalog with OceanBase Database Best practices for achieving optimal performance in batch DML using JDBC and OBServer Best practices for bulk data cleanup in OceanBase Database Best practices for PDML processing in OceanBase Database Best practices for hot tables in OceanBase Database Best practices for auto-increment columns and sequences in OceanBase Database Best practices for resource throttling Best practices for data load balancing Best practices for security certification Best practices for access control Best practices for data encryption Best practices for log interpretation in common scenarios Best practices for end-to-end tracing Best practices for using obdiag to collect performance information Best practices for using obdiag to collect diagnostic information of parallel and slow SQL statements Best practices for troubleshooting OceanBase Database performance issues Best practices for handling slow queries Best practices for collecting statistics to generate an efficient execution plan Best practices for updating hotspot rows Best practices for large object storage performance Best practices for semi-structured storage performance Best practices for OceanBase materialized views Best practices for achieving high availability through cross-cloud active-active deployment High availability through primary and standby databases across clouds High host CPU usage Best practices for read/write splitting in OceanBase Cloud
OceanBase logo

The Unified Distributed Database for the AI Era.

Follow Us
Products
OceanBase CloudOceanBase EnterpriseOceanBase Community EditionOceanBase seekdb
Resources
DocsBlogLive DemosTraining & Certification
Company
About OceanBaseTrust CenterLegalPartnerContact Us
Follow Us

© OceanBase 2026. All rights reserved

Cloud Service AgreementPrivacy PolicySecurity
Contact Us
Document Feedback
  1. Documentation Center
  2. OceanBase Best Practices
  3. master
iconOceanBase Best Practices
master
  • master

Best practices for semi-structured storage performance

Last Updated:2025-07-15 09:43:03  Updated
share
What is on this page
Background information
How it works
How to use
Applicable scenarios
Recommended scenarios
Not recommended scenarios
Limitations
Performance comparison
TPC-H benchmark results
OBKV-HBase time series model
Model optimization
Performance comparison
Future plans
Appendix
Storage space query methods
Terminology
References

folded

share

This topic describes the best practices for semi-structured storage in OceanBase Database, focusing on how to reduce the storage costs of JSON data using semi-structured encoding technology. By configuring and using this technology appropriately, users can significantly reduce storage space usage while ensuring data integrity.

Background information

Currently, semi-structured data such as JSON has the following two main issues:

  • Low storage efficiency:
    • JSON data is stored as binary strings, which prevents the use of encoding algorithms such as integer compression to reduce storage costs.
    • It contains a large amount of redundant metadata, such as repeatedly storing key strings in each JSON record.
  • Limited query performance:
    • When querying specific fields, the entire data must be read, and partial reading is not possible.
    • Query performance is limited by the overhead of reading the entire data.

To address these issues, OceanBase has introduced semi-structured encoding storage technology. This technology stores JSON data in a structured manner, significantly improving storage efficiency and providing efficient field-level query capabilities.

How it works

OceanBase's semi-structured encoding storage technology stores JSON data by splitting it into multiple sub-columns, with each sub-column encoded separately. This approach increases encoding compression rates and reduces overall storage space requirements.

An example is as follows:

  • Original JSON data:

    {"id": 11111001, "name": "white", "score": 85.2}
    {"id": 11111002, "name": "brown", "score": 93.5}
    {"id": 11111003, "name": "mike", "score": 67.8}
    {"id": 11111004, "name": "trump", "score": 72.0}
    
  • Storage format after splitting:

    | id       | name  | score |
    | -------- | ----- | ----- |
    | 11111001 | white | 85.2  |
    | 11111002 | brown | 93.5  |
    | 11111003 | mike  | 67.8  |
    | 11111004 | trump | 72.0  |
    

Using this method, the system can apply efficient, type-specific encoding schemes to each column, significantly increasing the compression ratio.

How to use

For instructions on how to use semi-structured encoding storage and how to optimize query performance with conditional filters, see Use semi-structured encoding.

Applicable scenarios

Recommended scenarios

  • Storage space-sensitive applications: Suitable for environments where storage space is limited or storage costs are a concern.
  • Scenarios with frequent conditional filtering on JSON fields: For example, configuration management systems, product catalog systems, user profiling systems, and event tracking systems. Semi-structured encoding can significantly improve query performance in these cases.
  • Scenarios where full JSON content is rarely read after data is written: Suitable for use cases that mainly involve data writing and partial field queries, such as log collection systems, monitoring data storage, and event tracking systems. Semi-structured encoding can optimize partial field queries. For example:
    {"c1": "v1", "c2": "v2"}
    {"c1": "v1", "c3": ["v121", "v22"]}
    {"c1": "v3", "c2": "v5"}
    {"c1": "v4", "c4": ["v121", "v22"]}
    

    When processing this type of data, since the c1 field appears frequently, the system can identify it as a high-frequency field and automatically extract it as a separate column for efficient encoding and storage. This approach helps significantly reduce disk usage and improve query performance.

  • OBKV-HBase time series scenarios: For time series scenarios, OceanBase provides an optimized OBKV-HBase time series data model based on JSON semi-structured storage. This effectively addresses the issue of repeated storage of K and T fields in the native model, optimizing both write performance and disk usage. See below for details.

Not recommended scenarios

  • Scenarios with frequent full reads of JSON data: Since the data is split into multiple sub-columns, reading JSON data requires merging these columns back into the original JSON format. Compared to unencoded JSON, this process can significantly impact read performance. Therefore, if frequent full reads of JSON data are required, it is not recommended to enable this feature.
  • Scenarios with completely heterogeneous JSON data: Because JSON data often lacks a unified structure, encoding such unstructured data makes it difficult to achieve effective compression. In these cases, enabling the feature is not recommended.

Limitations

Due to some unsuitable scenarios, semi-structured encoding is not triggered in all cases. The following situations will prevent it from being enabled:

  • The storage space after splitting is larger than before: If there is no compression benefit, encoding is unnecessary.
  • Unable to extract a common schema: If the JSON data is highly heterogeneous, the encoding effect will be poor, so encoding is not triggered.
  • Presence of outrow stored data: To use this feature, about 95% of the data should be stored in-row.

An example of a public schema that cannot be extracted is as follows:

# Fields are completely different
{"c1": "v1", "c2": "v2"}
{"x1": "v1", "x2": "v2"}

Performance comparison

TPC-H benchmark results

We conducted comprehensive tests on the performance of semi-structured encoding storage, evaluating both storage space and conditional query performance. The tests used the standard TPC-H dataset and compared the results with those of various mainstream database systems to validate the practical effectiveness of semi-structured encoding.

Storage type Storage space (MB) Conditional query performance (s)
Relational table 387.086 0.094
Semi-structured JSON 438.405 0.223
JSON Binary 764.124 24.445
MongoDB 1.32 -
Lindorm 639 -
HBase 1.23 -
MySQL 2.10 -

Key findings:

  • Semi-structured storage reduces space usage by 42% compared to standard JSON storage.
  • Conditional filtering query performance improves significantly (by up to 100x).
  • Import and major compaction performance slightly decreases (by about 12-15%).

OBKV-HBase time series model

Notice

This feature is supported starting from V4.3.5 BP2.

Model optimization

For time series scenarios, OceanBase provides an optimized OBKV-HBase time series data model based on JSON semi-structured storage. This effectively addresses the issue of redundant storage of K and T fields in the native model.

Data conversion example:

Original data:

# 
{"K": "name1", "T": 1732206353081, "cf1:QualifierA": "v1"}
{"K": "name1", "T": 1732206353081, "cf1:QualifierB": "v2"}
{"K": "name2", "T": 1732206353082, "cf1:QualifierC": "v3"}
{"K": "name2", "T": 1732206353082, "cf1:QualifierD": "v4"}

Optimized OBKV-HBase storage format:

| K       | Q          | T             | V    |
| ------- | ---------- | ------------- | ---- |
| name1   | QualifierA | 1732206353081 | v1   |
| name1   | QualifierB | 1732206353081 | v2   |
| name2   | QualifierC | 1732206353082 | v3   |
| name2   | QualifierD | 1732206353082 | v4   |

In time series scenarios, the K (row key) and T (timestamp) fields are frequently repeated in the original data, especially when a single put operation writes multiple cells. In this case, the K and T fields are identical, leading to significant disk space waste. Additionally, if each V value contains many repeated qualifiers (column qualifiers), it can further increase storage overhead.

Semi-structured encoding allows us to:

  • Extract common schemas.
  • Optimize qualifier storage.
  • Reduce redundant data storage.

Based on these optimizations, we adjusted the OBKV-HBase storage format. Using the example data, the regenerated data structure is as follows:

-- S represents the current server time when the cell is inserted.
| K       | T             | S             | V                                  |
| ------- | ------------- | ------------- | ---------------------------------- |
| name1   | 1732206353081 | 1732206353081 | {"QualifierA":v1, "QualifierB":v2} |
| name2   | 1732206353082 | 1732206353082 | {"QualifierC":v3, "QualifierD":v4} |

Performance comparison

We tested the storage efficiency of the OBKV-HBase time series model:

Storage type Disk usage
Semi-structured time series model 466.24 MB
General time series model 654.78 MB

Future plans

Semi-structured encoding storage technology offers broad applicability and scalability. In the future, OceanBase may extend its use to other multi-model types such as GIS, arrays, and vectors. These extensions will be based on the existing semi-structured encoding technology, further enhancing OceanBase’s capabilities in multi-model data processing and providing users with more comprehensive data storage and query solutions.

Appendix

Storage space query methods

This section describes general methods for querying storage space.

System tenant

-- Obtain the tablet_id.
SELECT tablet_id FROM oceanbase.CDB_OB_TABLE_LOCATIONS
WHERE tenant_id=xxxx AND table_name='xxxxx';

-- Query the storage space.
SELECT size/1024.0/1024.0 FROM oceanbase.GV$OB_SSTABLES
WHERE tenant_id=xxxx AND tablet_id=xxxxx;

User tenant

-- Obtain the tablet_id.
SELECT tablet_id FROM oceanbase.DBA_OB_TABLE_LOCATIONS
WHERE table_name='xxxxx';

-- Query the storage space.
SELECT size/1024.0/1024.0 FROM oceanbase.GV$OB_SSTABLES
WHERE tablet_id=xxxxx;

Terminology

  • Semi-structured encoding: a storage technology that splits JSON data into sub-columns and encodes them separately.
  • White-box filtering: a conditional filtering operation performed directly on encoded data.
  • Outrow storage: a mechanism that stores large field data separately.
  • Inrow storage: a mechanism that stores data along with the main record.

References

  • Best practices for large object storage performance

Previous topic

Best practices for large object storage performance
Last

Next topic

Best practices for OceanBase materialized views
Next
What is on this page
Background information
How it works
How to use
Applicable scenarios
Recommended scenarios
Not recommended scenarios
Limitations
Performance comparison
TPC-H benchmark results
OBKV-HBase time series model
Model optimization
Performance comparison
Future plans
Appendix
Storage space query methods
Terminology
References