
Best practices for integrating Spark Catalog with OceanBase Database

Last Updated: 2025-06-25 06:02:53
What is on this page
Background and architecture
Key advantages of Spark Catalog
OceanBase Connector integration solution
Optimize the resource configuration of a Spark cluster
Hardware resource planning strategy
Key configuration parameter adjustments
Optimization goal verification
OceanBase Catalog configuration practices
Basic connection configuration
Read operation optimization
Write operation optimization
Table management specifications
Production environment recommendations


Background and architecture

Key advantages of Spark Catalog

Introduced in Apache Spark 3.0, Spark Catalog serves as a standardized metadata management interface. Through three core capabilities—unified metadata view, dynamic schema discovery, and standardized APIs—it enables consistent metadata management across heterogeneous data sources. Compared with the traditional Hive Metastore, its key advantages include:

  • Unified metadata views: collaborative metadata management across heterogeneous data sources such as OceanBase, HDFS, and Iceberg.
  • Dynamic schema discovery: the table structure of external data sources is inferred automatically, without predefined schemas.
  • Standardized operation interfaces: DDL/DML operations are executed through a unified API.

OceanBase Connector integration solution

Starting from V1.1, OceanBase Spark Connector has been deeply integrated with Spark Catalog:

  • Open source: The project is fully open source and available on GitHub.
  • Code-free access: Enables seamless integration using only SQL statements, with no need for additional code.
  • Performance improvement: Supports optimization strategies such as adaptive partitioning, parallel read/write, and predicate pushdown.
  • Cross-tenant access: Enables multi-tenant data source mapping through the catalog, allowing cross-business-unit joint queries.
  • Partitioned table optimization: Automatically identifies partitioned tables in OceanBase Database and optimizes their read performance.
  • Automatic schema inference: Automatically discovers and infers the table structure of OceanBase Database.

Optimize the resource configuration of a Spark cluster

Hardware resource planning strategy

Take a server with 128 CPU cores and 1 TB of memory as an example. The following resource configuration strategy is recommended:

  • CPU (128 physical cores): set worker cores to physical cores × 2.5, that is, 128 × 2.5 = 320 cores.
  • Memory (1024 GB): reserve 24 GB for the operating system and allocate the remaining 1000 GB to Spark.
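As a sanity check, the sizing rule above can be expressed in a few lines of Python. The function name and its defaults are purely illustrative, not part of any OceanBase or Spark tooling:

```python
def spark_worker_resources(physical_cores: int, total_memory_gb: int,
                           core_oversubscription: float = 2.5,
                           system_reserve_gb: int = 24) -> tuple[int, int]:
    """Apply the planning strategy above: oversubscribe CPU, reserve system memory."""
    worker_cores = int(physical_cores * core_oversubscription)
    worker_memory_gb = total_memory_gb - system_reserve_gb
    return worker_cores, worker_memory_gb

# The 128-core / 1 TB server from the example:
cores, mem = spark_worker_resources(128, 1024)
print(cores, mem)  # 320 1000
```

The result matches the SPARK_WORKER_CORES and SPARK_WORKER_MEMORY values used in the configuration below.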

Key configuration parameter adjustments

System-level configuration (spark-env.sh): Configure hardware resources.

# Memory resource configuration
export SPARK_WORKER_MEMORY=1000G
# CPU resource configuration
export SPARK_WORKER_CORES=320

Job-level configuration (spark-defaults.conf): Adjust parameters related to Spark jobs.

# Driver resource configuration
spark.driver.cores=2
spark.driver.memory=4g

# Executor resource configuration
spark.executor.memory=16g
spark.executor.cores=4

# Serialization optimization
spark.serializer=org.apache.spark.serializer.KryoSerializer
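One way to sanity-check these job-level settings against the worker resources above: with 320 worker cores and 1000 GB of worker memory, 4-core/16 GB executors are memory-bound. This sketch ignores executor memory overhead, which in practice reduces the count further:

```python
worker_cores, worker_memory_gb = 320, 1000   # from spark-env.sh above
executor_cores, executor_memory_gb = 4, 16   # from spark-defaults.conf above

by_cpu = worker_cores // executor_cores          # executors that fit by CPU
by_mem = worker_memory_gb // executor_memory_gb  # executors that fit by memory
max_executors = min(by_cpu, by_mem)              # memory is the binding constraint
print(by_cpu, by_mem, max_executors)  # 80 62 62
```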

Optimization goal verification

The following metrics are validated through stress testing:

  • Cluster resource utilization reaches at least 95%.
  • Scalability is near-linear: throughput increases roughly in proportion to task parallelism.
  • Stability is confirmed through 72 hours of continuous load testing.

OceanBase Catalog configuration practices

Basic connection configuration

  • spark.sql.catalog.your_catalog_name.driver: the JDBC driver class that Spark uses to connect to OceanBase Database. Set the class name that matches your driver:
    • com.mysql.cj.jdbc.Driver when connecting with the MySQL driver.
    • com.oceanbase.jdbc.Driver when connecting with the OceanBase driver.
    This parameter is optional but recommended: configuring it explicitly ensures that Spark can locate the driver and connect smoothly.
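The driver-class rule above reduces to a small lookup. This helper is purely illustrative (it is not part of the connector), but both class names are the documented ones:

```python
def ob_driver_class(flavor: str) -> str:
    """Map the JDBC driver flavor to the class name to configure."""
    classes = {
        "mysql": "com.mysql.cj.jdbc.Driver",       # MySQL driver
        "oceanbase": "com.oceanbase.jdbc.Driver",  # OceanBase driver
    }
    return classes[flavor]

# e.g. spark.sql.catalog.your_catalog_name.driver=com.oceanbase.jdbc.Driver
print(ob_driver_class("oceanbase"))
```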

Read operation optimization

To enhance Spark's read performance from OceanBase Database, adjust the following parameters based on your actual hardware specifications and resource conditions:

  • spark.sql.catalog.your_catalog_name.fetch-size (default: 100): the number of rows the JDBC driver fetches from OceanBase Database per round trip. Increase it to reduce network round trips and improve the read performance of each Spark task.
  • spark.sql.catalog.your_catalog_name.max_records_per_partition (default: empty): the maximum number of records per partition when Spark reads from OceanBase Database. When left empty, Spark computes the value automatically from the data volume; we recommend that you do not set it manually.
  • spark.sql.catalog.your_catalog_name.parallel_hint_degree (default: 1): the SQL statements that Spark sends to OceanBase Database automatically carry a PARALLEL hint, and this parameter specifies the value of /*+ PARALLEL(n) */. To increase parallelism based on the computing resources of OceanBase Database, a value between 4 and 8 is a reasonable starting point.
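To make parallel_hint_degree concrete, here is a sketch of the kind of hinted SQL the connector is described as emitting. Only the /*+ PARALLEL(n) */ hint itself is documented above; the rewrite logic in this snippet is an assumption for illustration:

```python
def with_parallel_hint(sql: str, degree: int) -> str:
    """Attach a PARALLEL hint to the first SELECT, mirroring how the configured
    parallel_hint_degree is applied (rewrite logic is illustrative)."""
    if degree <= 1:
        return sql  # degree 1 is the default: no extra parallelism requested
    return sql.replace("SELECT", f"SELECT /*+ PARALLEL({degree}) */", 1)

print(with_parallel_hint("SELECT * FROM orders WHERE id > 100", 8))
# SELECT /*+ PARALLEL(8) */ * FROM orders WHERE id > 100
```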

Write operation optimization

You can improve Spark's write performance to OceanBase Database by adjusting the following parameters:

JDBC write optimization

  • spark.sql.catalog.your_catalog_name.batch-size: the number of rows each Spark task accumulates before performing a write operation. Increase it to improve write efficiency.

Optimization of direct load

  • spark.sql.catalog.your_catalog_name.direct-load.batch-size: the number of rows each Spark task accumulates before a write operation is performed. We recommend increasing it to improve write performance.
  • spark.sql.catalog.your_catalog_name.direct-load.parallel (default: 8): the concurrency of the direct load service, which determines the number of CPU cores used for the import task. For large-scale direct load writes, increasing this value can significantly shorten the direct load commit phase and improve overall performance.
  • spark.sql.catalog.your_catalog_name.direct-load.load-method: the direct load mode. Valid values:
    • full (default): full direct load. Suitable only for empty tables or tables with a small amount of data. Imported data is written to the major SSTable; for columnar tables, the data is stored in columnar format after the write completes, providing good query performance.
    • inc: normal incremental direct load, with primary key conflict checks. Supported in OceanBase Database V4.3.2 and later; not supported when direct-load.dup-action is set to REPLACE. Suitable for both empty and non-empty tables.
    • inc_replace: incremental direct load in replace mode. No primary key conflict check is performed; rows with duplicate primary keys are directly overwritten (equivalent to a replace operation). Supported in OceanBase Database V4.3.2 and later; direct-load.dup-action is ignored in this mode. Suitable for both empty and non-empty tables.
    In the inc and inc_replace modes, data is written to the dump, which does not support columnar storage. For columnar tables, perform a major compaction after the import to restore columnar query performance.
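The mode-selection rules above can be summarized as a small decision helper. This is one reasonable simplification for illustration only, not part of the connector API (for example, full can also handle tables with a small amount of existing data):

```python
def choose_load_method(table_empty: bool, need_pk_conflict_check: bool) -> str:
    """Pick a direct-load.load-method value per the guidance above."""
    if table_empty:
        return "full"          # default mode; writes straight to the major SSTable
    if need_pk_conflict_check:
        return "inc"           # incremental load with primary key conflict check
    return "inc_replace"       # no conflict check; duplicate keys are overwritten

print(choose_load_method(table_empty=False, need_pk_conflict_check=True))  # inc
```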

Table management specifications

Non-partitioned tables

  • Limitations:

    1. Index creation is not supported.
    2. Default values for columns cannot be set.

Partitioned tables

  • Compatibility limitations:

    • Spark supports only BUCKET partitioning, which corresponds to KEY partitioning in OceanBase Database. For example:

      CREATE TABLE test.test1 (
        user_id BIGINT COMMENT 'test_for_key',
        name VARCHAR(255)
      )
      PARTITIONED BY (bucket(16, user_id))
      COMMENT 'test_for_table_create'
      TBLPROPERTIES('replica_num' = 2, COMPRESSION = 'zstd_1.0');
      
    • Multi-level partitioning is not supported. Only one level of partitioning is allowed.

  • Recommended practices:

    For complex partitioned tables, it is recommended to create them directly on OceanBase Database. Spark Catalog can automatically recognize the existing table structure.

Production environment recommendations

  1. Parameter tuning

    Dynamically adjust Spark resource configuration and OceanBase Database connection parameters based on business workload. Focus on key metrics such as thread concurrency and batch size.

  2. Partition management

    To improve query efficiency, consider partitioning tables based on business requirements.

  3. Privilege management

    Ensure proper management of user privileges in OceanBase Database and Spark to prevent unauthorized data access.

  4. Version management

    Regularly check for updates and upgrade to the latest version of spark-connector-oceanbase to benefit from performance optimizations and new features.
