Background and architecture
Key advantages of Spark Catalog
Introduced in Apache Spark 3.0, Spark Catalog serves as a standardized metadata management interface. Through three core capabilities—unified metadata view, dynamic schema discovery, and standardized APIs—it enables consistent metadata management across heterogeneous data sources. Compared with the traditional Hive Metastore, its key advantages include:
| Key advantage | Technical implementation details |
|---|---|
| Unified metadata views | Supports collaborative metadata management across heterogeneous data sources such as OceanBase, HDFS, and Iceberg. |
| Dynamic schema discovery | Automatically infers the table structure of external data sources without requiring predefined schemas. |
| Standardized operations interfaces | Executes DDL/DML operations through a unified API. |
OceanBase Connector integration solution
Starting from V1.1, OceanBase Spark Connector has been deeply integrated with Spark Catalog:
- Open source: The project is fully open source and available on GitHub.
- Code-free access: Enables seamless integration using only SQL statements, with no need for additional code.
- Performance improvement: Supports optimization strategies such as adaptive partitioning, parallel read/write, and predicate pushdown.
- Cross-tenant access: Enables multi-tenant data source mapping through the catalog, allowing cross-business-unit joint queries.
- Partitioned table optimization: Automatically identifies partitioned tables in OceanBase Database and optimizes their read performance.
- Automatic schema inference: Automatically discovers and infers the table structure of OceanBase Database.
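The SQL-only access path can be illustrated with a short spark-shell sketch. The catalog implementation class, connection URL, credentials, and property names used for registration below are placeholders and assumptions for illustration only; check the connector documentation for your version for the exact values.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: register an OceanBase catalog on the SparkSession and query it
// with plain SQL. The catalog class name, URL, and credentials are illustrative
// assumptions, not verified values for your deployment.
val spark = SparkSession.builder()
  .appName("oceanbase-catalog-demo")
  .config("spark.sql.catalog.ob", "com.oceanbase.spark.catalog.OceanBaseCatalog") // assumed class name
  .config("spark.sql.catalog.ob.url", "jdbc:mysql://127.0.0.1:2881/test")         // placeholder endpoint
  .config("spark.sql.catalog.ob.username", "root@test")                           // placeholder credentials
  .config("spark.sql.catalog.ob.password", "******")
  .getOrCreate()

// Schemas are discovered automatically, so no DDL has to be declared on the Spark side.
spark.sql("SHOW TABLES IN ob.test").show()
spark.sql("SELECT * FROM ob.test.orders WHERE order_date >= '2024-01-01'").show()
```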
Optimize the resource configuration of a Spark cluster
Hardware resource planning strategy
Take a server with 128 CPU cores and 1 TB of memory as an example. The following resource configuration strategy is recommended.
| Hardware component | Physical specification | Spark resource configuration strategy | Calculation example |
|---|---|---|---|
| CPU | 128 cores | Worker cores = physical cores × 2.5 | 128 × 2.5 = 320 cores |
| Memory | 1024 GB | Reserve 24 GB for the operating system and allocate the remaining memory to Spark. | 1024 - 24 = 1000 GB |
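To make the arithmetic in the table explicit, the following trivial sketch derives the values used in the configuration files below; the 2.5 oversubscription factor and the 24 GB OS reservation come straight from the table above.

```scala
// Worked sizing example for a 128-core, 1 TB node, following the strategy above.
val physicalCores  = 128
val workerCores    = (physicalCores * 2.5).toInt // 320 cores exposed to Spark
val totalMemoryGb  = 1024
val workerMemoryGb = totalMemoryGb - 24          // reserve 24 GB for the OS => 1000 GB
println(s"SPARK_WORKER_CORES=$workerCores, SPARK_WORKER_MEMORY=${workerMemoryGb}G")
```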
Key configuration parameter adjustments
System-level configuration (spark-env.sh): Configure hardware resources.
```bash
# Memory resource configuration
export SPARK_WORKER_MEMORY=1000G
# CPU resource configuration
export SPARK_WORKER_CORES=320
```
Job-level configuration (spark-defaults.conf): Adjust parameters related to Spark jobs.
```properties
# Driver resource configuration
spark.driver.cores=2
spark.driver.memory=4g
# Executor resource configuration
spark.executor.memory=16g
spark.executor.cores=4
# Serialization optimization
spark.serializer=org.apache.spark.serializer.KryoSerializer
```
Optimization goal verification
The following metrics are validated through stress testing:
- The cluster resource utilization is greater than or equal to 95%.
- Linear scalability is verified: throughput generally increases as task parallelism increases.
- The stability of the system is verified through 72 hours of continuous load testing.
OceanBase Catalog configuration practices
Basic connection configuration
| Parameter | Description |
|---|---|
| spark.sql.catalog.your_catalog_name.driver | Specifies the JDBC driver class used by Spark to connect to OceanBase Database. Configure the driver class name that matches the JDBC driver you use. |
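As a hedged sketch of how the driver class might be supplied: the two class names below are the standard MySQL Connector/J and OceanBase JDBC driver classes, but verify them against the driver JAR you actually deploy; the catalog name `ob` is just an example.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: selecting the JDBC driver class for the catalog. Verify the class
// name against the driver JAR shipped with your Spark application.
val spark = SparkSession.builder()
  // When connecting with MySQL Connector/J:
  .config("spark.sql.catalog.ob.driver", "com.mysql.cj.jdbc.Driver")
  // Or, when using the OceanBase JDBC driver instead:
  // .config("spark.sql.catalog.ob.driver", "com.oceanbase.jdbc.Driver")
  .getOrCreate()
```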
Read operation optimization
To enhance Spark's read performance from OceanBase Database, adjust the following parameters based on your actual hardware specifications and resource conditions:
| Parameter | Default value | Description | Tuning suggestion |
|---|---|---|---|
| spark.sql.catalog.your_catalog_name.fetch-size | 100 | The number of rows that the JDBC driver fetches from OceanBase Database at a time. | Increase this value as appropriate to reduce the number of network round trips and improve the read performance of each Spark task. |
| spark.sql.catalog.your_catalog_name.max_records_per_partition | Empty | The maximum number of records in a partition when Spark reads data from OceanBase Database. When left empty, Spark automatically calculates this value based on the data volume. | We recommend that you do not set this value manually. |
| spark.sql.catalog.your_catalog_name.parallel_hint_degree | 1 | The value of n in the /*+ PARALLEL(n) */ hint that Spark automatically attaches to the SQL statements it sends to OceanBase Database. | Set this parameter to a value between 4 and 8 to increase parallelism, based on the computing resources available to OceanBase Database. |
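The following sketch applies the read parameters above, assuming a catalog named `ob` registered as in the earlier example; the concrete values are illustrative, not universal recommendations.

```scala
import org.apache.spark.sql.SparkSession

// Read-side tuning sketch: the parameter names come from the table above,
// the values are examples only.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.ob.fetch-size", "2000")        // fewer network round trips per task
  .config("spark.sql.catalog.ob.parallel_hint_degree", "6") // pushed-down SQL carries /*+ PARALLEL(6) */
  .getOrCreate()

// Simple projections and filters are pushed down to OceanBase Database where possible.
spark.sql("SELECT user_id, amount FROM ob.test.orders WHERE amount > 100").show()
```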
Write operation optimization
You can improve Spark's write performance to OceanBase Database by adjusting the following parameters:
JDBC write optimization
| Parameter | Description | Tuning suggestion |
|---|---|---|
| spark.sql.catalog.your_catalog_name.batch-size | The number of rows each Spark task accumulates before performing a write operation. | Increase this value to improve write efficiency. |
Optimization of direct load
| Parameter | Description | Tuning suggestion |
|---|---|---|
| spark.sql.catalog.your_catalog_name.direct-load.batch-size | The number of rows each Spark task accumulates before performing a write operation. | We recommend that you increase this value to improve write performance. |
| spark.sql.catalog.your_catalog_name.direct-load.parallel | The concurrency of the direct load task. The number of CPU cores used to process the current import task depends on this value. Default value: 8. | For large-scale direct load writes, increase this value to significantly shorten the direct load commit phase and improve overall performance. |
| spark.sql.catalog.your_catalog_name.direct-load.load-method | The direct load mode. Valid values: full, inc, and inc_replace. | |
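The following sketch applies the write-side parameters above, again assuming a catalog named `ob`. Only the parameters listed in the tables above are shown, with illustrative values; enabling direct load itself involves additional connector settings that are not covered in this section.

```scala
import org.apache.spark.sql.SparkSession

// Write-side tuning sketch: only parameters documented above are set here;
// the values are examples, not recommendations for every workload.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.ob.batch-size", "5000")              // JDBC write batching
  .config("spark.sql.catalog.ob.direct-load.batch-size", "10000") // rows accumulated per direct-load write
  .config("spark.sql.catalog.ob.direct-load.parallel", "16")      // direct-load concurrency
  .config("spark.sql.catalog.ob.direct-load.load-method", "full") // full, inc, or inc_replace
  .getOrCreate()

// Write through the catalog with plain SQL; the tuned parameters apply to the sink.
spark.sql("INSERT INTO ob.test.orders SELECT * FROM staging.orders_delta")
```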
Table management specifications
Non-partitioned tables
Limitations:
- Index creation is not supported.
- Default values for columns cannot be set.
Partitioned tables
Compatibility limitations:
- Spark supports only `BUCKET` partitioning, which corresponds to `KEY` partitioning in OceanBase Database. For example:

  ```sql
  CREATE TABLE test.test1 (
    user_id BIGINT COMMENT 'test_for_key',
    name VARCHAR(255)
  ) PARTITIONED BY (bucket(16, user_id))
  COMMENT 'test_for_table_create'
  TBLPROPERTIES('replica_num' = 2, COMPRESSION = 'zstd_1.0');
  ```

- Multi-level partitioning is not supported. Only one level of partitioning is allowed.
Recommended practices:
For complex partitioned tables, we recommend creating them directly in OceanBase Database. Spark Catalog automatically recognizes the existing table structure, as shown in the sketch below.
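For example, after a complex partitioned table has been created with OceanBase-native DDL, Spark can reference it through the catalog without any DDL on the Spark side. Table and database names here are placeholders, and `spark` is the session with the `ob` catalog registered as in the earlier sketch.

```scala
// The table was created directly in OceanBase Database (for example, with RANGE
// or composite partitioning); Spark only needs to reference it through the catalog.
spark.sql("SHOW TABLES IN ob.test").show()
spark.sql("DESCRIBE TABLE ob.test.orders_range_partitioned").show(truncate = false)
spark.sql("SELECT COUNT(*) FROM ob.test.orders_range_partitioned WHERE order_date >= '2024-01-01'").show()
```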
Production environment recommendations
Parameter tuning
Dynamically adjust Spark resource configuration and OceanBase Database connection parameters based on business workload. Focus on key metrics such as thread concurrency and batch size.
Partition management
To improve query efficiency, consider partitioning tables based on business requirements.
Privilege management
Ensure proper management of user privileges in OceanBase Database and Spark to prevent unauthorized data access.
Version management
Regularly check for updates and upgrade to the latest version of `spark-connector-oceanbase` to benefit from performance optimizations and new features.