Background and architecture
Key advantages of Spark Catalog
Introduced in Apache Spark 3.0, Spark Catalog serves as a standardized metadata management interface. Through three core capabilities—unified metadata view, dynamic schema discovery, and standardized APIs—it enables consistent metadata management across heterogeneous data sources. Compared with the traditional Hive Metastore, its key advantages include:
| Key advantage | Technical implementation details |
|---|---|
| Unified metadata views | Supports collaborative metadata management across heterogeneous data sources such as OceanBase, HDFS, and Iceberg. |
| Dynamic schema discovery | Automatically infers the table structure of external data sources without requiring predefined schemas. |
| Standardized operations interfaces | Executes DDL/DML operations through a unified API. |
OceanBase Connector integration solution
Starting from V1.1, OceanBase Spark Connector has been deeply integrated with Spark Catalog:
- Open source: The project is fully open source and available on GitHub.
- Code-free access: Enables seamless integration using only SQL statements, with no need for additional code.
- Performance improvement: Supports optimization strategies such as adaptive partitioning, parallel read/write, and predicate pushdown.
- Cross-tenant access: Enables multi-tenant data source mapping through the catalog, allowing cross-business-unit joint queries.
- Partitioned table optimization: Automatically identifies partitioned tables in OceanBase Database and optimizes their read performance.
- Automatic schema inference: Automatically discovers and infers the table structure of OceanBase Database.
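The SQL-only access path can be illustrated with a short spark-shell sketch. The catalog implementation class, connection URL, credentials, and property names used for registration below are placeholders and assumptions for illustration only; check the connector documentation for your version for the exact values.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: register an OceanBase catalog on the SparkSession and query it
// with plain SQL. The catalog class name, URL, and credentials are illustrative
// assumptions, not verified values for your deployment.
val spark = SparkSession.builder()
  .appName("oceanbase-catalog-demo")
  .config("spark.sql.catalog.ob", "com.oceanbase.spark.catalog.OceanBaseCatalog") // assumed class name
  .config("spark.sql.catalog.ob.url", "jdbc:mysql://127.0.0.1:2881/test")         // placeholder endpoint
  .config("spark.sql.catalog.ob.username", "root@test")                           // placeholder credentials
  .config("spark.sql.catalog.ob.password", "******")
  .getOrCreate()

// Schemas are discovered automatically, so no DDL has to be declared on the Spark side.
spark.sql("SHOW TABLES IN ob.test").show()
spark.sql("SELECT * FROM ob.test.orders WHERE order_date >= '2024-01-01'").show()
```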
Optimize the resource configuration of a Spark cluster
Hardware resource planning strategy
Take a server with 128 CPU cores and 1 TB of memory as an example. The following resource configuration strategy is recommended.
| Hardware component | Physical specification | Spark resource configuration strategy | Calculation example |
|---|---|---|---|
| CPU | 128 cores | Worker cores = physical cores × 2.5 | 128 × 2.5 = 320 cores |
| Memory | 1024 GB | Reserve 24 GB for the operating system and allocate the remaining memory to Spark. | 1024 - 24 = 1000 GB |
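To make the arithmetic in the table explicit, the following trivial sketch derives the values used in the configuration files below; the 2.5 oversubscription factor and the 24 GB OS reservation come straight from the table above.

```scala
// Worked sizing example for a 128-core, 1 TB node, following the strategy above.
val physicalCores  = 128
val workerCores    = (physicalCores * 2.5).toInt // 320 cores exposed to Spark
val totalMemoryGb  = 1024
val workerMemoryGb = totalMemoryGb - 24          // reserve 24 GB for the OS => 1000 GB
println(s"SPARK_WORKER_CORES=$workerCores, SPARK_WORKER_MEMORY=${workerMemoryGb}G")
```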
Key configuration parameter adjustments
System-level configuration (spark-env.sh): Configure hardware resources.
```bash
# Memory resource configuration
export SPARK_WORKER_MEMORY=1000G
# CPU resource configuration
export SPARK_WORKER_CORES=320
```
Job-level configuration (spark-defaults.conf): Adjust parameters related to Spark jobs.
```properties
# Driver resource configuration
spark.driver.cores=2
spark.driver.memory=4g
# Executor resource configuration
spark.executor.memory=16g
spark.executor.cores=4
# Serialization optimization
spark.serializer=org.apache.spark.serializer.KryoSerializer
```
Optimization goal verification
The following metrics are validated through stress testing:
- The cluster resource utilization is greater than or equal to 95%.
- Linear scalability is verified: throughput generally increases as task parallelism increases.
- The stability of the system is verified through 72 hours of continuous load testing.
OceanBase Catalog configuration practices
Basic connection configuration
| Parameter | Description |
|---|---|
| spark.sql.catalog.your_catalog_name.driver | Specifies the JDBC driver class used by Spark to connect to OceanBase Database. Configure the driver class name that matches the JDBC driver you use. |
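As a hedged sketch of how the driver class might be supplied: the two class names below are the standard MySQL Connector/J and OceanBase JDBC driver classes, but verify them against the driver JAR you actually deploy; the catalog name `ob` is just an example.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: selecting the JDBC driver class for the catalog. Verify the class
// name against the driver JAR shipped with your Spark application.
val spark = SparkSession.builder()
  // When connecting with MySQL Connector/J:
  .config("spark.sql.catalog.ob.driver", "com.mysql.cj.jdbc.Driver")
  // Or, when using the OceanBase JDBC driver instead:
  // .config("spark.sql.catalog.ob.driver", "com.oceanbase.jdbc.Driver")
  .getOrCreate()
```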
Read operation optimization
To enhance Spark's read performance from OceanBase Database, adjust the following parameters based on your actual hardware specifications and resource conditions:
| Parameter | Default value | Description | Tuning suggestion |
|---|---|---|---|
| spark.sql.catalog.your_catalog_name.fetch-size | 100 | The number of rows that the JDBC driver fetches from OceanBase Database at a time. | Increase this value as appropriate to reduce the number of network round trips and improve the read performance of each Spark task. |
| spark.sql.catalog.your_catalog_name.max_records_per_partition | Empty | The maximum number of records in a partition when Spark reads data from OceanBase Database. When left empty, Spark automatically calculates this value based on the data volume. | We recommend that you do not set this value manually. |
| spark.sql.catalog.your_catalog_name.parallel_hint_degree | 1 | The value of n in the /*+ PARALLEL(n) */ hint that Spark automatically attaches to the SQL statements it sends to OceanBase Database. | Set this parameter to a value between 4 and 8 to increase parallelism, based on the computing resources available to OceanBase Database. |
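The following sketch applies the read parameters above, assuming a catalog named `ob` registered as in the earlier example; the concrete values are illustrative, not universal recommendations.

```scala
import org.apache.spark.sql.SparkSession

// Read-side tuning sketch: the parameter names come from the table above,
// the values are examples only.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.ob.fetch-size", "2000")        // fewer network round trips per task
  .config("spark.sql.catalog.ob.parallel_hint_degree", "6") // pushed-down SQL carries /*+ PARALLEL(6) */
  .getOrCreate()

// Simple projections and filters are pushed down to OceanBase Database where possible.
spark.sql("SELECT user_id, amount FROM ob.test.orders WHERE amount > 100").show()
```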
Write operation optimization
You can improve Spark's write performance to OceanBase Database by adjusting the following parameters:
JDBC write optimization
| Parameter | Description | Tuning suggestion |
|---|---|---|
| spark.sql.catalog.your_catalog_name.batch-size | The number of rows each Spark task accumulates before performing a write operation. | Increase this value to improve write efficiency. |
Optimization of direct load
| Parameter | Description | Tuning suggestion |
|---|---|---|
| spark.sql.catalog.your_catalog_name.direct-load.batch-size | The number of rows each Spark task accumulates before performing a write operation. | We recommend that you increase this value to improve write performance. |
| spark.sql.catalog.your_catalog_name.direct-load.parallel | The concurrency of the direct load task. The number of CPU cores used to process the current import task depends on this value. Default value: 8. | For large-scale direct load writes, increase this value to significantly shorten the direct load commit phase and improve overall performance. |
| spark.sql.catalog.your_catalog_name.direct-load.load-method | The direct load mode. Valid values: full, inc, and inc_replace. | |
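The following sketch applies the write-side parameters above, again assuming a catalog named `ob`. Only the parameters listed in the tables above are shown, with illustrative values; enabling direct load itself involves additional connector settings that are not covered in this section.

```scala
import org.apache.spark.sql.SparkSession

// Write-side tuning sketch: only parameters documented above are set here;
// the values are examples, not recommendations for every workload.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.ob.batch-size", "5000")              // JDBC write batching
  .config("spark.sql.catalog.ob.direct-load.batch-size", "10000") // rows accumulated per direct-load write
  .config("spark.sql.catalog.ob.direct-load.parallel", "16")      // direct-load concurrency
  .config("spark.sql.catalog.ob.direct-load.load-method", "full") // full, inc, or inc_replace
  .getOrCreate()

// Write through the catalog with plain SQL; the tuned parameters apply to the sink.
spark.sql("INSERT INTO ob.test.orders SELECT * FROM staging.orders_delta")
```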
Table management specifications
Non-partitioned tables
Limitations:
- Index creation is not supported.
- Default values for columns cannot be set.
Partitioned tables
Compatibility limitations:
- Spark supports only `BUCKET` partitioning, which corresponds to `KEY` partitioning in OceanBase Database. For example:

  ```sql
  CREATE TABLE test.test1 (
    user_id BIGINT COMMENT 'test_for_key',
    name VARCHAR(255)
  ) PARTITIONED BY (bucket(16, user_id))
  COMMENT 'test_for_table_create'
  TBLPROPERTIES('replica_num' = 2, COMPRESSION = 'zstd_1.0');
  ```

- Multi-level partitioning is not supported. Only one level of partitioning is allowed.
Recommended practices:
For complex partitioned tables, we recommend creating them directly in OceanBase Database. Spark Catalog automatically recognizes the existing table structure, as shown in the sketch below.
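For example, after a complex partitioned table has been created with OceanBase-native DDL, Spark can reference it through the catalog without any DDL on the Spark side. Table and database names here are placeholders, and `spark` is the session with the `ob` catalog registered as in the earlier sketch.

```scala
// The table was created directly in OceanBase Database (for example, with RANGE
// or composite partitioning); Spark only needs to reference it through the catalog.
spark.sql("SHOW TABLES IN ob.test").show()
spark.sql("DESCRIBE TABLE ob.test.orders_range_partitioned").show(truncate = false)
spark.sql("SELECT COUNT(*) FROM ob.test.orders_range_partitioned WHERE order_date >= '2024-01-01'").show()
```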
Production environment recommendations
Parameter tuning
Dynamically adjust Spark resource configuration and OceanBase Database connection parameters based on business workload. Focus on key metrics such as thread concurrency and batch size.
Partition management
To improve query efficiency, consider partitioning tables based on business requirements.
Privilege management
Ensure proper management of user privileges in OceanBase Database and Spark to prevent unauthorized data access.
Version management
Regularly check for updates and upgrade to the latest version of `spark-connector-oceanbase` to benefit from performance optimizations and new features.