Background information
In the era of digital transformation, data has become a core asset for enterprises. Efficiently handling large volumes of data and extracting its value is crucial for enhancing competitiveness. OceanBase Database is a high-performance distributed database that combines transaction processing and data warehousing capabilities, supporting the storage and querying of massive data. Apache Spark, a distributed computing framework, is renowned for its efficient computational power, making it the preferred tool for processing large volumes of data.
This topic provides a detailed guide to quickly processing and analyzing large volumes of data in OceanBase Database with Spark. By combining Spark's distributed computing framework with OceanBase's high scalability, you can maximize data processing speed and system responsiveness.
Business scenario
In the e-commerce industry, analyzing user behavior is a core method for optimizing business processes and enhancing user experience. For example, by analyzing user browsing and purchasing behaviors, companies can uncover user preferences, predict potential customers, and refine their product recommendation strategies.
This article uses e-commerce user behavior analysis as an example. The following steps outline the complete data processing chain:
- Data preparation: Obtain the user behavior dataset in CSV format.
- Data storage: Import the data into OceanBase Database.
- Data processing: Use Spark SQL for data cleaning, transformation, and analysis.
- Result storage: Write the analysis results back to OceanBase.
- Deep analysis: Identify key metrics such as user preferences, popular products, and potential customers.
Field descriptions of the dataset:
- UserID: The unique identifier for a user.
- Timestamp: The time when the user behavior occurred.
- PageURL: The URL of the accessed page.
- ProductID: The unique identifier for a product.
- ActionType: The type of user action (e.g., browse, cart, purchase).
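As a quick illustration of this schema, the following Python sketch parses the two sample rows used later in this tutorial into the five documented fields:

```python
import csv
import io

# Two sample rows matching the documented schema
sample = """UserID,Timestamp,PageURL,ProductID,ActionType
U001,2023-01-01 10:00:00,/product/1001,1001,browse
U002,2023-01-01 10:15:00,/product/1002,1002,purchase
"""

# DictReader exposes each row as a dict keyed by the five field names
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["UserID"], rows[0]["ActionType"])  # U001 browse
```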
Tutorial objectives
This tutorial aims to achieve the following user behavior analysis goals:
- User preference mining: Analyze user behavior on the platform (such as browsing and purchasing) to understand their interests or preferences in certain products.
- Potential customer prediction: Analyze user behavior data to identify potential high-value customers who are likely to make a purchase.
- Popular product analysis: Track popular products over a recent period to provide insights for optimizing product recommendations.
Prerequisites
Before you start, make sure that the following environment is ready:
- Spark cluster: Deployed and configured to support distributed computing tasks through Spark SQL. For more information, see Deploy a Spark cluster. Alternatively, you can download the Spark installation package and extract it to complete this tutorial.
- OceanBase Database: Deployed and accessible via JDBC or the MySQL protocol. For more information, see Connect to OceanBase Database using MySQL Connector/J.
- Dataset: A prepared user behavior CSV file (e.g., user_behavior.csv). You can download a dataset from https://www.kaggle.com/datasets.
- JAR packages: Download the latest versions of spark-connector-oceanbase and mysql-connector-j.
Procedure
Step 1: Configure the connection between Spark and OceanBase
Obtain the OceanBase connection information:

```
jdbc:mysql://<oceanbase_host>:<port>/<database>?useSSL=false
```

Parameters:

- oceanbase_host: The IP address or domain name of the OceanBase server.
- port: The port number (ODP listens on port 2883 by default; direct connections use port 2881).
- database: The name of the target database.
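As a minimal sketch, these parameters compose into the JDBC URL like this (the host, port, and database values below are placeholders, not part of the tutorial's actual setup):

```python
def build_jdbc_url(host: str, port: int, database: str) -> str:
    # ODP listens on 2883 by default; direct connections use 2881
    return f"jdbc:mysql://{host}:{port}/{database}?useSSL=false"

url = build_jdbc_url("127.0.0.1", 2881, "test")
print(url)  # jdbc:mysql://127.0.0.1:2881/test?useSSL=false
```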
Install the required drivers:

Move the spark-connector-oceanbase and mysql-connector-j JAR files to the Spark jars directory:

```shell
cp spark-connector-oceanbase-3.4_2.12-1.1.jar $SPARK_HOME/jars/
cp mysql-connector-j-8.2.0.jar $SPARK_HOME/jars/
```
Configure the Spark catalog:

Edit $SPARK_HOME/conf/spark-defaults.conf and add the following configurations:

```
spark.sql.catalog.ob=com.oceanbase.spark.catalog.OceanBaseCatalog
spark.sql.catalog.ob.url=jdbc:mysql://localhost:2881
spark.sql.catalog.ob.username=root@test
spark.sql.catalog.ob.password=******
spark.sql.catalog.ob.schema-name=test
```
Restart Spark to ensure the configurations take effect.
Step 2: Load the dataset into OceanBase
Prepare the data file:

Example structure of the CSV file user_behavior.csv:

```
UserID,Timestamp,PageURL,ProductID,ActionType
U001,2023-01-01 10:00:00,/product/1001,1001,browse
U002,2023-01-01 10:15:00,/product/1002,1002,purchase
```

If you downloaded the Spark installation package, you can start spark-sql with ./bin/spark-sql:

```shell
cd spark-3.x.x-bin-hadoop3
./bin/spark-sql
```

After a successful startup, the output looks like the following:

```
...
Spark master: local[*], Application Id: local-1744093026660
spark-sql (default)>
```
Create the target table:

```sql
CREATE TABLE user_behavior (
    UserID STRING,
    Timestamp TIMESTAMP,
    PageURL STRING,
    ProductID STRING,
    ActionType STRING
);
```

Load the data:

Load the CSV file using Spark SQL and write it to OceanBase:

```sql
-- Load the CSV file into a temporary view
CREATE OR REPLACE TEMP VIEW user_behavior_data
USING csv
OPTIONS (
  path '/path/to/user_behavior.csv',
  header 'true',
  inferSchema 'true'
);

-- Write to the OceanBase table
INSERT INTO user_behavior SELECT * FROM user_behavior_data;
```
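To see this load step end to end without a cluster, here is a hedged Python sketch that replays the same CSV-to-table flow using an in-memory SQLite database as a stand-in for OceanBase (the table and rows mirror the example above; SQLite is only an illustration, not part of the actual setup):

```python
import csv
import io
import sqlite3

csv_text = """UserID,Timestamp,PageURL,ProductID,ActionType
U001,2023-01-01 10:00:00,/product/1001,1001,browse
U002,2023-01-01 10:15:00,/product/1002,1002,purchase
"""

conn = sqlite3.connect(":memory:")  # stand-in for the OceanBase connection
conn.execute(
    "CREATE TABLE user_behavior ("
    "UserID TEXT, Timestamp TEXT, PageURL TEXT, ProductID TEXT, ActionType TEXT)"
)

# Parse the CSV and bulk-insert the rows
rows = [tuple(r.values()) for r in csv.DictReader(io.StringIO(csv_text))]
conn.executemany("INSERT INTO user_behavior VALUES (?, ?, ?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM user_behavior").fetchone()[0]
print(count)  # 2
```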
Step 3: Data cleaning and transformation
Identify missing values:

```sql
SELECT * FROM user_behavior
WHERE UserID IS NULL
   OR Timestamp IS NULL
   OR ProductID IS NULL
   OR ActionType IS NULL;
```

Filter valid behaviors:

Retain only browse and purchase actions:

```sql
CREATE OR REPLACE TEMP VIEW cleaned_data AS
SELECT * FROM user_behavior
WHERE ActionType IN ('browse', 'purchase')
  AND UserID IS NOT NULL
  AND ProductID IS NOT NULL;
```
Verify the cleaning results:

```sql
SELECT COUNT(*) AS total_records FROM cleaned_data;
```
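The cleaning rules above (keep only browse and purchase actions with non-null user and product IDs) can be sketched in plain Python on a few invented records:

```python
# Invented sample records; only the first passes every cleaning rule
records = [
    {"UserID": "U001", "ProductID": "1001", "ActionType": "browse"},
    {"UserID": None,   "ProductID": "1002", "ActionType": "purchase"},
    {"UserID": "U003", "ProductID": "1003", "ActionType": "cart"},
]

cleaned = [
    r for r in records
    if r["ActionType"] in ("browse", "purchase")
    and r["UserID"] is not None
    and r["ProductID"] is not None
]
print(len(cleaned))  # 1
```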
Step 4: Behavior analysis and key metric extraction

1. User behavior statistics

Count actions per user:

```sql
SELECT UserID, ProductID, ActionType, COUNT(*) AS ActionCount
FROM cleaned_data
GROUP BY UserID, ProductID, ActionType
ORDER BY UserID, ActionCount DESC;
```
2. Popular product analysis

Count actions per product:

```sql
SELECT ProductID, COUNT(*) AS TotalActions
FROM cleaned_data
GROUP BY ProductID
ORDER BY TotalActions DESC
LIMIT 10;
```
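The same top-N ranking can be sketched with Python's Counter; the product IDs below are invented:

```python
from collections import Counter

# Invented ProductID of each cleaned action
actions = ["1001", "1002", "1001", "1003", "1001", "1002"]

# most_common(n) mirrors ORDER BY count DESC LIMIT n
top_products = Counter(actions).most_common(2)
print(top_products)  # [('1001', 3), ('1002', 2)]
```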
3. Identify potential customers

Filter users who browsed a product more than 5 times:

```sql
SELECT UserID, ProductID, COUNT(*) AS BrowseCount
FROM cleaned_data
WHERE ActionType = 'browse'
GROUP BY UserID, ProductID
HAVING BrowseCount > 5;
```
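In plain Python terms, the HAVING filter keeps (UserID, ProductID) pairs whose browse count exceeds the threshold; the events below are invented:

```python
from collections import Counter

# Invented browse events as (UserID, ProductID) pairs
browses = [("U001", "1001")] * 6 + [("U002", "1002")] * 3

counts = Counter(browses)
# Keep only pairs browsed more than 5 times, mirroring HAVING BrowseCount > 5
potential = {pair: n for pair, n in counts.items() if n > 5}
print(potential)  # {('U001', '1001'): 6}
```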
Step 5: Write analysis results to OceanBase

Create a target table:

```sql
CREATE TABLE potential_users (
    UserID STRING,
    ProductID STRING,
    BrowseCount INT
);
```

Save the potential customer data:

```sql
INSERT INTO potential_users
SELECT UserID, ProductID, BrowseCount
FROM (
  SELECT UserID, ProductID, COUNT(*) AS BrowseCount
  FROM cleaned_data
  WHERE ActionType = 'browse'
  GROUP BY UserID, ProductID
  HAVING BrowseCount > 5
);
```
Step 6: Advanced analysis and extensions

1. Time distribution of actions

Analyze the time distribution of product views:

```sql
SELECT ProductID, HOUR(Timestamp) AS Hour, COUNT(*) AS BrowseCount
FROM cleaned_data
WHERE ActionType = 'browse'
GROUP BY ProductID, HOUR(Timestamp)
ORDER BY BrowseCount DESC;
```
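The HOUR(Timestamp) bucketing can be illustrated with Python's datetime on a few invented timestamps:

```python
from collections import Counter
from datetime import datetime

# Invented browse timestamps in the dataset's format
timestamps = [
    "2023-01-01 10:00:00",
    "2023-01-01 10:30:00",
    "2023-01-01 14:05:00",
]

# Bucket by hour of day, mirroring GROUP BY HOUR(Timestamp)
hourly = Counter(
    datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in timestamps
)
print(hourly)  # Counter({10: 2, 14: 1})
```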
2. Funnel analysis of user behavior

Analyze the conversion rate from browsing to purchasing:

```sql
SELECT UserID,
       COUNT(CASE WHEN ActionType = 'browse' THEN 1 END) AS BrowseCount,
       COUNT(CASE WHEN ActionType = 'purchase' THEN 1 END) AS PurchaseCount
FROM cleaned_data
GROUP BY UserID;
```
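Deriving a per-user conversion rate from those two counts can be sketched as follows (the action lists are invented):

```python
# Invented per-user action lists
user_actions = {
    "U001": ["browse", "browse", "purchase"],
    "U002": ["browse"],
}

# Conversion rate = purchases / browses per user (0.0 when there are no browses)
rates = {}
for user, actions in user_actions.items():
    browses = actions.count("browse")
    purchases = actions.count("purchase")
    rates[user] = purchases / browses if browses else 0.0
print(rates)  # {'U001': 0.5, 'U002': 0.0}
```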
Summary of the process
By integrating Spark + OceanBase, we achieved the following complete process:
- Data import: Load CSV data into OceanBase using Spark.
- Data cleaning: Filter out invalid records and retain key behavioral data.
- Analysis and mining:
  - User preference analysis (behavioral statistics).
  - Popular product discovery (product popularity ranking).
  - Potential customer screening (high-frequency browsers).
- Result storage: Write the analysis results back to OceanBase for future use.
Value
Decoupling of storage and computing:
- OceanBase handles the efficient storage and high-concurrency querying of large volumes of data.
- Spark manages distributed computing tasks, enabling federated data analysis across heterogeneous data sources, thereby expanding the boundaries of data analysis and improving efficiency.
End-to-end optimization:
- Covers the entire ETL process from data import to result output.
- Supports the integration of real-time analysis with historical data mining.
Conclusion
By leveraging OceanBase's high scalability in storage and Spark's distributed computing capabilities, enterprises can efficiently process massive amounts of user behavior data, quickly identify key business metrics, and support refined operations and intelligent decision-making. This solution is not only applicable to e-commerce but can also be extended to other industries such as finance and logistics, serving as a crucial technological foundation for digital transformation.