Quickly process and analyze massive data in OceanBase Database by using Spark|V4.3.5|OceanBase Database| docs|Distributed Database

Quickly process and analyze massive data in OceanBase Database by using Spark

Last Updated：2026-07-14 08:02:21 Updated

Background information

In the wave of digital transformation, data has become a core asset for enterprises. How to efficiently process massive amounts of data and extract value from it is crucial for enhancing competitiveness. OceanBase Database is a high-performance distributed database that combines transaction processing and data warehousing capabilities, enabling large-scale data storage and querying. Apache Spark, as a distributed computing framework, is the preferred tool for handling massive amounts of data due to its efficient computing power.

This topic aims to detail how to use Spark to quickly process and analyze massive amounts of data stored in OceanBase Database. By leveraging the distributed computing framework of Spark and the high scalability of OceanBase Database, the data processing speed and system response are optimized.

Business scenario

User behavior analysis is a core approach to optimizing business processes and enhancing user experience in the e-commerce industry. For example, by analyzing user behaviors such as browsing and purchasing, enterprises can uncover user preferences, identify potential customers, and refine their product recommendation strategies.

This example uses e-commerce user behavior analysis. It outlines the following steps to build a complete data processing pipeline:

Data preparation: Obtain the user behavior dataset in CSV format.
Data storage: Import the data to OceanBase Database.
Data processing: Use Spark SQL for data cleaning, transformation, and analysis.
Result storage: Write the analysis results back to OceanBase Database.
Deep analysis: Uncover key metrics such as user preferences, popular products, and potential customers.

Field descriptions of the dataset:

UserID: the unique identifier of a user.
Timestamp: the time when the user performed the action.
PageURL: the URL of the page visited by the user.
ProductID: the unique identifier of a product.
ActionType: the type of user action (e.g., browse, cart, purchase).

Tutorial objectives

This tutorial helps you achieve the following user behavior analysis objectives:

User preference mining: Analyze user behavior such as browsing and purchasing on the platform to understand their interests or preferences in certain products.
Potential customer prediction: Analyze user behavior data to identify high-value potential customers who are likely to complete a purchase.
Popular product analysis: Identify the most popular products in the recent period to optimize product recommendations.

Prerequisites

Ensure that the following environment is ready before you start:

Spark cluster: Deployed and configured, and supports distributed computing tasks through Spark SQL. See Deploy a Spark cluster.

You can also download the Spark installation package, decompress it, and follow the tutorial.
OceanBase Database: Deployed and accessible through JDBC or MySQL protocol. See Connect to OceanBase Database by using MySQL Connector/J.
Dataset: A ready user behavior CSV file, such as user_behavior.csv. You can download the dataset from [https://www.kaggle.com/datasets].
Download and install the required jar packages:
- spark-connector-oceanbase
- mysql-connector-j

Procedure

Step 1: Configure the connection between Spark and OceanBase

Obtain the connection information of OceanBase Database:
```
jdbc:mysql://<oceanbase_host>:<port>/<database>?useSSL=false
```
- Parameter description:
  - oceanbase_host: the IP address or domain name of OceanBase Database.
  - port: the port number. The default value is 2883 for ODP and 2881 for direct connection.
  - database: the name of the target database.
Install the dependency drivers:
- Move the jar packages of spark-connector-oceanbase and mysql-connector-j to the directory of Spark:
```
cp spark-connector-oceanbase-3.4_2.12-1.1.jar $SPARK_HOME/jars/
cp mysql-connector-j-8.2.0.jar $SPARK_HOME/jars/
```

Configure the Spark Catalog:

Edit the $SPARK_HOME/conf/spark-defaults.conf file and add the following configurations:

spark.sql.catalog.ob=com.oceanbase.spark.catalog.OceanBaseCatalog
spark.sql.catalog.ob.url=jdbc:mysql://localhost:2881
spark.sql.catalog.ob.username=root@test
spark.sql.catalog.ob.password=******
spark.sql.catalog.ob.schema-name=test

Restart Spark to ensure that the configuration takes effect.

Step 2: Load the dataset to OceanBase

Prepare the data file:

Sample CSV file user_behavior.csv schema:

UserID,Timestamp,PageURL,ProductID,ActionType
U001,2023-01-01 10:00:00,/product/1001,1001,browse
U002,2023-01-01 10:15:00,/product/1002,1002,purchase

If you have downloaded and installed the Spark package to try this tutorial, you can run ./bin/spark-sql to start spark-sql.

cd spark-3.x.x-bin-hadoop3
./bin/spark-sql

The following sample code is echoed upon successful startup:

...
Spark master: local[*], Application Id: local-1744093026660
spark-sql (default)>

Create a table:

spark-sql (default)>CREATE TABLE user_behavior (
    UserID STRING,
    Timestamp TIMESTAMP,
    PageURL STRING,
    ProductID STRING,
    ActionType STRING
);

Load the data:

Load the CSV file by using Spark SQL and write the data to OceanBase:

-- Load the CSV file to a temporary view.
spark-sql (default)>CREATE OR REPLACE TEMP VIEW user_behavior_data
USING csv
OPTIONS (
    path '/path/to/user_behavior.csv',
    header 'true',
    inferSchema 'true'
);

-- Write data to the OceanBase table.
spark-sql (default)>INSERT INTO user_behavior SELECT * FROM user_behavior_data;

Step 3: Clean and transform the data

Remove rows with missing values:

spark-sql (default)>SELECT * FROM user_behavior  
WHERE UserID IS NULL OR Timestamp IS NULL OR ProductID IS NULL OR ActionType IS NULL;

Filter valid actions:

Retain only browse and purchase actions:

spark-sql (default)>CREATE OR REPLACE TEMP VIEW cleaned_data AS
SELECT *
FROM user_behavior
WHERE ActionType IN ('browse', 'purchase')
  AND UserID IS NOT NULL
  AND ProductID IS NOT NULL;

Verify the cleaning result:

spark-sql (default)>SELECT COUNT(*) AS total_records FROM cleaned_data;

Step 4: Behavioral analysis and key metric mining

1. User behavior statistics

Count the number of actions for each user:

spark-sql (default)>SELECT UserID, ProductID, ActionType, COUNT(*) AS ActionCount
FROM cleaned_data
GROUP BY UserID, ProductID, ActionType
ORDER BY UserID, ActionCount DESC;

2. Popular product analysis

Count the total number of actions for each product:

spark-sql (default)>SELECT ProductID, COUNT(*) AS TotalActions
FROM cleaned_data
GROUP BY ProductID
ORDER BY TotalActions DESC
LIMIT 10;

3. Identify potential customers

Filter users with more than 5 views:

spark-sql (default)>SELECT UserID, ProductID, COUNT(*) AS BrowseCount
FROM cleaned_data
WHERE ActionType = 'browse'
GROUP BY UserID, ProductID
HAVING BrowseCount > 5;

Step 5: Write the analysis results back to OceanBase

Create the destination table:

spark-sql (default)>CREATE TABLE potential_users (
    UserID STRING,
    ProductID STRING,
    BrowseCount INT
);

Save the data of potential customers:

spark-sql (default)>INSERT INTO potential_users
SELECT UserID, ProductID, BrowseCount
FROM (
    SELECT UserID, ProductID, COUNT(*) AS BrowseCount
    FROM cleaned_data
    WHERE ActionType = 'browse'
    GROUP BY UserID, ProductID
    HAVING BrowseCount > 5
);

Step 6: Perform complex analysis and scaling

1. Behavioral time distribution analysis

Count the number of views for each product by hour:

spark-sql (default)>SELECT ProductID, HOUR(Timestamp) AS Hour, COUNT(*) AS BrowseCount
FROM cleaned_data
WHERE ActionType = 'browse'
GROUP BY ProductID, HOUR(Timestamp)
ORDER BY BrowseCount DESC;

2. User behavior funnel analysis

Analyze the conversion rate from browsing to purchasing:

spark-sql (default)>SELECT UserID,
       COUNT(CASE WHEN ActionType = 'browse' THEN 1 END) AS BrowseCount,
       COUNT(CASE WHEN ActionType = 'purchase' THEN 1 END) AS PurchaseCount
FROM cleaned_data
GROUP BY UserID;

Summary

The Spark + OceanBase solution enables the following complete process:

Data import: Load CSV data to OceanBase by using Spark.
Data cleansing: Filter out invalid records and retain key behavioral data.
Analysis and mining:
- User preference analysis (behavior statistics).
- Popular product mining (product popularity ranking).
- Potential customer screening (high-traffic users).
Result storage: Write the analysis results back to OceanBase for future analysis.

Benefits

Separation of storage and computing:
- OceanBase Database efficiently stores massive data and supports high-concurrency queries.
- Apache Spark is responsible for distributed computing tasks. It can perform federated data analysis across heterogeneous data sources, thereby expanding the boundaries of data analysis and improving analysis efficiency.
End-to-end optimization:
- It covers the entire ETL process from data import to result output.
- It supports the combination of real-time analysis and historical data mining.

Conclusion

With the highly scalable storage capability of OceanBase Database and the distributed computing capability of Spark, enterprises can efficiently process massive user behavior data, quickly uncover key business metrics, and provide data support for refined operations and intelligent decision-making. This solution is not only suitable for e-commerce scenarios but can also be extended to multiple fields such as finance and logistics, serving as a crucial technical foundation for enterprise digital transformation.

OceanBase

Customer Stories

Documentation