Scenarios
- Typical scenarios
- Real-time reporting: Combine OceanBase real-time transaction data with offline wide tables.
- Cross-system joint analysis: User behavior logs (Hive/Iceberg) + user profile data (OceanBase).
- Core value
- No data migration is required, avoiding redundant storage and synchronization latency.
- A unified SQL engine reduces development complexity.
- On-demand reading saves computing and I/O costs.
Overview
OceanBase Database has supported HMS Catalog since V4.4.1. This topic is provided in the V4.3.5 documentation set to help you understand and plan for the AP capabilities of OceanBase Database.
The core principles are as follows:
Catalog mechanism
- A catalog is a logical container for metadata. Each catalog corresponds to an external data source, such as a MaxCompute project or a Hive Metastore.
- OceanBase Database provides an internal catalog for managing local tables.
Federated query
- The query engine identifies cross-catalog table references, such as
hms_prod.project.table. - Metadata retrieval: The catalog is used to obtain the table structure from HMS.
- Data reading: Based on the underlying storage type indicated by the table's location, the corresponding access protocol and authentication method are used.
- If it is HDFS, Kerberos authentication is supported.
- If it is an object storage service compatible with the S3 protocol, such as OSS, the compatible file system (such as S3A or OSS-HDFS) is used for access, and AccessKey/SecretKey (AK/SK) signature authentication is performed.
Tutorial objectives
- Learn how to use OceanBase to perform ETL-free federated queries in AP scenarios by using a catalog.
- Learn how to use HMS Catalog to access Hive and Iceberg data.
Environment requirements
- OceanBase Database V4.4.1 or later (with HMS Catalog support)
- Network connectivity:
- HMS: Hive Metastore, HDFS, S3, and OSS must be accessible from OceanBase Database nodes.
- Permissions:
- Hive: Read permissions for HMS and HDFS files (configured through Location). For more information, see HMS Catalog.
Procedure for accessing Hive/Iceberg tables using HMS Catalog
Prerequisites
Before you start, make sure that the following environment is ready:
| Condition | Description |
|---|---|
| Hive Metastore (HMS) is running | It provides metadata services. The URI is in the thrift://host:port format. |
| The underlying storage is accessible | The storage system, such as HDFS, OSS, or S3, is configured, and the OBServer nodes have read permissions. |
| Network connectivity | The OBServer nodes can access HMS (on the default port 9083 or a custom port) and the underlying storage. |
| OceanBase is deployed and Java support is enabled (optional) | If you need to access HDFS, deploy the OceanBase Database Java SDK environment in advance. |
Note
If Kerberos authentication is enabled for HMS, you must also provide the KERBEROS_PRINCIPAL and KEYTAB parameters.
Assume that the data directory name in the HMS catalog is hive_meta, and the test_table table exists in the test_database database in the hive_meta catalog. The data in the test_table table is as follows:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1001 | 2500 | CN | East | 2023-10-01 |
| 1003 | 1800 | JP | Kanto | 2023-10-03 |
| 1002 | 3500 | US | West | 2023-10-02 |
+------------+--------+---------+--------+------------+
The table structure in the HMS catalog is shown here to make the verification results in the following steps clearer. If you are not familiar with the table structure in your actual scenario, run the SHOW CREATE TABLE ${catalog_name}.${hive_db_name}.${table_name}; statement to view the table creation statement.
Step 1: Create an HMS catalog in OceanBase Database
-- Create an external catalog.
CREATE EXTERNAL CATALOG hive_meta
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://xx.xx.xx.xx:9083' -- Please replace the URI with your actual configuration.
);
-- Verify the catalog creation.
SHOW CATALOGS;
If Kerberos authentication is enabled for HMS, you must also configure the KERBEROS_PRINCIPAL and KEYTAB parameters.
The execution result is as follows:
+-----------+
| Catalog |
+-----------+
| hive_meta |
| internal |
+-----------+
Step 2: Prepare a real-time orders table in the OceanBase internal catalog
This step is performed in the internal catalog of OceanBase Database.
Create a database and a table
-- Switch to the internal catalog (default).
SET CATALOG internal;
-- Create a new database (optional; you can also use an existing database).
CREATE DATABASE IF NOT EXISTS federated_demo;
USE federated_demo;
-- Create an internal table.
CREATE TABLE test_table_internal (
product_id INT,
amount DOUBLE,
country VARCHAR(10),
region VARCHAR(20),
event_date DATE
);
Insert test data
INSERT INTO test_table_internal VALUES
(1004, 4200, 'DE', 'Berlin', '2023-10-04'),
(1005, 1900, 'FR', 'Paris', '2023-10-05'),
(1001, 3000, 'CN', 'South', '2023-10-06'); -- The product_id value is duplicated with that in the Hive table.
-- Verify the data.
SELECT * FROM test_table_internal;
Run the following SQL statement in OceanBase Database to verify the data:
-- Query the internal table.
SELECT * FROM sales_data_internal;
-- Output:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1004 | 4200 | DE | Berlin | 2023-10-04 |
| 1005 | 1900 | FR | Paris | 2023-10-05 |
| 1001 | 3000 | CN | South | 2023-10-06 |
+------------+--------+---------+--------+------------+
3 rows in set
-- Run the following statement in OceanBase Database to query the table in the configured HMS catalog (to confirm that the table can be accessed).
SELECT * FROM hive_meta.test_database.test_table;
-- Output:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1001 | 2500 | CN | East | 2023-10-01 |
| 1003 | 1800 | JP | Kanto | 2023-10-03 |
| 1002 | 3500 | US | West | 2023-10-02 |
+------------+--------+---------+--------+------------+
3 rows in set
Step 3: Federated query (internal + Hive)
Scenario 1: Merge all sales data
SELECT
product_id,
amount,
country,
region,
event_date,
'internal' AS data_source
FROM federated_demo.test_table_internal
UNION ALL
SELECT
product_id,
amount,
country,
region,
STR_TO_DATE(event_date, '%Y-%m-%d') AS event_date, -- Hive's event_date is a string, convert it to DATE
'hive' AS data_source
FROM hive_meta.test_database.test_table
ORDER BY event_date;
Notice
The event_date column in Hive tables is of the string type. You must use the STR_TO_DATE() function to convert it to the DATE type to ensure consistency in data types.
+------------+--------+---------+--------+------------+-------------+
| product_id | amount | country | region | event_date | data_source |
+------------+--------+---------+--------+------------+-------------+
| 1001 | 2500 | CN | East | 2023-10-01 | hive |
| 1002 | 3500 | US | West | 2023-10-02 | hive |
| 1003 | 1800 | JP | Kanto | 2023-10-03 | hive |
| 1004 | 4200 | DE | Berlin | 2023-10-04 | internal |
| 1005 | 1900 | FR | Paris | 2023-10-05 | internal |
| 1001 | 3000 | CN | South | 2023-10-06 | internal |
+------------+--------+---------+--------+------------+-------------+
6 rows in set
Scenario 2: Calculate the total sales by country (cross-source aggregation)
SELECT
country,
SUM(amount) AS total_sales,
COUNT(*) AS record_count
FROM (
SELECT product_id, amount, country, region, event_date
FROM federated_demo.test_table_internal
UNION ALL
SELECT product_id, amount, country, region, event_date
FROM hive_meta.test_database.test_table
) AS all_sales
GROUP BY country
ORDER BY total_sales DESC;
Expected output:
+---------+-------------+--------------+
| country | total_sales | record_count |
+---------+-------------+--------------+
| US | 3500 | 1 |
| CN | 5500 | 2 | -- 2500 (Hive) + 3000 (Internal)
| DE | 4200 | 1 |
| FR | 1900 | 1 |
| JP | 1800 | 1 |
+---------+-------------+--------------+
Scenario 3: Find products that exist only in the internal data source
SELECT i.product_id, i.country
FROM federated_demo.test_table_internal i
LEFT JOIN hive_meta.test_database.test_table h
ON i.product_id = h.product_id
WHERE h.product_id IS NULL;
Output:
+------------+---------+
| product_id | country |
+------------+---------+
| 1004 | DE |
| 1005 | FR |
+------------+---------+
Federated aggregation
-- Federated aggregation
SELECT country, SUM(amount) AS total
FROM (
SELECT product_id, amount, country FROM federated_demo.test_table_internal
UNION ALL
SELECT product_id, amount, country FROM hive_meta.test_database.test_table
) t
GROUP BY country;
Expected output:
+---------+-------+
| country | total |
+---------+-------+
| DE | 4200 |
| FR | 1900 |
| CN | 5500 |
| JP | 1800 |
| US | 3500 |
+---------+-------+
5 rows in set