Scenarios
- Typical scenarios
- Real-time reporting: Combine OceanBase real-time transaction data with offline wide tables.
- Cross-system joint analysis: User behavior logs (Hive/Iceberg) + user profile data (OceanBase).
- Core values
- No data migration is required, avoiding redundant storage and synchronization delays.
- Unified SQL engine reduces development complexity.
- On-demand reading saves computing and I/O costs.
Overview
OceanBase Database supports HMS Catalog since V4.4.1. The core principle is as follows:
Catalog mechanism
- A catalog is a logical container for metadata. Each catalog corresponds to an external data source, such as a MaxCompute project or Hive Metastore.
- OceanBase Database provides an internal catalog for managing local tables.
Federated query
- The query engine identifies references to tables across catalogs, such as
hms_prod.project.table. - Metadata retrieval: The metadata is obtained from HMS through the catalog.
- Data retrieval: The data is read based on the underlying storage type indicated by the table's location. The corresponding access protocol and authentication method are used.
- If the storage is HDFS, Kerberos authentication is supported.
- If the storage is OSS or an object storage compatible with the S3 protocol, the compatible file system (such as S3A or OSS-HDFS) is used for access, and AccessKey/SecretKey (AK/SK) signature authentication is performed.
Tutorial objectives
- Master the core capability of OceanBase to achieve ETL-free federated queries in AP scenarios using Catalog.
- Learn how to use HMS Catalog to access Hive and Iceberg data.
Environment requirements
- OceanBase Database V4.4.1 or later (with HMS Catalog support)
- Network connectivity:
- HMS: Hive Metastore and HDFS/S3/OSS must be accessible from OceanBase nodes
- Permissions:
- Hive: HMS read permission and HDFS file read permission (configured via Location)
Procedure for accessing Hive/Iceberg tables in HMS Catalog
Prerequisites
Before you start, make sure that the following environment is ready:
Condition |
Description |
|---|---|
| Hive Metastore (HMS) is running | Provides metadata services. The URI format is thrift://host:port. |
| Underlying storage is accessible | The underlying storage system, such as HDFS, OSS, or S3, is configured and the OBServer nodes have read permissions. |
| Network connectivity | The OBServer nodes can access HMS (default port 9083 or a custom port) and the underlying storage. |
| OceanBase Database is deployed and Java support is enabled (optional) | If you need to access HDFS, deploy the OceanBase Database Java SDK environment in advance. |
Note
If Kerberos authentication is enabled for HMS, you must also provide the KERBEROS_PRINCIPAL and KEYTAB parameters.
Assume that the data directory name in the HMS catalog is hive_meta, and the test_table table is in the test_database database in the hive_meta catalog. The table data is as follows:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1001 | 2500 | CN | East | 2023-10-01 |
| 1003 | 1800 | JP | Kanto | 2023-10-03 |
| 1002 | 3500 | US | West | 2023-10-02 |
+------------+--------+---------+--------+------------+
The table structure in the HMS catalog is shown here for clarity in the verification results. If you are unsure about the table structure in the actual scenario, run the SHOW CREATE TABLE ${catalog_name}.${hive_db_name}.${table_name}; statement to view the table creation statement.
Step 1: Create an HMS catalog in OceanBase Database
-- Create an external catalog.
CREATE EXTERNAL CATALOG hive_meta
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://xx.xx.xx.xx:9083' -- Please replace this with the URI of your actual HMS.
);
-- Verify the catalog creation.
SHOW CATALOGS;
If Kerberos authentication is enabled for HMS, you must also configure the KERBEROS_PRINCIPAL and KEYTAB parameters.
The execution result is as follows:
+-----------+
| Catalog |
+-----------+
| hive_meta |
| internal |
+-----------+
Step 2: Prepare a real-time orders table in the OceanBase internal catalog
This step is executed in the internal catalog of OceanBase Database.
Create a database and a table
-- Switch to the internal catalog (default).
SET CATALOG internal;
-- Create a new database (optional; you can also use an existing database).
CREATE DATABASE IF NOT EXISTS federated_demo;
USE federated_demo;
-- Create an internal table.
CREATE TABLE test_table_internal (
product_id INT,
amount DOUBLE,
country VARCHAR(10),
region VARCHAR(20),
event_date DATE
);
Insert test data
INSERT INTO test_table_internal VALUES
(1004, 4200, 'DE', 'Berlin', '2023-10-04'),
(1005, 1900, 'FR', 'Paris', '2023-10-05'),
(1001, 3000, 'CN', 'South', '2023-10-06'); -- The product_id value is the same as that in the Hive table.
-- Verify the data.
SELECT * FROM test_table_internal;
Run the following SQL statements in OceanBase Database to verify the data:
-- Query the internal table.
SELECT * FROM sales_data_internal;
-- Output:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1004 | 4200 | DE | Berlin | 2023-10-04 |
| 1005 | 1900 | FR | Paris | 2023-10-05 |
| 1001 | 3000 | CN | South | 2023-10-06 |
+------------+--------+---------+--------+------------+
3 rows in set
-- Run the following statement in OceanBase Database to query the table in the configured HMS catalog (to confirm that the table can be accessed).
SELECT * FROM hive_meta.test_database.test_table;
-- Output:
+------------+--------+---------+--------+------------+
| product_id | amount | country | region | event_date |
+------------+--------+---------+--------+------------+
| 1001 | 2500 | CN | East | 2023-10-01 |
| 1003 | 1800 | JP | Kanto | 2023-10-03 |
| 1002 | 3500 | US | West | 2023-10-02 |
+------------+--------+---------+--------+------------+
3 rows in set
Step 3: Federated query (internal + Hive)
Scenario 1: Merge all sales data
SELECT
product_id,
amount,
country,
region,
event_date,
'internal' AS data_source
FROM federated_demo.test_table_internal
UNION ALL
SELECT
product_id,
amount,
country,
region,
STR_TO_DATE(event_date, '%Y-%m-%d') AS event_date, -- Hive's event_date is a string, convert it to DATE
'hive' AS data_source
FROM hive_meta.test_database.test_table
ORDER BY event_date;
Notice
In Hive, the event_date is a string. You must use the STR_TO_DATE() function to convert it to DATE to ensure that the types are consistent.
+------------+--------+---------+--------+------------+-------------+
| product_id | amount | country | region | event_date | data_source |
+------------+--------+---------+--------+------------+-------------+
| 1001 | 2500 | CN | East | 2023-10-01 | hive |
| 1002 | 3500 | US | West | 2023-10-02 | hive |
| 1003 | 1800 | JP | Kanto | 2023-10-03 | hive |
| 1004 | 4200 | DE | Berlin | 2023-10-04 | internal |
| 1005 | 1900 | FR | Paris | 2023-10-05 | internal |
| 1001 | 3000 | CN | South | 2023-10-06 | internal |
+------------+--------+---------+--------+------------+-------------+
6 rows in set
Scenario 2: Calculate the total sales by country (cross-source aggregation)
SELECT
country,
SUM(amount) AS total_sales,
COUNT(*) AS record_count
FROM (
SELECT product_id, amount, country, region, event_date
FROM federated_demo.test_table_internal
UNION ALL
SELECT product_id, amount, country, region, event_date
FROM hive_meta.test_database.test_table
) AS all_sales
GROUP BY country
ORDER BY total_sales DESC;
Expected output:
+---------+-------------+--------------+
| country | total_sales | record_count |
+---------+-------------+--------------+
| US | 3500 | 1 |
| CN | 5500 | 2 | -- 2500 (Hive) + 3000 (Internal)
| DE | 4200 | 1 |
| FR | 1900 | 1 |
| JP | 1800 | 1 |
+---------+-------------+--------------+
Scenario 3: Find products that exist only in internal
SELECT i.product_id, i.country
FROM federated_demo.test_table_internal i
LEFT JOIN hive_meta.test_database.test_table h
ON i.product_id = h.product_id
WHERE h.product_id IS NULL;
Output:
+------------+---------+
| product_id | country |
+------------+---------+
| 1004 | DE |
| 1005 | FR |
+------------+---------+
Federated aggregation
-- Federated aggregation
SELECT country, SUM(amount) AS total
FROM (
SELECT product_id, amount, country FROM federated_demo.test_table_internal
UNION ALL
SELECT product_id, amount, country FROM hive_meta.test_database.test_table
) t
GROUP BY country;
Expected output:
+---------+-------+
| country | total |
+---------+-------+
| DE | 4200 |
| FR | 1900 |
| CN | 5500 |
| JP | 1800 |
| US | 3500 |
+---------+-------+
5 rows in set
