OceanBase Database V4.4.1 and later allow you to connect to Hive Metastore (HMS) and create an HMS Catalog. This feature enables you to access tables managed by HMS, including traditional Hive tables and Iceberg tables whose metadata is registered in HMS. You can query data managed by HMS in the Hadoop ecosystem without data migration, making it ideal for accelerating data lake queries.
- Unified metadata management: Automatically synchronize table structures from HMS.
- Federated queries: Join and analyze data with internal OceanBase tables.
- Read-only security: Prevent accidental modifications to source data.
OceanBase Database V4.4.1 and later support the HMS Catalog feature. This topic provides usage instructions for the AP capabilities of OceanBase Database (as described in the V4.3.5 documentation set) to help you get started.
Supported features
Read-only access: All objects under an HMS Catalog are read-only. Currently, INSERT, UPDATE, and DROP TABLE operations are not supported.

Supported table types:
Hive tables: ORC, Parquet, TextFile, and CSV.
- In the current version, only the ARRAY type in Parquet format is supported for complex types.
- For more information about the ARRAY type, see Overview of array element types.
Iceberg tables: V1 and V2. We recommend Parquet format.
Advanced features:
- Iceberg schema evolution (column addition and deletion)
- Iceberg partition transformation (such as bucket() and year() partition expressions)
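To illustrate partition transformation support, the sketch below assumes a hypothetical Iceberg table partitioned by year(event_time); the catalog, database, and table names are placeholders. From OceanBase Database, you query such a table like any other table and filter on the source column, without referencing the transform:

```sql
-- Hypothetical Iceberg table created by Spark or Flink with a year(event_time)
-- partition transform and registered in HMS. Filter on the source column as usual:
SELECT COUNT(*)
FROM my_hms_catalog.iceberg_db.events
WHERE event_time >= '2025-01-01';
```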
Hive version compatibility
- Supported versions: Hive 1.2.x, Hive 2.3.x, Hive 3.1.x, and Hive 4.x.
- Compatibility with Iceberg tables:
Hive 4.x and later: Hive natively supports Iceberg tables. You can create and update metadata of Iceberg tables by using Hive. OceanBase Database can directly access Iceberg tables through HMS Catalog.
Hive 1.2.x, 2.3.x, and 3.1.x: Hive does not support Iceberg tables. In such environments, Iceberg tables are usually created and written to by computing engines such as Spark and Flink, and then registered in the Hive Metastore. OceanBase Database identifies and reads Iceberg tables based on the metadata information stored in HMS.
Therefore, as long as the metadata of Iceberg tables is correctly registered in the Hive Metastore (regardless of the engine used for writing), OceanBase Database can access them.
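For illustration, an Iceberg table could be registered in HMS by Spark along these lines. The catalog name hive_cat and the table schema are hypothetical; the spark.sql.catalog.* properties are standard Iceberg-on-Spark settings:

```sql
-- Spark SQL, assuming the Spark session was started with an Iceberg catalog
-- backed by the Hive Metastore, for example:
--   spark.sql.catalog.hive_cat      = org.apache.iceberg.spark.SparkCatalog
--   spark.sql.catalog.hive_cat.type = hive
--   spark.sql.catalog.hive_cat.uri  = thrift://hms.example.com:9083
CREATE TABLE hive_cat.iceberg_db.sales_iceberg (
    product_id BIGINT,
    sales      DOUBLE,
    event_time TIMESTAMP
) USING iceberg;
-- Once the table metadata is in HMS, OceanBase Database can read the table
-- through an HMS Catalog.
```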
Prerequisites
Environment dependencies: If the underlying storage is HDFS, you must deploy a Java SDK environment. For more information, see Deploy the OceanBase Java SDK environment.
Privilege requirements
HMS service: The OceanBase cluster must be able to access Hive Metastore (Thrift protocol). The address and port of the HMS service must be provided by the customer's O&M team. By default, open-source Hive uses port 9083, but this may vary in actual deployment. Please refer to the on-site configuration.
Storage system: All OBServer nodes must have read access to the underlying storage system (HDFS, S3, or OSS).
Syntax for creating an HMS Catalog
CREATE EXTERNAL CATALOG [IF NOT EXISTS] catalog_name
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://host:port',
[PRINCIPAL = '...'],
[KEYTAB = '...'],
[KRB5CONF = '...'],
[MAX_CLIENT_POOL_SIZE = 20],
[SOCKET_TIMEOUT = 10000000]
);
Parameters
| Parameter | Required | Description |
|---|---|---|
| TYPE | Yes | Fixed to 'HMS'. |
| URI | Yes | The Thrift address of HMS, in the format thrift://$host:$port. To configure multiple HMS addresses, separate them with commas, for example: "thrift://<HMS IP address 1>:<HMS port number 1>,thrift://<HMS IP address 2>:<HMS port number 2>,thrift://<HMS IP address 3>:<HMS port number 3>". |
| PRINCIPAL | No | The Kerberos principal, usually in the service/HOST@REGION.com format. Example: hive/hadoop@QA.COM. |
| KEYTAB | No | The path of the keytab file required to access the HMS service with Kerberos authentication. If OceanBase Database is deployed in a distributed manner, the file must exist at the same path on all OBServer nodes. Notice: This parameter is required only when Kerberos authentication is enabled for HMS. |
| KRB5CONF | No | The path of the Kerberos configuration file, such as /etc/krb5.conf. If OceanBase Database is deployed in a distributed manner, the file must exist at the same path on all OBServer nodes. |
| MAX_CLIENT_POOL_SIZE | No | The HMS client connection pool size (default is 20, meaning the current HMS Catalog can start up to 20 clients connecting to the HMS service). |
| SOCKET_TIMEOUT | No | The timeout for connecting to the Hive Metastore, in microseconds. Default value: 10,000,000 (10 seconds). |
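For example, based on the syntax above, a catalog that overrides the optional tuning parameters might look like this (the catalog name and address are placeholders):

```sql
CREATE EXTERNAL CATALOG hms_tuned
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://hms.example.com:9083',
    MAX_CLIENT_POOL_SIZE = 50,    -- allow up to 50 clients to connect to HMS
    SOCKET_TIMEOUT = 30000000     -- 30 seconds, in microseconds
);
```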
Create an HMS Catalog with SIMPLE authentication
In the SIMPLE authentication mode, no Location authentication is required.
obclient> CREATE EXTERNAL CATALOG test_hms_catalog
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://xxx.xxx.xxx.xxx:xxxx'
);
For more information about Hadoop authentication modes, see the "Authentication mechanism of HMS Catalog and HDFS storage" section below.
Create an HMS Catalog with Kerberos authentication
For more information about Hadoop authentication modes, see the section titled "Authentication mechanism of HMS Catalog and HDFS storage" below.
Step 1: Create an HMS Catalog with Kerberos authentication (metadata layer)
CREATE EXTERNAL CATALOG hms_kerberos
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://hms.example.com:9083',
PRINCIPAL = 'hive/hms.example.com@EXAMPLE.COM',
KEYTAB = '/etc/ob/hive.keytab',
KRB5CONF = '/etc/krb5.conf'
);
Step 2: Create an HDFS location with Kerberos authentication (data layer)
CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://namenode:8020/'
CREDENTIAL (
PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
KEYTAB = "/etc/ob/hdfs.keytab",
KRB5CONF = "/etc/krb5.conf",
CONFIGS = '...' -- HA configurations as described above
);
The HMS Catalog can discover the table structure, and the location can securely read HDFS data files.
Switch Catalog
SET CATALOG hms_catalog;
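Switching the catalog is optional: you can also reference tables with fully qualified catalog.database.table names. A sketch of both styles, using the example names from this topic:

```sql
-- Style 1: switch to the external catalog, then use database.table names
SET CATALOG hms_catalog;
SELECT COUNT(*) FROM hive_db.customer;

-- Style 2: keep the current catalog and use fully qualified names
SELECT COUNT(*) FROM hms_catalog.hive_db.customer;
```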
Query external data sources through Catalog
-- Hive table
SELECT city, COUNT(*) FROM hive_db.customer WHERE dt >= '2025-04-01' GROUP BY city;
-- Iceberg table
SELECT product_id, SUM(sales) FROM iceberg_db.sales_iceberg WHERE event_time >= '2025-04-01' GROUP BY product_id;
-- Federation query
SELECT o.order_id, h.city
FROM internal.test_db.orders o
JOIN hms_catalog.hive_db.customer h ON o.user_id = h.id;
Authentication mechanism of HMS Catalog and HDFS storage
In OceanBase Database, accessing Hive Metastore (HMS) involves two independent authentication layers:
Catalog layer: Used to connect to Hive Metastore (to retrieve table structure, partition, and other metadata).
Location layer: Used to read data files stored in file systems such as HDFS.
The authentication methods for these two layers must be configured separately and must align with the security policies of the Hadoop cluster.
The Hadoop authentication mode is determined by the server
The Hadoop cluster defines its security mode through the hadoop.security.authentication configuration in core-site.xml:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value> <!-- or simple -->
</property>
- kerberos: Enables Kerberos authentication. All clients, including the HMS client and the HDFS client, must pass Kerberos authentication.
- simple: The Hadoop server trusts the client's declared operating system username and does not perform password or credential verification. Permission control is based solely on the owner, group, and POSIX permission bits (for example, drwxr-xr-x (755)) of HDFS files and directories, without Kerberos authentication.
Notice
This configuration is located on the Hadoop server ($HADOOP_HOME/etc/hadoop/core-site.xml). OceanBase Database does not directly read this file but adapts to this mode through its own Catalog and Location configurations.
Authentication configuration of OceanBase Database
| Hadoop mode | HMS Catalog configuration | HDFS Location configuration |
|---|---|---|
| SIMPLE | No authentication parameters are specified when creating the catalog (SIMPLE is the default). | No Kerberos credentials are required. For restricted paths, create a location that specifies an HDFS user with read permissions. |
| KERBEROS | Provide PRINCIPAL, KEYTAB, and KRB5CONF when creating the catalog. | Create an HDFS location with Kerberos authentication, and provide the principal, keytab, and krb5.conf. |
Important configuration rule: If Kerberos is enabled for either HMS or HDFS, Kerberos credentials must be configured for both layers; verify that your deployment meets this requirement.
Overview of two authentication modes
SIMPLE mode (common in development and testing environments)
Applicable scenarios:
Kerberos is not enabled for HDFS, and Linux file permissions are used for access control.
Tables are written by specific users (for example, the owner of an HDFS path written by Impala is impala, and the owner of an HDFS path written by Hive on Tez is the user who submitted the job, such as hive).
Configuration requirements:
Anonymous read path: HDFS directory permissions are open (for example, drwxr-xr-x (755)), and any user can read the path.
- Meaning: the owner has read, write, and execute permissions (rwx = 4+2+1 = 7); the group and other users have only read and execute permissions (r-x = 4+0+1 = 5).
- For such paths, OceanBase Database does not need to create a location and can directly access the path.
Restricted path: If the user of the OceanBase Database observer process (such as admin) is not in the authorized user or group of the directory, you must:
Create a location and specify a user with permissions:
CREATE LOCATION my_loc
    URL = 'hdfs://mycluster/'
    CREDENTIAL (
        USERNAME = 'username'
    );
Scenario 1: No Kerberos authentication and no need to specify an HDFS user (default anonymous access)
Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (that is, hadoop.security.authentication=simple).
Operation: No Location needs to be created.
Scenario 2: No Kerberos authentication is required, but an HDFS user must be specified
Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (that is, hadoop.security.authentication=simple), but the data files of the target Hive table are stored in a restricted HDFS path (for example, written by Impala, Hive, or another engine, with the path owner being a specific user and without global read permissions).

Typical cases:
- The path owner is impala when Impala writes to the HDFS path.
- The path owner is hive or etl_user when a Hive job writes to the HDFS path.
- By default, the OceanBase Database observer process uses the operating system user (such as admin) to access HDFS. If the user does not have read permissions, the query fails.
Example: Assume that a Hive table is written by Impala, and the HDFS path is hdfs://namenode:8020/warehouse/sales.db/click_log.
CREATE LOCATION hdfs_impala_data
URL = 'hdfs://namenode:8020/' -- For how to determine the correct URL, see the description below
CREDENTIAL (
USERNAME = 'username' -- Specify the username for accessing HDFS
);
How to determine the correct URL:

1. Execute the following command in Hive CLI or Beeline:
   SHOW CREATE TABLE your_db.your_table;
2. View the LOCATION field in the output, for example:
   LOCATION 'hdfs://namenode:8020/warehouse/your_db.db/your_table'
3. Extract the protocol, host name (or service name), and port as the URL. That is, in the CREATE LOCATION statement, set the URL to hdfs://namenode:8020/.
Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)
Operation: No need to set the PRINCIPAL, KEYTAB, or KRB5CONF parameters.
CREATE LOCATION hdfs_location_ha
URL = 'hdfs://${nameservice_id}' -- We recommend that you use the logical service name.
CREDENTIAL (
CONFIGS = 'dfs.nameservices=${nameservice_id}#dfs.ha.namenodes.${nameservice_id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice_id}.${namenode1}=${namenode1_address}#dfs.namenode.rpc-address.${nameservice_id}.${namenode2}=${namenode2_address}#dfs.ha.automatic-failover.enabled.${nameservice_id}=true#dfs.client.failover.proxy.provider.${nameservice_id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Kerberos mode (standard for production environments)
Applicable scenarios:
- Enterprise-level Hadoop clusters where Kerberos authentication is enabled for Hive Metastore Service (HMS) or Hadoop Distributed File System (HDFS).
Configuration requirements:
HMS Catalog:

CREATE EXTERNAL CATALOG hms_krb
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://metastore:9083',
    PRINCIPAL = 'ob@REALM',
    KEYTAB = '/path/to/ob.keytab',
    KRB5CONF = '/etc/krb5.conf'
);

HDFS Location:

CREATE LOCATION hdfs_krb_loc
    URL = 'hdfs://nn:8020'
    CREDENTIAL (
        PRINCIPAL = 'ob@REALM',
        KEYTAB = '/path/to/ob.keytab',
        KRB5CONF = '/etc/krb5.conf'
    );
Scenario 4: Kerberos authentication is enabled, and HDFS is a single NameNode (non-HA mode)
-- Create a LOCATION: Kerberos authentication + single HDFS
CREATE LOCATION hdfs_kerberos_single
URL = 'hdfs://namenode.example.com:8020/' -- Obtain the complete HDFS path of the table by executing SHOW CREATE TABLE, and extract the root URL.
CREDENTIAL (
PRINCIPAL = "hdfs/TEST@EXAMPLE.COM",
KEYTAB = "/data/hdfs.keytab",
KRB5CONF = "/data/krb5.conf",
CONFIGS = 'dfs.data.transfer.protection=integrity'
);
Note
- The dfs.data.transfer.protection parameter in CONFIGS specifies the security level of the HDFS data transmission channel: authentication, integrity, privacy, or null.
- This value must be consistent with the dfs.data.transfer.protection parameter configured in the hdfs-site.xml file of the Hadoop cluster. Otherwise, data read operations may fail due to mismatched security policies.
- We recommend that you confirm the actual value of this parameter with the Hadoop administrator or directly check the hdfs-site.xml file of the HDFS cluster.
Scenario 5: Kerberos authentication is enabled, and HDFS is highly available (HA)
CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://${nameservice id}' -- We recommend that you use the logical service name.
CREDENTIAL (
PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
KEYTAB = "/etc/ob/hdfs.keytab",
KRB5CONF = "/etc/krb5.conf",
CONFIGS = 'dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=nn1:8020#dfs.namenode.rpc-address.mycluster.nn2=nn2:8020#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Note
- Deploy the same keytab and krb5.conf files on all OBServer nodes.
- If the "Unknown Host" error occurs, add the HDFS node mapping to the /etc/hosts file.
For more information about the Location parameters, see Create Location.
Configuration notes
- Authentication separation: The authentication for the Catalog (metadata) and the Location (data) is configured separately.
- Mode alignment: The authentication mode of OceanBase Database must be consistent with the hadoop.security.authentication setting of the Hadoop cluster.
- Kerberos consistency: If Kerberos is enabled for either HMS or HDFS, Kerberos credentials must be configured for both.
- SIMPLE mode requires permission checks: Ensure that the observer process user or the specified HDFS user has read permissions on the HDFS path.
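As a final sanity check, a minimal verification sequence might look like the following. This is a sketch assuming a catalog named hms_catalog and the example table from this topic; SHOW DATABASES is assumed to list the databases visible through the current catalog:

```sql
-- 1. Metadata layer: can we reach HMS and see its databases?
SET CATALOG hms_catalog;
SHOW DATABASES;

-- 2. Data layer: can we read data files from the storage system?
SELECT * FROM hive_db.customer LIMIT 5;
```

If step 1 fails, check the URI and Kerberos settings of the catalog; if step 2 fails, check the location credentials and HDFS path permissions.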