OceanBase Database V4.4.1 and later allow you to connect to Hive Metastore (HMS) and create an HMS catalog. This way, you can access Hive-managed tables, including traditional Hive tables and Iceberg tables whose metadata is registered in Hive Metastore. You can directly query data managed by HMS in the Hadoop ecosystem without data migration, which is suitable for accelerating data lake queries.
- Unified metadata management: The table schema is automatically synchronized with Hive Metastore.
- Federated queries: Join and analyze data in HMS tables with internal tables in OceanBase Database.
- Read-only security: Prevent accidental modification of source data.
Supported features
Read-only access: All objects under HMS Catalog are read-only. Currently, you cannot perform DML/DDL operations such as
INSERT,UPDATE, andDROP TABLE.Supported table types:
Hive tables: ORC, Parquet, TextFile, and CSV.
- In the current version, only the ARRAY type of Parquet format is supported for complex types.
- For more information about the ARRAY type, see Overview of array element types.
Iceberg tables: V1 and V2. We recommend that you use the Parquet format.
Advanced features:
- Iceberg schema evolution (add and drop columns)
- Iceberg partition transformation (such as bucket() and year() partition expressions)
Hive version compatibility
- Supported versions: Hive 1.2.x, Hive 2.3.x, Hive 3.1.x, and Hive 4.x.
- Note: The following information applies to Iceberg tables.
Hive 4.x and later: Hive engine natively supports Iceberg tables. You can create and update metadata of Iceberg tables by using Hive. OceanBase Database can directly access Iceberg tables through HMS Catalog.
Hive 1.2.x, 2.3.x, and 3.1.x: Hive does not support Iceberg tables. In this case, Iceberg tables are usually created and written to by computing engines such as Spark and Flink, and their metadata is registered in the Hive Metastore. OceanBase Database identifies and reads Iceberg tables based on the metadata stored in HMS.
Therefore, as long as the metadata of an Iceberg table is correctly registered in the Hive Metastore (regardless of which engine writes to the table), OceanBase Database can access the table.
Prerequisites
Environment dependency: If the underlying storage is HDFS, you must deploy the Java SDK environment. For more information, see Deploy the OceanBase Java SDK environment.
Permissions
HMS service: The OceanBase cluster must have access to the Hive Metastore (Thrift protocol). The address and port of the HMS service must be provided by the customer's O&M team. By default, open-source Hive uses port 9083, but this may vary in actual deployment. Please refer to the on-site configuration.
Storage system: All OBServer nodes must have read access to HDFS, S3, or OSS.
Syntax for creating an HMS catalog
CREATE EXTERNAL CATALOG [IF NOT EXISTS] catalog_name
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://host:port',
[PRINCIPAL = '...'],
[KEYTAB = '...'],
[KRB5CONF = '...'],
[MAX_CLIENT_POOL_SIZE = 20],
[SOCKET_TIMEOUT = 10000000]
);
Parameter description
Parameter |
Required |
Description |
|---|---|---|
| TYPE | Yes | Fixed to 'HMS' |
| URI | Yes | The Thrift address of HMS. Format: thrift://$host:$port
"thrift://<HMS IP address 1>:<HMS port number 1>,thrift://<HMS IP address 2>:<HMS port number 2>,thrift://<HMS IP address 3>:<HMS port number 3>". |
| PRINCIPAL | No | The Kerberos principal, which is usually in the service/HOST@REGION.com format. For example, hive/hadoop@QA.COM. |
| KEYTAB | No | The path of the KEYTAB key file required to access the HMS service with Kerberos authentication. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file. Notice: This parameter is required only when Kerberos authentication is enabled for HMS. |
| KRB5CONF | No | The path of the Kerberos configuration file, such as /etc/krb5.conf. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file. |
| MAX_CLIENT_POOL_SIZE | No | The size of the HMS client connection pool. The default value is 20, which indicates that up to 20 clients can be started to connect to the HMS service. |
| SOCKET_TIMEOUT | No | The timeout period for connecting to the Hive metastore, in microseconds. The default value is 10000000 (10 seconds). |
Create an HMS catalog with SIMPLE authentication
In the SIMPLE authentication mode, you do not need to set up location authentication.
obclient> CREATE EXTERNAL CATALOG test_hms_catalog
PROPERTIES = (
TYPE = 'HMS',
URI = "thrift://xxx.xxx.xxx.xxx:xxxx"
);
For more information about the Hadoop authentication mode, see the section titled "Authentication mechanisms of HMS Catalog and HDFS storage" in the following sections.
Create an HMS catalog with Kerberos authentication
For more information about the authentication modes of Hadoop, see the Authentication mechanism of an HMS catalog and HDFS storage section.
Step 1: Create an HMS catalog with Kerberos authentication (metadata layer)
CREATE EXTERNAL CATALOG hms_kerberos
PROPERTIES (
TYPE = 'HMS',
URI = 'thrift://hms.example.com:9083',
PRINCIPAL = 'hive/hms.example.com@EXAMPLE.COM',
KEYTAB = '/etc/ob/hive.keytab',
KRB5CONF = '/etc/krb5.conf'
);
Step 2: Create an HDFS location with Kerberos authentication (data layer)
CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://namenode:8020/'
CREDENTIAL (
PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
KEYTAB = "/etc/ob/hdfs.keytab",
KRB5CONF = "/etc/krb5.conf",
CONFIGS = '...' -- HA configurations as mentioned above
);
The HMS catalog can discover the table structure, and the HDFS location can securely read the HDFS data files.
Switch Catalog
SET CATALOG hms_catalog;
Query external data sources by using a catalog
-- Hive table
SELECT city, COUNT(*) FROM hive_db.customer WHERE dt >= '2025-04-01' GROUP BY city;
-- Iceberg table
SELECT product_id, SUM(sales) FROM iceberg_db.sales_iceberg WHERE event_time >= '2025-04-01' GROUP BY product_id;
-- Federated query
SELECT o.order_id, h.city
FROM internal.test_db.orders o
JOIN hms_catalog.hive_db.customer h ON o.user_id = h.id;
Authentication mechanisms for HMS Catalog and HDFS storage
In OceanBase Database, accessing Hive Metastore (HMS) involves two independent authentication layers:
Catalog layer: This layer is used to connect to Hive Metastore to retrieve metadata such as table structures and partitions.
Location layer: This layer is used to read data files stored in file systems such as HDFS.
The authentication methods for these two layers must be configured separately and must align with the security policies of the Hadoop cluster.
The authentication mode of Hadoop is determined by the server
The Hadoop cluster defines its security mode through the hadoop.security.authentication configuration item in hdfs-site.xml:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value> <!-- or simple -->
</property>
kerberos: Enables Kerberos authentication. All clients (including HMS Client and HDFS Client) must pass Kerberos authentication.simple: Uses the operating system user (i.e., the user who started the process) for permission verification, without Kerberos.
Notice
This configuration is located on the Hadoop server side ($HADOOP_HOME/etc/hadoop/hdfs-site.xml). OceanBase does not directly read this file but adapts to the mode through its own Catalog and Location configurations.
Authentication configurations of OceanBase Database
Hadoop mode |
HMS Catalog configuration |
HDFS Location configuration |
|---|---|---|
| SIMPLE | Do not specify the authentication method when you create a catalog (the default method is SIMPLE). |
|
| KERBEROS | Specify AUTHENTICATION = 'KERBEROS' when you create a catalog, and provide PRINCIPAL, KEYTAB, and KRB5CONF. |
You must create an HDFS location for Kerberos authentication, and provide the principal, keytab, and krb5.conf file. |
Important configuration rules: If Kerberos authentication is enabled for HMS or HDFS, check whether the deployment is correct.
Two authentication modes
SIMPLE mode (common in development and testing environments)
Scenarios:
Kerberos is not enabled for HDFS, and Linux file permissions are used.
Tables are written by specific users (e.g., Impala writes to an HDFS path with owner impala; Hive on Tez writes to a path with owner as the job submission user, such as hive).
Configuration requirements:
Anonymous readable path: The HDFS directory has open permissions (e.g., drwxr-xr-x), allowing any user to read. No Location needs to be created.
Restricted path: If the OceanBase Observer process user (e.g., admin) is not in the directory's authorized users or groups, you must:
Create a Location and specify a user with permissions:
CREATE LOCATION my_loc URI = 'hdfs://mycluster/' CREDENTIAL = (USER = 'username');
Scenario 1: No Kerberos and no HDFS user specified (default anonymous)
Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e.,
hadoop.security.authentication=simple).Action: No Location needs to be created.
Scenario 2: No Kerberos, but HDFS user must be specified
Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e.,
hadoop.security.authentication=simple), but the data files of the target Hive table are stored on a restricted HDFS path (e.g., written by Impala, Hive, or another engine, with the path owner as a specific user and without global read permissions).Typical cases:
- Impala writes to an HDFS path with owner impala.
- Hive job submission user writes to a path with owner as hive, etl_user, etc.
- OceanBase Observer accesses HDFS as the operating system user (e.g., admin). If this user does not have read permissions, queries will fail.
Example: Assume a Hive table is written by Impala, and its HDFS path is
hdfs://namenode:8020/warehouse/sales.db/click_log.
CREATE LOCATION hdfs_impala_data
URL = 'hdfs://namenode:8020/' -- URL, see the description below
CREDENTIAL (
USER = 'username' -- Specify the username for accessing HDFS
);
How to determine the correct URL:
Execute the following command in Hive CLI or Beeline:
SHOW CREATE TABLE your_db.your_table;Look for the LOCATION field in the output, for example:
LOCATION 'hdfs://namenode:8020/warehouse/your_db.db/your_table'Extract the protocol, service name/host, and service port as the URL. That is, in the CREATE LOCATION statement, set
hdfs://namenode:8020/as the URL.
Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)
Action: No need to set the PRINCIPAL, KEYTAB, and KRB5CONF parameters.
CREATE LOCATION hdfs_location_ha
URL = 'hdfs://${nameservice_id}' -- Recommended to use the logical service name
CREDENTIAL (
CONFIGS = 'dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1}, ${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
KERBEROS mode (Standard for production environment)
Applicable scenarios:
- Enterprise-level Hadoop cluster with Kerberos security authentication enabled for HMS or HDFS.
Configuration requirements:
HMS Catalog:
CREATE EXTERNAL CATALOG hms_krb PROPERTIES ( 'type' = 'hms', 'hive.metastore.uris' = 'thrift://metastore:9083', 'authentication' = 'KERBEROS', 'kerberos.principal' = 'ob@REALM', 'kerberos.keytab' = '/path/to/ob.keytab', 'kerberos.krb5.conf' = '/etc/krb5.conf' );HDFS Location:
CREATE LOCATION hdfs_krb_loc URI = 'hdfs://nn:8020', CREDENTIAL = ( KERBEROS = ( PRINCIPAL = 'ob@REALM', KEYTAB = '/path/to/ob.keytab', KRB5CONF = '/etc/krb5.conf' ) );
Scenario 4: Kerberos is enabled, and HDFS is a single NameNode (non-HA mode)
-- Create a LOCATION with Kerberos authentication and a single HDFS node
CREATE LOCATION hdfs_kerberos_single
URL = 'hdfs://namenode.example.com:8020/' -- Obtain the complete HDFS path of the table by executing the SHOW CREATE TABLE command, and extract the root URL.
CREDENTIAL (
PRINCIPAL = "hdfs/TEST@EXAMPLE.COM",
KEYTAB = "/data/hdfs.keytab",
KRB5CONF = "/data/krb5.conf",
CONFIGS = 'dfs.data.transfer.protection=integrity'
);
Note
- The
dfs.data.transfer.protectionparameter inCONFIGSspecifies the security level of the HDFS data transmission channel, such asauthentication,integrity,privacy, ornull.
- This value must be consistent with the
dfs.data.transfer.protectionvalue configured in thehdfs-site.xmlfile of the Hadoop cluster. Otherwise, the data read operation may fail due to mismatched security policies.
- We recommend that you obtain the actual value of this parameter from the Hadoop administrator or directly view the
hdfs-site.xmlfile of the HDFS cluster.
Scenario 5: Kerberos is enabled, and HDFS is highly available (HA)
CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://${nameservice id}' -- We recommend that you use the logical service name.
CREDENTIAL (
PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
KEYTAB = "/etc/ob/hdfs.keytab",
KRB5CONF = "/etc/krb5.conf",
CONFIGS = 'dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=nn1:8020#dfs.namenode.rpc-address.mycluster.nn2=nn2:8020#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Note
- Deploy the same keytab and krb5.conf files on all OBServer nodes.
- If the "Unknown Host" error occurs, add the HDFS node mapping to the /etc/hosts file.
For more information about the Location parameter configuration, see Create Location.
Configuration notes
- Authentication separation: Catalog (metadata) and Location (data) authentication are configured separately.
- Mode alignment: The authentication method of OceanBase must be consistent with the
hadoop.security.authenticationsetting of the Hadoop cluster. - Kerberos consistency: If either HMS or HDFS is enabled for Kerberos, both must be configured with Kerberos credentials.
- Permissions in SIMPLE mode: Ensure that the Observer user or specified USER has read permissions for the HDFS path.
