HMS Catalog|V4.3.5|OceanBase Database| docs|Distributed Database

OceanBase Database V4.4.1 and later allow you to connect to Hive Metastore (HMS) and create an HMS catalog. This way, you can access Hive-managed tables, including traditional Hive tables and Iceberg tables whose metadata is registered in Hive Metastore. You can directly query data managed by HMS in the Hadoop ecosystem without data migration, which is suitable for accelerating data lake queries.

Unified metadata management: The table schema is automatically synchronized with Hive Metastore.
Federated queries: Join and analyze data in HMS tables with internal tables in OceanBase Database.
Read-only security: Prevent accidental modification of source data.

Supported features

Read-only access: All objects under HMS Catalog are read-only. Currently, you cannot perform DML/DDL operations such as INSERT, UPDATE, and DROP TABLE.
Supported table types:
- Hive tables: ORC, Parquet, TextFile, and CSV.
  - In the current version, only the ARRAY type of Parquet format is supported for complex types.
  - For more information about the ARRAY type, see Overview of array element types.
- Iceberg tables: V1 and V2. We recommend that you use the Parquet format.
Advanced features:
- Iceberg schema evolution (add and drop columns)
- Iceberg partition transformation (such as bucket() and year() partition expressions)
Hive version compatibility
- Supported versions: Hive 1.2.x, Hive 2.3.x, Hive 3.1.x, and Hive 4.x.
- Note: The following information applies to Iceberg tables.
  - Hive 4.x and later: Hive engine natively supports Iceberg tables. You can create and update metadata of Iceberg tables by using Hive. OceanBase Database can directly access Iceberg tables through HMS Catalog.
  - Hive 1.2.x, 2.3.x, and 3.1.x: Hive does not support Iceberg tables. In this case, Iceberg tables are usually created and written to by computing engines such as Spark and Flink, and their metadata is registered in the Hive Metastore. OceanBase Database identifies and reads Iceberg tables based on the metadata stored in HMS.
    
    Therefore, as long as the metadata of an Iceberg table is correctly registered in the Hive Metastore (regardless of which engine writes to the table), OceanBase Database can access the table.

Prerequisites

Environment dependency: If the underlying storage is HDFS, you must deploy the Java SDK environment. For more information, see Deploy the OceanBase Java SDK environment.
Permissions
- HMS service: The OceanBase cluster must have access to the Hive Metastore (Thrift protocol). The address and port of the HMS service must be provided by the customer's O&M team. By default, open-source Hive uses port 9083, but this may vary in actual deployment. Please refer to the on-site configuration.
- Storage system: All OBServer nodes must have read access to HDFS, S3, or OSS.

Syntax for creating an HMS catalog

CREATE EXTERNAL CATALOG [IF NOT EXISTS] catalog_name
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://host:port',
    [PRINCIPAL = '...'],
    [KEYTAB = '...'],
    [KRB5CONF = '...'],
    [MAX_CLIENT_POOL_SIZE = 20],
    [SOCKET_TIMEOUT = 10000000]
);

Parameter description

Parameter	Required	Description
TYPE	Yes	Fixed to `'HMS'`
URI	Yes	The Thrift address of HMS. Format: `thrift://$host:$port` `$host`: the IP address of the Thrift service. `$port`: the port of the Thrift service. The IP address and port of the HMS service are provided by the customer's O&M team. In an open-source Hive cluster, the default port is 9083, but the actual port may vary. Please refer to the on-site configuration. If high availability mode is enabled for your HMS, you can specify multiple HMS addresses separated by commas. For example: `"thrift://<HMS IP address 1>:<HMS port number 1>,thrift://<HMS IP address 2>:<HMS port number 2>,thrift://<HMS IP address 3>:<HMS port number 3>"`.
PRINCIPAL	No	The Kerberos principal, which is usually in the `service/HOST@REGION.com` format. For example, `hive/hadoop@QA.COM`.
KEYTAB	No	The path of the KEYTAB key file required to access the HMS service with Kerberos authentication. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file. Notice: This parameter is required only when Kerberos authentication is enabled for HMS.
KRB5CONF	No	The path of the Kerberos configuration file, such as `/etc/krb5.conf`. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file.
MAX_CLIENT_POOL_SIZE	No	The size of the HMS client connection pool. The default value is 20, which indicates that up to 20 clients can be started to connect to the HMS service.
SOCKET_TIMEOUT	No	The timeout period for connecting to the Hive metastore, in microseconds. The default value is 10000000 (10 seconds).

Create an HMS catalog with SIMPLE authentication

In the SIMPLE authentication mode, you do not need to set up location authentication.

obclient> CREATE EXTERNAL CATALOG test_hms_catalog
    PROPERTIES = (
        TYPE = 'HMS',
        URI = "thrift://xxx.xxx.xxx.xxx:xxxx"
    );

For more information about the Hadoop authentication mode, see the section titled "Authentication mechanisms of HMS Catalog and HDFS storage" in the following sections.

Create an HMS catalog with Kerberos authentication

For more information about the authentication modes of Hadoop, see the Authentication mechanism of an HMS catalog and HDFS storage section.

Step 1: Create an HMS catalog with Kerberos authentication (metadata layer)

CREATE EXTERNAL CATALOG hms_kerberos
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://hms.example.com:9083',
    PRINCIPAL = 'hive/hms.example.com@EXAMPLE.COM',
    KEYTAB = '/etc/ob/hive.keytab',
    KRB5CONF = '/etc/krb5.conf'
);

Step 2: Create an HDFS location with Kerberos authentication (data layer)

CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://namenode:8020/'
CREDENTIAL (
    PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
    KEYTAB   = "/etc/ob/hdfs.keytab",
    KRB5CONF = "/etc/krb5.conf",
    CONFIGS  = '...'  -- HA configurations as mentioned above
);

The HMS catalog can discover the table structure, and the HDFS location can securely read the HDFS data files.

Switch Catalog

SET CATALOG hms_catalog;

Query external data sources by using a catalog

-- Hive table
SELECT city, COUNT(*) FROM hive_db.customer WHERE dt >= '2025-04-01' GROUP BY city;

-- Iceberg table
SELECT product_id, SUM(sales) FROM iceberg_db.sales_iceberg WHERE event_time >= '2025-04-01' GROUP BY product_id;

-- Federated query
SELECT o.order_id, h.city
FROM internal.test_db.orders o
JOIN hms_catalog.hive_db.customer h ON o.user_id = h.id;

Authentication mechanisms for HMS Catalog and HDFS storage

In OceanBase Database, accessing Hive Metastore (HMS) involves two independent authentication layers:

Catalog layer: This layer is used to connect to Hive Metastore to retrieve metadata such as table structures and partitions.
Location layer: This layer is used to read data files stored in file systems such as HDFS.

The authentication methods for these two layers must be configured separately and must align with the security policies of the Hadoop cluster.

The authentication mode of Hadoop is determined by the server

The Hadoop cluster defines its security mode through the hadoop.security.authentication configuration item in hdfs-site.xml:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- or simple -->
</property>

kerberos: Enables Kerberos authentication. All clients (including HMS Client and HDFS Client) must pass Kerberos authentication.
simple: Uses the operating system user (i.e., the user who started the process) for permission verification, without Kerberos.

Notice

This configuration is located on the Hadoop server side ($HADOOP_HOME/etc/hadoop/hdfs-site.xml). OceanBase does not directly read this file but adapts to the mode through its own Catalog and Location configurations.

Authentication configurations of OceanBase Database

Hadoop mode	HMS Catalog configuration	HDFS Location configuration
SIMPLE	Do not specify the authentication method when you create a catalog (the default method is SIMPLE).	If the HDFS directory is readable by all users (for example, the directory has the `drwxr-xr-x` permission), you do not need to create a location. If the directory is accessible only by specific users (for example, the owner is hive or impala), you must explicitly specify the HDFS user who has the permission by using `CREATE LOCATION ... CREDENTIAL (USER = 'xxx')`.
KERBEROS	Specify `AUTHENTICATION = 'KERBEROS'` when you create a catalog, and provide `PRINCIPAL`, `KEYTAB`, and `KRB5CONF`.	You must create an HDFS location for Kerberos authentication, and provide the principal, keytab, and krb5.conf file.

Hadoop mode

HMS Catalog configuration

HDFS Location configuration

SIMPLE

Do not specify the authentication method when you create a catalog (the default method is SIMPLE).

If the HDFS directory is readable by all users (for example, the directory has the drwxr-xr-x permission), you do not need to create a location.

If the directory is accessible only by specific users (for example, the owner is hive or impala), you must explicitly specify the HDFS user who has the permission by using CREATE LOCATION ... CREDENTIAL (USER = 'xxx').

KERBEROS Specify AUTHENTICATION = 'KERBEROS' when you create a catalog, and provide PRINCIPAL, KEYTAB, and KRB5CONF. You must create an HDFS location for Kerberos authentication, and provide the principal, keytab, and krb5.conf file.

Important configuration rules: If Kerberos authentication is enabled for HMS or HDFS, check whether the deployment is correct.

Two authentication modes

SIMPLE mode (common in development and testing environments)

Scenarios:

Kerberos is not enabled for HDFS, and Linux file permissions are used.
Tables are written by specific users (e.g., Impala writes to an HDFS path with owner impala; Hive on Tez writes to a path with owner as the job submission user, such as hive).

Configuration requirements:

Anonymous readable path: The HDFS directory has open permissions (e.g., drwxr-xr-x), allowing any user to read. No Location needs to be created.
Restricted path: If the OceanBase Observer process user (e.g., admin) is not in the directory's authorized users or groups, you must:
- Create a Location and specify a user with permissions:
```
CREATE LOCATION my_loc URI = 'hdfs://mycluster/'
CREDENTIAL = (USER = 'username');
```

Scenario 1: No Kerberos and no HDFS user specified (default anonymous)

Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e., hadoop.security.authentication=simple).
Action: No Location needs to be created.

Scenario 2: No Kerberos, but HDFS user must be specified

Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e., hadoop.security.authentication=simple), but the data files of the target Hive table are stored on a restricted HDFS path (e.g., written by Impala, Hive, or another engine, with the path owner as a specific user and without global read permissions).
Typical cases:
- Impala writes to an HDFS path with owner impala.
- Hive job submission user writes to a path with owner as hive, etl_user, etc.
- OceanBase Observer accesses HDFS as the operating system user (e.g., admin). If this user does not have read permissions, queries will fail.
Example: Assume a Hive table is written by Impala, and its HDFS path is hdfs://namenode:8020/warehouse/sales.db/click_log.

CREATE LOCATION hdfs_impala_data
URL = 'hdfs://namenode:8020/'    -- URL, see the description below
CREDENTIAL (
  USER = 'username'  -- Specify the username for accessing HDFS
);

How to determine the correct URL:

Execute the following command in Hive CLI or Beeline:
```
SHOW CREATE TABLE your_db.your_table;
```

Look for the LOCATION field in the output, for example:

LOCATION 'hdfs://namenode:8020/warehouse/your_db.db/your_table'

Extract the protocol, service name/host, and service port as the URL. That is, in the CREATE LOCATION statement, set hdfs://namenode:8020/ as the URL.

Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)

Action: No need to set the PRINCIPAL, KEYTAB, and KRB5CONF parameters.

CREATE LOCATION hdfs_location_ha
    URL = 'hdfs://${nameservice_id}'  -- Recommended to use the logical service name
    CREDENTIAL (
        CONFIGS = 'dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1}, ${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
        );

KERBEROS mode (Standard for production environment)

Applicable scenarios:

Enterprise-level Hadoop cluster with Kerberos security authentication enabled for HMS or HDFS.

Configuration requirements:

HMS Catalog:

CREATE EXTERNAL CATALOG hms_krb
  PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://metastore:9083',
    'authentication' = 'KERBEROS',
    'kerberos.principal' = 'ob@REALM',
    'kerberos.keytab' = '/path/to/ob.keytab',
    'kerberos.krb5.conf' = '/etc/krb5.conf'
  );

HDFS Location:

CREATE LOCATION hdfs_krb_loc
  URI = 'hdfs://nn:8020',
  CREDENTIAL = (
    KERBEROS = (
      PRINCIPAL = 'ob@REALM',
      KEYTAB = '/path/to/ob.keytab',
      KRB5CONF = '/etc/krb5.conf'
    )
  );

Scenario 4: Kerberos is enabled, and HDFS is a single NameNode (non-HA mode)

-- Create a LOCATION with Kerberos authentication and a single HDFS node
CREATE LOCATION hdfs_kerberos_single
URL = 'hdfs://namenode.example.com:8020/'   -- Obtain the complete HDFS path of the table by executing the SHOW CREATE TABLE command, and extract the root URL.
CREDENTIAL (
  PRINCIPAL = "hdfs/TEST@EXAMPLE.COM",
  KEYTAB   = "/data/hdfs.keytab",
  KRB5CONF = "/data/krb5.conf",
  CONFIGS  = 'dfs.data.transfer.protection=integrity'
);

Note

The dfs.data.transfer.protection parameter in CONFIGS specifies the security level of the HDFS data transmission channel, such as authentication, integrity, privacy, or null.

This value must be consistent with the dfs.data.transfer.protection value configured in the hdfs-site.xml file of the Hadoop cluster. Otherwise, the data read operation may fail due to mismatched security policies.

We recommend that you obtain the actual value of this parameter from the Hadoop administrator or directly view the hdfs-site.xml file of the HDFS cluster.

Scenario 5: Kerberos is enabled, and HDFS is highly available (HA)

CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://${nameservice id}'  -- We recommend that you use the logical service name.
CREDENTIAL (
    PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
    KEYTAB   = "/etc/ob/hdfs.keytab",
    KRB5CONF = "/etc/krb5.conf",
    CONFIGS  = 'dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=nn1:8020#dfs.namenode.rpc-address.mycluster.nn2=nn2:8020#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

Note

Deploy the same keytab and krb5.conf files on all OBServer nodes.

If the "Unknown Host" error occurs, add the HDFS node mapping to the /etc/hosts file.

For more information about the Location parameter configuration, see Create Location.

Configuration notes

Authentication separation: Catalog (metadata) and Location (data) authentication are configured separately.
Mode alignment: The authentication method of OceanBase must be consistent with the hadoop.security.authentication setting of the Hadoop cluster.
Kerberos consistency: If either HMS or HDFS is enabled for Kerberos, both must be configured with Kerberos credentials.
Permissions in SIMPLE mode: Ensure that the Observer user or specified USER has read permissions for the HDFS path.

References

Unified metadata management: The table schema is automatically synchronized with Hive Metastore.
Federated queries: Join and analyze data in HMS tables with internal tables in OceanBase Database.
Read-only security: Prevent accidental modification of source data.

Supported features

Read-only access: All objects under HMS Catalog are read-only. Currently, you cannot perform DML/DDL operations such as INSERT, UPDATE, and DROP TABLE.
Supported table types:
- Hive tables: ORC, Parquet, TextFile, and CSV.
  - In the current version, only the ARRAY type of Parquet format is supported for complex types.
  - For more information about the ARRAY type, see Overview of array element types.
- Iceberg tables: V1 and V2. We recommend that you use the Parquet format.
Advanced features:
- Iceberg schema evolution (add and drop columns)
- Iceberg partition transformation (such as bucket() and year() partition expressions)
Hive version compatibility
- Supported versions: Hive 1.2.x, Hive 2.3.x, Hive 3.1.x, and Hive 4.x.
- Note: The following information applies to Iceberg tables.
  - Hive 4.x and later: Hive engine natively supports Iceberg tables. You can create and update metadata of Iceberg tables by using Hive. OceanBase Database can directly access Iceberg tables through HMS Catalog.
  - Hive 1.2.x, 2.3.x, and 3.1.x: Hive does not support Iceberg tables. In this case, Iceberg tables are usually created and written to by computing engines such as Spark and Flink, and their metadata is registered in the Hive Metastore. OceanBase Database identifies and reads Iceberg tables based on the metadata stored in HMS.
    
    Therefore, as long as the metadata of an Iceberg table is correctly registered in the Hive Metastore (regardless of which engine writes to the table), OceanBase Database can access the table.

Prerequisites

Environment dependency: If the underlying storage is HDFS, you must deploy the Java SDK environment. For more information, see Deploy the OceanBase Java SDK environment.
Permissions
- HMS service: The OceanBase cluster must have access to the Hive Metastore (Thrift protocol). The address and port of the HMS service must be provided by the customer's O&M team. By default, open-source Hive uses port 9083, but this may vary in actual deployment. Please refer to the on-site configuration.
- Storage system: All OBServer nodes must have read access to HDFS, S3, or OSS.

Syntax for creating an HMS catalog

CREATE EXTERNAL CATALOG [IF NOT EXISTS] catalog_name
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://host:port',
    [PRINCIPAL = '...'],
    [KEYTAB = '...'],
    [KRB5CONF = '...'],
    [MAX_CLIENT_POOL_SIZE = 20],
    [SOCKET_TIMEOUT = 10000000]
);

Parameter description

Parameter	Required	Description
TYPE	Yes	Fixed to `'HMS'`
URI	Yes	The Thrift address of HMS. Format: `thrift://$host:$port` `$host`: the IP address of the Thrift service. `$port`: the port of the Thrift service. The IP address and port of the HMS service are provided by the customer's O&M team. In an open-source Hive cluster, the default port is 9083, but the actual port may vary. Please refer to the on-site configuration. If high availability mode is enabled for your HMS, you can specify multiple HMS addresses separated by commas. For example: `"thrift://<HMS IP address 1>:<HMS port number 1>,thrift://<HMS IP address 2>:<HMS port number 2>,thrift://<HMS IP address 3>:<HMS port number 3>"`.
PRINCIPAL	No	The Kerberos principal, which is usually in the `service/HOST@REGION.com` format. For example, `hive/hadoop@QA.COM`.
KEYTAB	No	The path of the KEYTAB key file required to access the HMS service with Kerberos authentication. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file. Notice: This parameter is required only when Kerberos authentication is enabled for HMS.
KRB5CONF	No	The path of the Kerberos configuration file, such as `/etc/krb5.conf`. If the OceanBase cluster is deployed in a distributed manner, the corresponding OBServer nodes on all machines must have this file.
MAX_CLIENT_POOL_SIZE	No	The size of the HMS client connection pool. The default value is 20, which indicates that up to 20 clients can be started to connect to the HMS service.
SOCKET_TIMEOUT	No	The timeout period for connecting to the Hive metastore, in microseconds. The default value is 10000000 (10 seconds).

Create an HMS catalog with SIMPLE authentication

In the SIMPLE authentication mode, you do not need to set up location authentication.

obclient> CREATE EXTERNAL CATALOG test_hms_catalog
    PROPERTIES = (
        TYPE = 'HMS',
        URI = "thrift://xxx.xxx.xxx.xxx:xxxx"
    );

For more information about the Hadoop authentication mode, see the section titled "Authentication mechanisms of HMS Catalog and HDFS storage" in the following sections.

Create an HMS catalog with Kerberos authentication

For more information about the authentication modes of Hadoop, see the Authentication mechanism of an HMS catalog and HDFS storage section.

Step 1: Create an HMS catalog with Kerberos authentication (metadata layer)

CREATE EXTERNAL CATALOG hms_kerberos
PROPERTIES (
    TYPE = 'HMS',
    URI = 'thrift://hms.example.com:9083',
    PRINCIPAL = 'hive/hms.example.com@EXAMPLE.COM',
    KEYTAB = '/etc/ob/hive.keytab',
    KRB5CONF = '/etc/krb5.conf'
);

Step 2: Create an HDFS location with Kerberos authentication (data layer)

CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://namenode:8020/'
CREDENTIAL (
    PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
    KEYTAB   = "/etc/ob/hdfs.keytab",
    KRB5CONF = "/etc/krb5.conf",
    CONFIGS  = '...'  -- HA configurations as mentioned above
);

The HMS catalog can discover the table structure, and the HDFS location can securely read the HDFS data files.

Switch Catalog

SET CATALOG hms_catalog;

Query external data sources by using a catalog

-- Hive table
SELECT city, COUNT(*) FROM hive_db.customer WHERE dt >= '2025-04-01' GROUP BY city;

-- Iceberg table
SELECT product_id, SUM(sales) FROM iceberg_db.sales_iceberg WHERE event_time >= '2025-04-01' GROUP BY product_id;

-- Federated query
SELECT o.order_id, h.city
FROM internal.test_db.orders o
JOIN hms_catalog.hive_db.customer h ON o.user_id = h.id;

Authentication mechanisms for HMS Catalog and HDFS storage

In OceanBase Database, accessing Hive Metastore (HMS) involves two independent authentication layers:

Catalog layer: This layer is used to connect to Hive Metastore to retrieve metadata such as table structures and partitions.
Location layer: This layer is used to read data files stored in file systems such as HDFS.

The authentication methods for these two layers must be configured separately and must align with the security policies of the Hadoop cluster.

The authentication mode of Hadoop is determined by the server

The Hadoop cluster defines its security mode through the hadoop.security.authentication configuration item in hdfs-site.xml:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- or simple -->
</property>

kerberos: Enables Kerberos authentication. All clients (including HMS Client and HDFS Client) must pass Kerberos authentication.
simple: Uses the operating system user (i.e., the user who started the process) for permission verification, without Kerberos.

Notice

Authentication configurations of OceanBase Database

Hadoop mode	HMS Catalog configuration	HDFS Location configuration
SIMPLE	Do not specify the authentication method when you create a catalog (the default method is SIMPLE).	If the HDFS directory is readable by all users (for example, the directory has the `drwxr-xr-x` permission), you do not need to create a location. If the directory is accessible only by specific users (for example, the owner is hive or impala), you must explicitly specify the HDFS user who has the permission by using `CREATE LOCATION ... CREDENTIAL (USER = 'xxx')`.
KERBEROS	Specify `AUTHENTICATION = 'KERBEROS'` when you create a catalog, and provide `PRINCIPAL`, `KEYTAB`, and `KRB5CONF`.	You must create an HDFS location for Kerberos authentication, and provide the principal, keytab, and krb5.conf file.

Hadoop mode

HMS Catalog configuration

HDFS Location configuration

SIMPLE

Do not specify the authentication method when you create a catalog (the default method is SIMPLE).

If the HDFS directory is readable by all users (for example, the directory has the drwxr-xr-x permission), you do not need to create a location.

If the directory is accessible only by specific users (for example, the owner is hive or impala), you must explicitly specify the HDFS user who has the permission by using CREATE LOCATION ... CREDENTIAL (USER = 'xxx').

Important configuration rules: If Kerberos authentication is enabled for HMS or HDFS, check whether the deployment is correct.

Two authentication modes

SIMPLE mode (common in development and testing environments)

Scenarios:

Kerberos is not enabled for HDFS, and Linux file permissions are used.
Tables are written by specific users (e.g., Impala writes to an HDFS path with owner impala; Hive on Tez writes to a path with owner as the job submission user, such as hive).

Configuration requirements:

Anonymous readable path: The HDFS directory has open permissions (e.g., drwxr-xr-x), allowing any user to read. No Location needs to be created.
Restricted path: If the OceanBase Observer process user (e.g., admin) is not in the directory's authorized users or groups, you must:
- Create a Location and specify a user with permissions:
```
CREATE LOCATION my_loc URI = 'hdfs://mycluster/'
CREDENTIAL = (USER = 'username');
```

Scenario 1: No Kerberos and no HDFS user specified (default anonymous)

Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e., hadoop.security.authentication=simple).
Action: No Location needs to be created.

Scenario 2: No Kerberos, but HDFS user must be specified

Applicable to: Development or testing environments where Kerberos authentication is not enabled for the HDFS cluster (i.e., hadoop.security.authentication=simple), but the data files of the target Hive table are stored on a restricted HDFS path (e.g., written by Impala, Hive, or another engine, with the path owner as a specific user and without global read permissions).
Typical cases:
- Impala writes to an HDFS path with owner impala.
- Hive job submission user writes to a path with owner as hive, etl_user, etc.
- OceanBase Observer accesses HDFS as the operating system user (e.g., admin). If this user does not have read permissions, queries will fail.
Example: Assume a Hive table is written by Impala, and its HDFS path is hdfs://namenode:8020/warehouse/sales.db/click_log.

CREATE LOCATION hdfs_impala_data
URL = 'hdfs://namenode:8020/'    -- URL, see the description below
CREDENTIAL (
  USER = 'username'  -- Specify the username for accessing HDFS
);

How to determine the correct URL:

Execute the following command in Hive CLI or Beeline:
```
SHOW CREATE TABLE your_db.your_table;
```

Look for the LOCATION field in the output, for example:

LOCATION 'hdfs://namenode:8020/warehouse/your_db.db/your_table'

Extract the protocol, service name/host, and service port as the URL. That is, in the CREATE LOCATION statement, set hdfs://namenode:8020/ as the URL.

Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)

Action: No need to set the PRINCIPAL, KEYTAB, and KRB5CONF parameters.

CREATE LOCATION hdfs_location_ha
    URL = 'hdfs://${nameservice_id}'  -- Recommended to use the logical service name
    CREDENTIAL (
        CONFIGS = 'dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1}, ${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
        );

KERBEROS mode (Standard for production environment)

Applicable scenarios:

Enterprise-level Hadoop cluster with Kerberos security authentication enabled for HMS or HDFS.

Configuration requirements:

HMS Catalog:

CREATE EXTERNAL CATALOG hms_krb
  PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://metastore:9083',
    'authentication' = 'KERBEROS',
    'kerberos.principal' = 'ob@REALM',
    'kerberos.keytab' = '/path/to/ob.keytab',
    'kerberos.krb5.conf' = '/etc/krb5.conf'
  );

HDFS Location:

CREATE LOCATION hdfs_krb_loc
  URI = 'hdfs://nn:8020',
  CREDENTIAL = (
    KERBEROS = (
      PRINCIPAL = 'ob@REALM',
      KEYTAB = '/path/to/ob.keytab',
      KRB5CONF = '/etc/krb5.conf'
    )
  );

Scenario 4: Kerberos is enabled, and HDFS is a single NameNode (non-HA mode)

-- Create a LOCATION with Kerberos authentication and a single HDFS node
CREATE LOCATION hdfs_kerberos_single
URL = 'hdfs://namenode.example.com:8020/'   -- Obtain the complete HDFS path of the table by executing the SHOW CREATE TABLE command, and extract the root URL.
CREDENTIAL (
  PRINCIPAL = "hdfs/TEST@EXAMPLE.COM",
  KEYTAB   = "/data/hdfs.keytab",
  KRB5CONF = "/data/krb5.conf",
  CONFIGS  = 'dfs.data.transfer.protection=integrity'
);

Note

The dfs.data.transfer.protection parameter in CONFIGS specifies the security level of the HDFS data transmission channel, such as authentication, integrity, privacy, or null.

This value must be consistent with the dfs.data.transfer.protection value configured in the hdfs-site.xml file of the Hadoop cluster. Otherwise, the data read operation may fail due to mismatched security policies.

We recommend that you obtain the actual value of this parameter from the Hadoop administrator or directly view the hdfs-site.xml file of the HDFS cluster.

Scenario 5: Kerberos is enabled, and HDFS is highly available (HA)

CREATE LOCATION hdfs_kerberos_ha
URL = 'hdfs://${nameservice id}'  -- We recommend that you use the logical service name.
CREDENTIAL (
    PRINCIPAL = "ob_hdfs@EXAMPLE.COM",
    KEYTAB   = "/etc/ob/hdfs.keytab",
    KRB5CONF = "/etc/krb5.conf",
    CONFIGS  = 'dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=nn1:8020#dfs.namenode.rpc-address.mycluster.nn2=nn2:8020#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

Note

Deploy the same keytab and krb5.conf files on all OBServer nodes.

If the "Unknown Host" error occurs, add the HDFS node mapping to the /etc/hosts file.

For more information about the Location parameter configuration, see Create Location.

Configuration notes

Authentication separation: Catalog (metadata) and Location (data) authentication are configured separately.
Mode alignment: The authentication method of OceanBase must be consistent with the hadoop.security.authentication setting of the Hadoop cluster.
Kerberos consistency: If either HMS or HDFS is enabled for Kerberos, both must be configured with Kerberos credentials.
Permissions in SIMPLE mode: Ensure that the Observer user or specified USER has read permissions for the HDFS path.

OceanBase

Customer Stories

Documentation

HMS Catalog

Supported features

Prerequisites

Syntax for creating an HMS catalog

Parameter description

Create an HMS catalog with SIMPLE authentication

Create an HMS catalog with Kerberos authentication

Step 1: Create an HMS catalog with Kerberos authentication (metadata layer)

Step 2: Create an HDFS location with Kerberos authentication (data layer)

Switch Catalog

Query external data sources by using a catalog

Authentication mechanisms for HMS Catalog and HDFS storage

The authentication mode of Hadoop is determined by the server

Notice

Authentication configurations of OceanBase Database

Two authentication modes

SIMPLE mode (common in development and testing environments)

Scenario 1: No Kerberos and no HDFS user specified (default anonymous)

Scenario 2: No Kerberos, but HDFS user must be specified

Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)

KERBEROS mode (Standard for production environment)

Scenario 4: Kerberos is enabled, and HDFS is a single NameNode (non-HA mode)

Note

Scenario 5: Kerberos is enabled, and HDFS is highly available (HA)

Note

Configuration notes

References

HMS Catalog

Supported features

Prerequisites

Syntax for creating an HMS catalog

Parameter description

Create an HMS catalog with SIMPLE authentication

Create an HMS catalog with Kerberos authentication

Step 1: Create an HMS catalog with Kerberos authentication (metadata layer)

Step 2: Create an HDFS location with Kerberos authentication (data layer)

Switch Catalog

Query external data sources by using a catalog

Authentication mechanisms for HMS Catalog and HDFS storage

The authentication mode of Hadoop is determined by the server

Notice

Authentication configurations of OceanBase Database

Two authentication modes

SIMPLE mode (common in development and testing environments)

Scenario 1: No Kerberos and no HDFS user specified (default anonymous)

Scenario 2: No Kerberos, but HDFS user must be specified

Scenario 3: Kerberos is not enabled, and HDFS is highly available (HA)

KERBEROS mode (Standard for production environment)

Scenario 4: Kerberos is enabled, and HDFS is a single NameNode (non-HA mode)

Note

Scenario 5: Kerberos is enabled, and HDFS is highly available (HA)

Note

Configuration notes

References