Generally, the data of a normal table in a database is stored in the storage space of the database, whereas the data of an external table is stored in an external storage service. When you create an external table, you must specify the path to and format of the relevant data files. After you create an external table, you can read data from files in the external storage service by using the external table.
An external table can be linked, aggregated, and sorted with another table like a normal table. External tables differ from normal tables in the following aspects:
The data of external tables is stored in external files, whereas that of normal tables is stored in databases.
External tables are read-only and can be used in query statements. However, you cannot perform Data Manipulation Language (DML) operations on external tables.
You cannot add constraints or create indexes on external tables.
Generally, the access speed of an external table is lower than that of a normal table.
HDFS external table
Read data from an HDFS external table
Hadoop Distributed File System (HDFS) is a core component of the Hadoop ecosystem, used for storing and processing large-scale datasets. Therefore, starting from V4.3.5 BP1, OceanBase Database has added support for reading data from HDFS external tables, allowing direct access to data stored in HDFS.
For detailed instructions on creating an HDFS external table, see CREATE EXTERNAL TABLE.
Since the HDFS SDK is developed in Java and OceanBase is developed in C++, it is necessary to use the Java Native Interface (JNI) framework as a bridge between the two. Similarly, the Java SDK for ODPS also requires a Java environment to run. In order to use the HDFS external table feature, you need to configure the Java environment and control it through specific parameters, enabling the creation of tables that can access files on HDFS. The relevant parameters are as follows:
- ob_enable_java_env
- ob_java_home
- ob_java_connector_path
- ob_java_opts
For more information on configuring the Java environment, see Deploy the Java SDK environment for OceanBase Database.
ODPS external tables
MaxCompute (ODPS) provides the Storage API and Tunnel API:
- Storage API: a data service interface that provides efficient, low-latency, and secure data read methods.
- Tunnel API: a data upload/download interface primarily used for bulk operations on table data, such as importing and exporting full table data.
OceanBase Database can access tables in ODPS by adapting the ODPS APIs. When creating an external table in ODPS, OceanBase Database provides parameter configuration options for both the Storage API and Tunnel API. For more information, see CREATE EXTERNAL TABLE. The following table describes the selection of the Storage API and Tunnel API.
Note
For OceanBase Database V4.3.5, the ODPS Storage API is supported starting from V4.3.5 BP3.
| Dimension | Storage API | Tunnel API | Scenarios |
|---|---|---|---|
| Features | Supports fine-grained data access (such as partition filtering and predicate pushdown). | Focuses on efficient full-table data import/export and lacks conditional filtering capabilities. | Integrated for HTAP mixed workloads, partitioned table conditional queries, and deep integration with computing engines (such as Spark and ODPS). |
| Sharding strategy | Automatic sharding: dynamically splits tasks by byte or row count to improve parallel efficiency. | Manual sharding: requires developers to calculate partition size or row count, with complex configuration. Performance is lower compared to the Storage API, and it does not have special requirements for ODPS resource configuration, being compatible with all ODPS configurations. |
|
| Performance optimization | Low resource consumption: predicate pushdown reduces the amount of data transmitted, with computations pushed to the database side, resulting in faster query speeds. | High resource consumption: full transmission can occupy a large amount of bandwidth and storage. | When you need to reduce the amount of data transmitted and improve HTAP efficiency, choose the Storage API. For simple ETL or full backup tasks, choose the Tunnel API. |
| Environment requirements | OceanBase Database V4.3.5 BP3. | No special limitations. | When the environment supports VPS, prioritize the Storage API. For older versions or simple scenarios, choose the Tunnel API. |
| Data Filtering Capability | Supported: data can be filtered using SQL conditions (such as WHERE), transmitting only the required data. |
Not supported: full transmission is required, followed by local filtering. | When you need to filter data based on conditions (such as analyzing specific user behavior), choose the Storage API. |