Usually, table data in a database is stored within the database's storage space, whereas the data of an external table is stored in an external storage service. When creating an external table, you need to specify the path and format of the data files. After the external table is created, you can use it to read data from the external storage service.
External tables can be used just like regular tables: they can be joined, aggregated, sorted, and so on. The differences between external tables and regular tables are as follows:
- The data of an external table is stored in external files, while the data of a regular table is stored within the database.
- External tables are read-only. You can use them in query statements, but you cannot perform DML operations on them.
- External tables do not support adding constraints or creating indexes.
- In general, accessing external tables is slower than accessing regular tables.
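As a minimal sketch of the points above, the statements below define an external table over CSV files and then join it with a regular table. The table names, columns, and directory are hypothetical, and the clause layout follows the general shape of CREATE EXTERNAL TABLE; check that reference for the options your version supports.

```sql
-- Hypothetical external table over CSV files; the directory is a placeholder.
CREATE EXTERNAL TABLE ext_orders (
  order_id    INT,
  customer_id INT,
  amount      DECIMAL(10, 2)
)
LOCATION = '/data/external/orders/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);

-- The external table is queried like a regular table: joined, aggregated, sorted.
-- customers is assumed to be a regular table that already exists.
SELECT c.customer_name, SUM(o.amount) AS total_amount
FROM ext_orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY total_amount DESC;
```

Because external tables are read-only, DML statements such as INSERT, UPDATE, or DELETE against ext_orders are not expected to succeed.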
HDFS external tables
Read data from HDFS external tables
The Hadoop Distributed File System (HDFS) is a core component of the Hadoop ecosystem, designed to store and process large-scale datasets. To allow direct access to data in HDFS, OceanBase Database supports reading HDFS data through external tables.
For more information about creating an HDFS external table (where files are stored in HDFS), see CREATE EXTERNAL TABLE.
The HDFS SDK is developed in Java, whereas OceanBase Database is built in C++, so the two are bridged through the Java Native Interface (JNI) framework. The Java SDK for ODPS likewise requires a Java environment to run. To use the HDFS external table feature, you must configure a Java environment and control it through specific parameters before you can create tables that access HDFS files. The relevant parameters are as follows:
For more information about configuring the Java environment, see Deploy the Java SDK environment for OceanBase Database.
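The following is a rough sketch of what an HDFS-backed external table might look like once the Java environment is in place. The NameNode host, port, directory, table name, and columns are all hypothetical, and the use of an hdfs:// URI in LOCATION is an assumption here; confirm the exact form in CREATE EXTERNAL TABLE.

```sql
-- Hypothetical HDFS-backed external table; host, port, and directory are placeholders.
CREATE EXTERNAL TABLE ext_hdfs_events (
  event_id   INT,
  user_id    INT,
  event_type VARCHAR(32)
)
LOCATION = 'hdfs://namenode.example.com:8020/warehouse/events/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);

-- Read the HDFS files through the external table.
SELECT event_type, COUNT(*) AS event_count
FROM ext_hdfs_events
GROUP BY event_type;
```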
Write data to HDFS external tables
OceanBase Database V4.4.0 supports writing data to HDFS external tables. For more information about this feature, see SELECT INTO.
ODPS external tables
MaxCompute (ODPS) provides two open APIs: Storage API and Tunnel API.
- Storage API: A data service interface that offers efficient, low-latency, and secure data reading.
- Tunnel API: A data upload and download interface, mainly used for batch operations on table data (such as full table import and export).
By adapting the ODPS APIs, OceanBase Database can access tables in ODPS through external tables. When you create an external table for ODPS, OceanBase Database provides parameter configuration options for both the Storage API and Tunnel API. For more information, see CREATE EXTERNAL TABLE. The table below outlines the differences between the Storage API and Tunnel API.
| Dimension | Storage API | Tunnel API | Applicable scenarios |
|---|---|---|---|
| Features | Supports fine-grained data access (such as partition filtering and predicate pushdown). | Focuses on efficient full-table data import and export, without support for conditional filtering. | Suitable for HTAP mixed workloads, conditional queries on partitioned tables, and deep integration with computing engines (such as Apache Spark and ODPS). |
| Sharding strategy | Automatic sharding: dynamically splits tasks by bytes or rows to improve parallel efficiency. | Manual sharding: you need to calculate the partition size or row count yourself, and configuration is relatively complex. Performance is lower than that of the Storage API, with no special requirements for ODPS resource configuration. Compatible with all ODPS configurations. | |
| Performance optimization | Low resource consumption: Predicate pushdown reduces the amount of data to be transmitted, and computation is pushed to the database side, resulting in faster queries. | High resource consumption: Full data transmission may occupy a large amount of bandwidth and storage space. | Choose the Storage API when reducing data transfer and improving HTAP efficiency is needed. Choose the Tunnel API for simple ETL tasks or full backups. |
| Environment requirements | OceanBase Database V4.4.0 | No special requirements. | Use the Storage API if your environment supports VPS. Otherwise, use the Tunnel API for earlier versions or simpler scenarios. |
| Data filtering capability | Supported: Data can be filtered using SQL conditions (such as WHERE), and only the required subset is transmitted. | Not supported: Full data transmission is required, and local filtering is performed. | Choose the Storage API when conditional data filtering is required (such as analyzing specific user behavior). |
Catalog external tables
OceanBase Database uses the catalog (data directory) feature to enable unified management and efficient querying of external data sources. This feature adds a Catalog-Database-Table three-layer data hierarchy, allowing direct access to table data in external data sources (such as ODPS and HMS) without manually creating mapping tables. For more information, see Catalog overview.
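To illustrate the three-layer hierarchy, a query against a catalog table could look roughly like the sketch below. The three-part catalog.database.table name is an assumption inferred from the hierarchy described above, and my_catalog, sales_db, and orders are hypothetical names; see Catalog overview for the supported syntax.

```sql
-- Hypothetical names: my_catalog is an existing catalog, sales_db is a database
-- in the external data source, and orders is a table in that database.
SELECT order_id, amount
FROM my_catalog.sales_db.orders
WHERE amount > 100;
```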
The following table describes the catalog data sources supported by OceanBase Database:
| Type | Supported version | Data source type | Table format support | Description |
|---|---|---|---|---|
| ODPS catalog | OceanBase Database V4.3.5 BP2 and later | ODPS | MaxCompute table | Suitable for querying data on Alibaba Cloud MaxCompute. |
| HMS catalog | OceanBase Database V4.4.1 and later | HMS | Hive and Iceberg tables | Manages metadata through Hive Metastore and supports the open-source data lake ecosystem. |