This topic describes how to create an external table by using SQL statements. It also provides the prerequisites, overview, and considerations for creating an external table, and gives some examples.
Overview
An external table is a logical table object. The actual data of an external table is stored in an external storage service, not in the database.
For more information about external tables, see Overview.
Prerequisites
Before you create an external table, make sure that the following conditions are met:
You have deployed an OceanBase cluster and created an Oracle tenant. For more information, see Overview.
You have connected to an Oracle tenant of OceanBase Database. For more information, see Overview.
You have the CREATE TABLE privilege. For more information about how to query the privileges of the current user, see Query the privileges of the current user. If you do not have the privilege, contact the administrator to grant it. For more information, see Grant privileges.
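If the privilege is missing, an administrator can grant it. A minimal sketch, assuming a hypothetical user named scott:

```sql
-- Hypothetical example: run as a privileged user (for example, SYS)
-- to grant the CREATE TABLE privilege to the user "scott".
GRANT CREATE TABLE TO scott;
```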
Considerations
External tables only support query operations and do not support DML operations.
If the external file accessed by the external table is deleted, the system does not return an error but returns an empty row.
Since the files accessed by the external table are managed by the external storage system, if the external storage is unavailable, querying the external table will result in an error.
Since the data of the external table is stored in an external data source, querying it involves factors such as network and file system, which may affect query performance. Therefore, when creating an external table, you need to select an appropriate data source and optimization strategy to improve query efficiency.
Create an external table by using the command-line tool
Use the CREATE EXTERNAL TABLE statement to create an external table.
Define the name of the external table
When you create an external table, you must specify its name. To avoid confusion and ambiguity, we recommend that you use specific naming rules, such as a prefix or suffix, to distinguish external tables from regular tables. For example, you can add the _csv suffix to the name of an external table.
Here is an example:
When you create an external table that stores information about students, you can name it students_csv.
CREATE EXTERNAL TABLE students_csv external_options
Notice
Because no other attributes are specified for the external table, the preceding SQL statement cannot be executed.
Define columns
You cannot define constraints such as DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY for columns of an external table.
The column types supported by external tables are the same as those supported by regular tables. For more information about the data types supported by OceanBase Database in Oracle mode, see Built-in data types.
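For illustration, here is a minimal sketch of a column list for an external table; the table and column names are hypothetical, and note that no constraints appear in the definition:

```sql
-- Hypothetical sketch: plain column definitions only, with no DEFAULT,
-- NOT NULL, UNIQUE, CHECK, PRIMARY KEY, or FOREIGN KEY clauses.
CREATE EXTERNAL TABLE orders_csv (
  order_id   NUMBER(10),
  customer   VARCHAR2(100),
  order_date DATE
)
LOCATION = '/home/admin/external_csv'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' );
```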
Define LOCATION
The LOCATION option specifies the path where the external table files are stored. Typically, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, it automatically collects all files in the directory.
OceanBase Database supports the following two path formats:
Local location format:
LOCATION = '[file://]local_file_path'

Notice
For the local location format, you must set the secure_file_priv system variable to specify the accessible path. For more information, see secure_file_priv.

Remote location format:
Notice
When you use an object storage path, the parameters of the object storage path are separated by the & symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, the characters /-_$+=, and wildcards. If you enter other characters, the setting may fail.

LOCATION = '{oss|S3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path', where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access information required for accessing Alibaba Cloud OSS, AWS S3, and object storage services compatible with the S3 protocol, and s3_region is the region selected when using S3. This sensitive access information is stored in encrypted form in the system tables of the database.

When the files are stored on HDFS, the following formats are supported:
Single-node NameNode (NN) access format:
LOCATION = 'hdfs://localhost:port/PATH', where localhost specifies the address of HDFS, port specifies the port number of HDFS, and PATH specifies the file path in HDFS.

Kerberos authentication format:

LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx', where:
principal: specifies the user for login and authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection levels to authentication and privacy.
Hadoop high availability (HA) logical name service access format:
LOCATION = 'hdfs://nameserviceID/PATH', where nameserviceID specifies the ID of the logical name service in Hadoop HA, and PATH specifies the file path.

Note
Make sure that the client OBServer configurations include the nameservice definition and failover strategy of the HA cluster.

Kerberos authentication format:

LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', where:
principal: specifies the user for login and authentication. Set this parameter to the principal of a non-primary NN.
keytab and krb5conf: same as in the single-node NN settings.
configs: specifies additional HDFS configurations. You need to set multiple configurations, which are related to HA and security:
dfs.data.transfer.protection=${string}: aligns with the dfs.data.transfer.protection configuration of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode backup IDs of the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the specific NameNode address of namenode1 for client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the specific NameNode address of namenode2 for client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: enables the HA cluster to automatically obtain an available NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for switching between primary and backup nodes in the HA cluster. You can also customize and upload the required logic for the HA cluster.
Notice
HA-related parameters are bound to the nameservice. For example, in the samples below, mycluster is bound to the nameservice. Pay attention to the relevant parameters when you set them.
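As a hedged sketch of the remote object storage format described above (the host and bucket path are placeholders, and $ACCESS_ID and $ACCESS_KEY stand for real credentials):

```sql
-- Placeholders: substitute real OSS credentials, host, and bucket path.
CREATE EXTERNAL TABLE oss_data_csv (
  id   NUMBER(10),
  name VARCHAR2(50)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@oss-cn-hangzhou.aliyuncs.com/my-bucket/data'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' );
```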
Define FORMAT
FORMAT = ( TYPE = 'CSV' ... ) specifies the format of the external file as CSV. The parameters are as follows:

TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter for the CSV file. The default value is LINE_DELIMITER = '\n'.
FIELD_DELIMITER: specifies the column delimiter for the CSV file. The default value is FIELD_DELIMITER = '\t'.
ESCAPE: specifies the escape character for the CSV file. It must be a single byte. The default value is ESCAPE = '\'.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character used to enclose field values in the CSV file. The default value is an empty string.

Notice
If the external table data file contains NULL values (actual NULL values, not the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and the value of this parameter cannot be empty.

ENCODING: specifies the character set encoding format of the file. For information about all character sets supported in Oracle mode, see Character sets and collations. If you do not specify this parameter, the default value is UTF8MB4.
NULL_IF: specifies the strings to be treated as NULL. The default value is an empty string.
SKIP_HEADER: skips the file header and specifies the number of lines to skip.
SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE, indicating that blank lines are not skipped.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. The default value is FALSE, indicating that leading and trailing spaces are not removed.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE, indicating that empty strings are not treated as NULL.
COMPRESSION: optional. Specifies the compression format of the file. Valid values are as follows:
GZIP/DEFLATE: GZIP-compressed file.
ZSTD: ZSTD-compressed file.
NONE: the file is not compressed (default value).
SNAPPY_BLOCK: Hadoop-Snappy-compressed file.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold for parallel parsing, in bytes. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length of a single row of data, in bytes. The default value is 2 MB. This parameter is used in the boundary determination phase of parallel parsing of a single CSV file and does not affect the accuracy of data reading.
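Several of the CSV parameters above can be combined in one statement. A hedged sketch, assuming a hypothetical comma-separated file with one header line and double-quoted fields:

```sql
-- Hypothetical sketch: skip one header line; fields are separated by commas
-- and optionally enclosed in double quotes.
CREATE EXTERNAL TABLE sales_csv (
  id     NUMBER(10),
  region VARCHAR2(30),
  amount NUMBER(12, 2)
)
LOCATION = '/home/admin/external_csv'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  LINE_DELIMITER = '\n'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
);
```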
FORMAT = ( TYPE = 'PARQUET' ... ) specifies the format of the external file as Parquet.
FORMAT = ( TYPE = 'ORC' ... ) specifies the format of the external file as ORC.
Note
Starting from OceanBase Database V4.6.0, the external tables for Parquet and ORC file formats support reading complex data types: Array, Map, and JSON. The data types supported by the external tables must be consistent with those supported by OceanBase Database. For more information, see Overview of built-in data types.
OceanBase Database external tables for Parquet and ORC file formats support reading Bloom filters to filter data based on query predicates, reducing unnecessary I/O and accelerating queries.
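For illustration, a minimal hedged sketch of an external table over Parquet files; the directory and columns are hypothetical, and the column types must match the data types stored in the Parquet files:

```sql
-- Hypothetical sketch: Parquet files carry their own schema, so no
-- CSV-style delimiter options are needed.
CREATE EXTERNAL TABLE events_parquet (
  event_id NUMBER(20),
  payload  VARCHAR2(4000)
)
LOCATION = '/home/admin/external_parquet'
FORMAT = ( TYPE = 'PARQUET' );
```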
(Optional) Define PATTERN
The PATTERN option specifies a regular expression pattern to filter files in the LOCATION directory. For each file path in the LOCATION directory, if it matches the pattern, the external table accesses the file; otherwise, it skips the file. If you do not specify this parameter, the default is to access all files in the LOCATION directory. The external table saves the list of files in the LOCATION directory that match the PATTERN in the database system table. When the external table scans, it accesses the external files based on this list.
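For example, to restrict an external table to the .csv files in the LOCATION directory, a regular-expression PATTERN can be used. A hedged sketch with a hypothetical directory:

```sql
-- Only file paths matching the regular expression '.*\.csv' are accessed;
-- other files in the directory are skipped.
CREATE EXTERNAL TABLE logs_csv (
  ts  VARCHAR2(30),
  msg VARCHAR2(200)
)
LOCATION = '/home/admin/external_csv'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' )
PATTERN = '.*\.csv';
```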
(Optional) Define partitions for the external table
Automatically define partitions for the external table
The external table calculates and adds partitions based on the partition key expression. When you query, you can specify the value or range of the partition key. In this case, partition pruning is performed, and the external table only reads files in the specified partition.
Manually define partitions for the external table
If you need to manually add and delete partitions instead of letting the external table manage them automatically, you must specify the PARTITION_TYPE = USER_SPECIFIED field.
Example 1
Notice
The IP addresses in the commands in this example have been desensitized. You must replace them with the actual IP address of your server when you verify the commands.
The following example describes how to create an external table in the database in the Oracle mode of OceanBase Database based on an external file stored in the local file system.
Prepare an external file.
Execute the following command to create a file named test_tbl1.csv in the /home/admin/external_csv directory on the server where the OBServer node is located.

[admin@xxx /home/admin/external_csv]# vi test_tbl1.csv

The content of the file is as follows:

1,'Emma'
2,'William'
3,'Olivia'

Set the import file path.
Notice
For security reasons, you can only modify the secure_file_priv system variable by using a local socket connection to the database. For more information, see secure_file_priv.

Execute the following command to log in to the server where the OBServer node is located.

ssh admin@10.10.10.1

Execute the following command to connect to the oracle001 tenant by using a local Unix socket connection.

obclient -S /home/admin/oceanbase/run/sql.sock -usys@oracle001 -p******

Execute the following SQL statement to set the import path to /home/admin/external_csv.

SET GLOBAL secure_file_priv = "/home/admin/external_csv";
Reconnect to the oracle001 tenant. Here is an example:

obclient -h10.10.10.1 -P2881 -usys@oracle001 -p****** -A

Execute the following SQL statement to create an external table named test_tbl1_csv.

CREATE EXTERNAL TABLE test_tbl1_csv (
  id INT,
  name VARCHAR(50)
)
LOCATION = '/home/admin/external_csv'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = ''''
)
PATTERN = 'test_tbl1.csv';

Execute the following SQL statement to query data from the test_tbl1_csv external table.

SELECT * FROM test_tbl1_csv;

The returned result is as follows:

+------+---------+
| ID   | NAME    |
+------+---------+
|    1 | Emma    |
|    2 | William |
|    3 | Olivia  |
+------+---------+
3 rows in set
Example 2
Example of creating an HDFS external table
With Kerberos:
CREATE EXTERNAL TABLE ext_data(ID NUMBER(32), NAME VARCHAR2(30),SCORE NUMBER(32))
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='"'
)
LOCATION = 'hdfs://localhost:8020/user?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
PATTERN = 'data.csv';
where:
principal: specifies the user for login authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configuration items. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection levels to authentication and privacy.
Without Kerberos:
CREATE EXTERNAL TABLE ext_data(ID NUMBER(32), NAME VARCHAR2(30),SCORE NUMBER(32))
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='"'
)
LOCATION = 'hdfs://localhost:8020/user'
PATTERN = 'test_tbl1.csv';
References
For more information about viewing and updating external files, see Manage external files.
