This topic describes how to create an external table by using SQL statements, and provides an overview, the prerequisites, considerations, and examples for creating an external table.
Overview
An external table is a logical table object that stores data in an external storage service instead of in the database.
For more information about external tables, see Overview.
Prerequisites
Before you create an external table, make sure that the following conditions are met:
You have deployed an OceanBase cluster and created a MySQL tenant. For more information, see Overview.
You have connected to a MySQL tenant of OceanBase Database. For more information, see Overview.
You have created a database. For more information, see Create a database.
You have the CREATE privilege. For information about how to view the privileges of the current user, see View user privileges. If you do not have the privilege, contact the administrator to grant it to you. For information about how to directly grant privileges, see Directly grant privileges.
Considerations
External tables support only query operations and do not support DML operations.
If the external file accessed by an external table is deleted, the system does not return an error but instead returns an empty row.
If the external storage system is unavailable, queries on external tables will return an error.
Because external table data is stored in an external data source, queries may involve network and file system overhead, which can affect query performance. When you create an external table, select an appropriate data source and optimization strategy to improve query efficiency.
Use the CLI to create an external table
Use the CREATE EXTERNAL TABLE statement to create an external table.
Define the name of the external table
When you create an external table, you must specify its name. To avoid confusion and ambiguity, we recommend that you use specific naming rules, such as a dedicated prefix or suffix, to distinguish external tables from regular tables. For example, you can add _csv as a suffix to the name of an external table.
Here is an example:
When you create an external table that stores information about students, you can name it students_csv.
CREATE EXTERNAL TABLE students_csv external_options
Notice
Because the other attributes of the external table are not specified, the preceding SQL statement cannot be executed.
Define columns
You cannot define constraints on the columns of an external table, including DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY constraints.
The column types supported by external tables are the same as those supported by regular tables. For more information about the data types supported by OceanBase Database in MySQL mode, see Data types.
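For illustration, a column definition without any constraints might look like the following sketch (the table name, columns, and path are invented for this example):

```sql
-- Illustrative only: the columns carry no constraints, because external
-- tables do not support DEFAULT, NOT NULL, UNIQUE, CHECK,
-- PRIMARY KEY, or FOREIGN KEY.
CREATE EXTERNAL TABLE orders_csv (
  order_id   INT,
  customer   VARCHAR(100),
  amount     DECIMAL(10, 2),
  order_date DATE
)
LOCATION = '/home/admin/orders'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' );
```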
Define LOCATION
The LOCATION option specifies the path where the external table files are stored. Usually, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, it automatically collects all files in the directory.
OceanBase Database supports the following two path formats:
Local location format:
LOCATION = '[file://] local_file_path'
Notice
For scenarios that use the local location format, you must set the system variable secure_file_priv to configure the accessible path. For more information, see secure_file_priv.
Remote location format:
Notice
When you use an object storage path, the parameters in the path are separated by the & symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, the characters / - _ $ + =, and wildcards. If you enter other characters, the configuration may fail.
LOCATION = '{oss|S3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'
where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access credentials required for accessing Alibaba Cloud OSS, AWS S3, or object storage compatible with the S3 protocol, and s3_region is the region selected when using S3. This sensitive access information is stored in the system tables of the database in encrypted form.
When the files are stored on HDFS, the format is as follows:
Format for accessing the cluster based on the address of a single-node NameNode (NN):
LOCATION = 'hdfs://localhost:port/PATH'
where localhost is the address of HDFS, port is the port number of HDFS, and PATH is the file path in HDFS.
Format with Kerberos authentication:
LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'
where:
principal: the user for login authentication.
keytab: the path of the key file for user authentication.
krb5conf: the path of the description file for the Kerberos environment.
configs: extra HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection level to authentication and privacy.
Format for accessing the cluster based on Hadoop High Availability (HA):
LOCATION = 'hdfs://nameserviceID/PATH'
where nameserviceID is the logical nameservice ID of Hadoop HA, and PATH is the file path.
Note
Make sure that the OBServer node that acts as the client has the nameservice definition and failover strategy for the HA cluster.
Format with Kerberos authentication:
LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
where:
principal: the user for login authentication. Set this parameter to the principal of a non-primary NN.
keytab and krb5conf: same as in the single-node NN setting.
configs: extra HDFS configurations. If you need to set multiple configurations, only HA configurations and security configurations are relevant:
dfs.data.transfer.protection=${string}: aligns with the dfs.data.transfer.protection configuration of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs of the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the RPC address of namenode1 for client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the RPC address of namenode2 for client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: enables the HA cluster to automatically obtain an available NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for switching between primary and standby nodes in the HA cluster. You can also customize and upload the required logic for the HA cluster.
Notice
HA-related parameters are bound to the nameservice. For example, if mycluster is used as the nameservice ID, make sure that the relevant parameters are set accordingly.
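The path formats above can be summarized with a few illustrative values (the hosts, credentials, ports, and paths below are placeholders, not working endpoints):

```sql
-- Local file system (requires secure_file_priv to cover the path):
LOCATION = 'file:///home/admin/data'
-- S3-compatible object storage, following the format described above:
LOCATION = 'S3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/mybucket/data'
-- HDFS with a single NameNode:
LOCATION = 'hdfs://namenode-host:8020/user/data'
-- HDFS with Hadoop HA, addressed by the logical nameservice ID:
LOCATION = 'hdfs://mycluster/user/data'
```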
Define the FORMAT clause
FORMAT = ( TYPE = 'CSV' ... ) specifies that the external file is of the CSV type. The following parameters are available:
TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the CSV file. The default value is LINE_DELIMITER = '\n'.
FIELD_DELIMITER: specifies the field delimiter of the CSV file. The default value is FIELD_DELIMITER = '\t'.
ESCAPE: specifies the escape character of the CSV file. It must be one byte. The default value is ESCAPE = '\'.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character used to enclose field values in the CSV file. The default value is empty.
Notice
If the external file contains NULL values (the non-string NULL, not "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.
ENCODING: specifies the character set encoding format of the file. For information about the character sets supported in MySQL mode, see Character sets. If this parameter is not specified, the default value is UTF8MB4.
NULL_IF: specifies the string to be treated as NULL. The default value is empty.
SKIP_HEADER: specifies the number of lines to skip from the beginning of the file.
SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE, indicating that blank lines are not skipped.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. The default value is FALSE, indicating that leading and trailing spaces are not removed.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE, indicating that empty strings are not treated as NULL.
COMPRESSION: optional. Specifies the compression format of the file. Valid values are:
GZIP/DEFLATE: GZIP-compressed file.
ZSTD: ZSTD-compressed file.
NONE: the file is not compressed (default value).
SNAPPY_BLOCK: Hadoop-Snappy-compressed file.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold for parallel parsing, in bytes. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length of a single row of data, in bytes. The default value is 2 MB. This parameter is used during the boundary determination phase of parallel parsing of a single CSV file and does not affect the accuracy of data reading.
FORMAT = ( TYPE = 'PARQUET' ... ) specifies that the external file is of the PARQUET type.
FORMAT = ( TYPE = 'ORC' ... ) specifies that the external file is of the ORC type.
Note
- Starting from OceanBase Database V4.6.0, external tables in the Parquet or ORC file format support reading complex data types such as Array, Map, and JSON. The data types used must be consistent with those supported by OceanBase Database. For more information, see Overview of data types.
- OceanBase Database supports reading Bloom filters from external tables in the Parquet or ORC file format. Bloom filters filter data based on query predicates to reduce unnecessary I/O operations and accelerate queries.
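To illustrate how these parameters combine, a FORMAT clause for a GZIP-compressed, comma-separated file with one header row might look like the following sketch (the parameter values are examples chosen for illustration, not recommendations):

```sql
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  LINE_DELIMITER = '\n'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  EMPTY_FIELD_AS_NULL = TRUE
  COMPRESSION = 'GZIP'
)
```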
(Optional) Define the PATTERN clause
The PATTERN option specifies a regular expression pattern to filter files in the LOCATION directory. For each file path in the LOCATION directory, the external table accesses the file if its path matches the pattern and skips it otherwise. If this option is not specified, all files in the LOCATION directory are accessible by default. The external table saves the list of files in the LOCATION directory that match the PATTERN in a database system table. When the external table is scanned, the external files are accessed based on this list.
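For example, a pattern that restricts an external table to the .csv files under LOCATION could be written as follows (the table definition and paths are illustrative):

```sql
CREATE EXTERNAL TABLE logs_csv (
  ts  DATETIME,
  msg VARCHAR(255)
)
LOCATION = '/home/admin/logs'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' )
PATTERN = '.*\.csv';
```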
(Optional) Define partitions for the external table
Automatically define partitions for the external table
The external table calculates and adds partitions based on the expressions defined for the partitioning key. When querying, you can specify the value or range of the partitioning key. In this case, partition pruning is performed, and the external table only reads files in the specified partition.
Manually define partitions for the external table
If you need to manually add and delete partitions instead of letting the external table manage them automatically, you must specify the PARTITION_TYPE = USER_SPECIFIED field.
Example
Notice
The IP address in the example is masked. Replace it with the actual IP address of your server.
This example shows how to create an external table in the MySQL mode of OceanBase Database based on an external file stored on the local server. The steps are as follows:
Prepare an external file.
Run the following command to create a file named test_tbl1.csv in the /home/admin directory on the server where the OBServer node is located.
[admin@xxx /home/admin]# vi test_tbl1.csv
The content of the file is as follows:
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'
Set the import file path.
Notice
For security reasons, you can modify the secure_file_priv system variable only by using a local socket connection. For more information, see secure_file_priv.
Run the following command to log in to the server where the OBServer node is located.
ssh admin@10.10.10.1
Run the following command to connect to the mysql001 tenant by using a local Unix socket connection.
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@mysql001 -p******
Run the following SQL statement to set the import path to /home/admin.
SET GLOBAL secure_file_priv = "/home/admin";
Reconnect to the mysql001 tenant.
Here is an example:
obclient -h10.10.10.1 -P2881 -uroot@mysql001 -p****** -A -Dtest
Run the following SQL statement to create an external table named test_tbl1_csv.
CREATE EXTERNAL TABLE test_tbl1_csv (
  id INT,
  name VARCHAR(50),
  c_date DATE
)
LOCATION = '/home/admin'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '\''
)
PATTERN = 'test_tbl1.csv';
Run the following SQL statement to query the data in the test_tbl1_csv external table.
SELECT * FROM test_tbl1_csv;
The return result is as follows:
+------+---------+------------+
| id   | name    | c_date     |
+------+---------+------------+
|    1 | Emma    | 2021-09-01 |
|    2 | William | 2021-09-02 |
|    3 | Olivia  | 2021-09-03 |
+------+---------+------------+
3 rows in set
Example 2
Example of creating an HDFS external table
With Kerberos:
CREATE EXTERNAL TABLE ext_data (
id INT,
name VARCHAR(50),
c_date DATE
)
LOCATION = 'hdfs://localhost:8020/user?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='"'
)
PATTERN = 'data.csv';
Where:
principal: specifies the user for login authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configuration items. By default, it is empty. However, in a Kerberos environment, this configuration item usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection level to authentication and privacy.
Without Kerberos:
CREATE EXTERNAL TABLE ext_data (
id INT,
name VARCHAR(50),
c_date DATE
)
LOCATION = 'hdfs://localhost:8020/user'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='\''
)
PATTERN = 'test_tbl1.csv';
References
- You can use the same method as deleting a regular table to delete an external table. For more information about how to delete a table, see Delete a table.
- For more information about how to view and update external files, see Manage external files.
