Purpose
This statement is used to create a new external table in the database.
External tables are a key feature of many database management systems. Typically, data in a database is stored in the database's own storage space, while the data of an external table resides in an external storage service.
When creating an external table, you need to define the file path and file format for the data. Once created, users can read data from the external storage service through the external table. External tables are read-only and can be used in query statements, but they do not support DML operations. External tables also do not support defining constraints or creating indexes.
Syntax
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = {'file_path' | @location_name['/path']}
FORMAT = (format_type_options)
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = '<regex_pattern>']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS expr]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
type_orc_option:
TYPE = 'ORC'
Parameters
| Parameter | Description |
|---|---|
| table_name | The name of the external table to be created. |
| column_name | The name of the column in the external table. By default, the data columns in the file are automatically mapped to the columns defined in the external table in sequence. |
| column_type | The data type of the column in the external table. Constraints such as DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY cannot be defined. |
| AS expr | Used to manually specify column mapping. When the order of columns in the file does not match the order of columns defined in the external table, you can use the pseudo column metadata$filecol{N} to specify the mapping between the external table column and the Nth column in the file. For example, c2 INT AS (metadata$filecol4) indicates that the c2 column in the external table corresponds to the 4th column in the file. Note that if manual column mapping is specified, automatic mapping will be disabled, and all columns must be manually defined. |
| LOCATION | Specifies the path where the external table files are stored. Typically, the data files of an external table are stored in a separate directory, which may contain subdirectories. When the table is created, the external table automatically collects all files in this directory. For the valid path formats, see the file_path section below. |
| FORMAT = (format_type_options) | Specifies the properties of the external file format. Use TYPE to specify the file format as CSV, PARQUET, or ORC. TYPE cannot be empty. For more information, see format_type_options below. |
| PATTERN | Specifies a regular expression pattern to filter files in the LOCATION directory. For each file path in the LOCATION directory, if it matches the pattern, the external table will access the file; otherwise, it will skip the file. If this parameter is not specified, all files in the LOCATION directory are accessible by default. The external table will save the list of files that match the PATTERN in the database system table. When scanning the external table, it will access the external files based on this list. |
| PARTITION_TYPE = USER_SPECIFIED | Specify this option if you want to manually add and remove partitions instead of letting the external table manage partitions automatically. |
| AUTO_REFRESH = 'xxx' | Specifies whether to enable automatic refresh for the external table. |
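To make the `AS expr` column-mapping rule concrete, here is a minimal hedged sketch. The table name, file path, and file layout are hypothetical; the `metadata$filecol{N}` pseudo column is taken from the parameter description above. Remember that once any mapping is specified manually, automatic sequential mapping is disabled and every column must be mapped.

```sql
-- Hypothetical file layout: the table's id comes from the 1st file column,
-- and score from the 4th. All columns are mapped explicitly because
-- manual mapping disables automatic sequential mapping.
CREATE EXTERNAL TABLE ext_scores (
  id    INT    AS (metadata$filecol1),  -- 1st column in the file
  score NUMBER AS (metadata$filecol4)   -- 4th column in the file
)
LOCATION = '/home/admin/ext_scores/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = 'scores.csv';
```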
file_path
The file path for external tables can be in the following formats:
- If the file is stored locally, the `LOCATION` format is `LOCATION = '[file://] local_file_path'`, where `local_file_path` can be a relative or absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database. The `secure_file_priv` parameter specifies the file paths that OBServer nodes can access, so `local_file_path` must be a subpath of the `secure_file_priv` path.
- If the file is stored remotely, the `LOCATION` format depends on the storage service.

  Notice

  When using an object storage path, the parameters are separated by the `&` symbol. Ensure that your input parameters contain only uppercase and lowercase letters, digits, `/-_$+=`, and wildcards. If you input any other characters, the settings may fail.

  - If the file is stored on Alibaba Cloud OSS or AWS S3, the format is `LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'`, where `$ACCESS_ID`, `$ACCESS_KEY`, and `$HOST` are the access credentials required to access Alibaba Cloud OSS, AWS S3, or an object storage service compatible with the S3 protocol, and `s3_region` specifies the region selected for S3. These sensitive access credentials are stored in encrypted form in the system tables of the database.
  - If the file is stored on HDFS, the format is as follows:
- To access the cluster using the IP address of a single NameNode (NN), the format is `LOCATION = 'hdfs://localhost:port/PATH'`, where `localhost` is the IP address of HDFS, `port` is the port number of HDFS, and `PATH` is the file path in HDFS.

  For Kerberos authentication, the format is `LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'`, where:

  - `principal`: the user for login and authentication.
  - `keytab`: the path to the user's authentication key file.
  - `krb5conf`: the path to the Kerberos environment description file.
  - `configs`: additional HDFS configuration items. By default, this parameter is empty. In a Kerberos environment, however, it typically needs to be set, for example `dfs.data.transfer.protection=authentication,privacy`, which sets the data transfer protection levels to `authentication` and `privacy`.
- To access the cluster using a Hadoop High Availability (HA) logical naming service, the format is `LOCATION = 'hdfs://nameserviceID/PATH'`, where `nameserviceID` is the ID of the Hadoop HA logical naming service, and `PATH` is the file path.

  Note

  Ensure that the client-side configuration of the OBServer node includes the `nameservice` definition and failover strategy for the HA cluster.

  For Kerberos authentication, the format is:

  `LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'`

  where:

  - `principal`: the user for login and authentication, set to the `principal` of a non-primary NameNode.
  - `keytab` and `krb5conf`: same as for a single NameNode.
  - `configs`: additional HDFS configuration items. When multiple items are needed, they are typically HA and security configurations:
    - `dfs.data.transfer.protection=${string}`: aligns with the cluster's `dfs.data.transfer.protection` configuration.
    - `dfs.nameservices=${nameservice id}`: specifies the `nameservice` (alias) of the current HA cluster.
    - `dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}`: lists the IDs of the HA cluster's NameNodes.
    - `dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}`: specifies the RPC address of `namenode1` so that clients can route requests to it.
    - `dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}`: specifies the RPC address of `namenode2` so that clients can route requests to it.
    - `dfs.ha.automatic-failover.enabled.${nameservice id}=true`: enables the HA cluster to automatically obtain an available NameNode to respond to requests.
    - `dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`: specifies the proxy class that handles primary/standby switching in the HA cluster. You can also customize and upload the failover logic required for your HA cluster.
Notice

HA-related configuration items are bound to the nameservice. For example, Example 4 below uses `mycluster`. Ensure that the related configuration items are set correctly.
format_type_options
- `TYPE = 'CSV'`: specifies the external file format as CSV. It also includes the following fields:
  - `LINE_DELIMITER`: specifies the line delimiter for the CSV file.
  - `FIELD_DELIMITER`: specifies the column delimiter for the CSV file.
  - `ESCAPE`: specifies the escape character for the CSV file. It must be a single byte.
  - `FIELD_OPTIONALLY_ENCLOSED_BY`: specifies the symbol used to enclose field values in the CSV file. The default value is empty.

    Notice

    If the external table data file contains `NULL` values (actual NULL values, not the string "NULL"), you must explicitly configure the `FIELD_OPTIONALLY_ENCLOSED_BY` parameter, and its value cannot be empty.

  - `ENCODING`: specifies the character set encoding of the file. For information about all character sets supported in Oracle-compatible mode, see Character set. If not specified, the default value is UTF8MB4.
  - `NULL_IF`: specifies the strings to be treated as `NULL`. The default value is empty.
  - `SKIP_HEADER`: skips the file header; specifies the number of lines to skip.
  - `SKIP_BLANK_LINES`: specifies whether to skip blank lines. The default value is `FALSE`, indicating that blank lines are not skipped.
  - `TRIM_SPACE`: specifies whether to remove leading and trailing spaces from fields in the file. The default value is `FALSE`, indicating that they are not removed.
  - `EMPTY_FIELD_AS_NULL`: specifies whether to treat empty strings as `NULL`. The default value is `FALSE`, indicating that empty strings are not treated as `NULL`.
  - `IGNORE_LAST_EMPTY_COLUMN`: specifies whether to ignore a trailing empty field at the end of a line (that is, when a column delimiter immediately precedes the line delimiter). The default value is `TRUE`, indicating that the trailing empty field is ignored.
- `TYPE = 'PARQUET'`: specifies the external file format as `PARQUET`.
- `TYPE = 'ORC'`: specifies the external file format as `ORC`.
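To illustrate how several of the CSV options above combine, here is a minimal hedged sketch. The table name, path, and file name are hypothetical, and the option values are chosen purely for illustration:

```sql
-- Hypothetical CSV table: skip a 1-line header, treat '\N' and
-- empty fields as NULL, and trim surrounding spaces from each field.
CREATE EXTERNAL TABLE ext_csv_demo (
  c1 INT,
  c2 VARCHAR(50)
)
LOCATION = '/home/admin/csv_demo/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('\N')
  SKIP_HEADER = 1
  TRIM_SPACE = TRUE
  EMPTY_FIELD_AS_NULL = TRUE
)
PATTERN = 'demo.csv';
```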
Considerations
- When an external file is deleted, the external table ignores the missing file in the file list.
- When an external file is modified, the external table accesses the latest data of the external file. If the modification of the external file and the query of the external table are performed concurrently, unexpected results may occur. Therefore, avoid modifying the external file while querying the external table.
- When a new file is added to the external directory, the external table only accesses the files already recorded in its file list. To make the new file visible, you need to refresh the file list of the external table.
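As a hedged sketch of the file-list refresh mentioned in the last point (the exact statement form may vary by database version, and `ext_tbl` is a hypothetical table name):

```sql
-- Rescan LOCATION and update the external table's file list
-- so that newly added files matching PATTERN become visible.
ALTER EXTERNAL TABLE ext_tbl REFRESH;
```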
Examples
Example 1
Prepare the data. First, set the `secure_file_priv` parameter to `/home/admin/`, and place the CSV file `extdata.csv` containing the external table data in the `/home/admin/test` directory of the current OBServer node.

Here is an example of setting the global secure file path:

```shell
obclient> SET GLOBAL secure_file_priv = "/home/admin/";
Query OK, 0 rows affected

obclient> \q
Bye
```
Note

Since `secure_file_priv` is a `GLOBAL` variable, you need to exit the current session with `\q` and reconnect for the new value to take effect.

The content of the CSV file is as follows:
```
1,'Dave','Smith','dsmith@outlook.com','friend',32
2,'Xena','Johnson','xjonson@outlook.com','contact',45
3,'Fred','Jackon','fjackson@outlook.com','co-worker',19
4,'Alma','Tyler','atyler@outlook.com','friend',53
```

After logging in to the database as a user of a user tenant, create the external table `contacts`.

```sql
obclient> CREATE EXTERNAL TABLE contacts (
    id        INT,
    firstname VARCHAR(100),
    lastname  VARCHAR(100),
    email     VARCHAR(255),
    category  CHAR(30),
    age       NUMBER
  )
  LOCATION = '/home/admin/test/'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = ''''
  )
  PATTERN = 'extdata.csv';
```

Query the data in the external table `contacts`.

```sql
obclient> SELECT * FROM contacts;
+------+-----------+----------+----------------------+-----------+------+
| ID   | FIRSTNAME | LASTNAME | EMAIL                | CATEGORY  | AGE  |
+------+-----------+----------+----------------------+-----------+------+
|    1 | Dave      | Smith    | dsmith@outlook.com   | friend    |   32 |
|    2 | Xena      | Johnson  | xjonson@outlook.com  | contact   |   45 |
|    3 | Fred      | Jackon   | fjackson@outlook.com | co-worker |   19 |
|    4 | Alma      | Tyler    | atyler@outlook.com   | friend    |   53 |
+------+-----------+----------+----------------------+-----------+------+
4 rows in set
```
Example 2
Assume that there is a CSV file in a certain file path on HDFS. The content of the file is as follows:
```shell
$ hdfs dfs -cat /user/test_tbl1.csv
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'
```

Create an external table on HDFS.
If Kerberos authentication is not enabled in the target HDFS environment:

```sql
CREATE EXTERNAL TABLE test_tbl1_csv_oracle (
  id     INT,
  name   VARCHAR(50),
  c_date DATE
)
LOCATION = 'hdfs://${hadoop_namenode_hostname}:${hadoop_namenode_port}/user'
FORMAT = (
  TYPE = 'CSV',
  FIELD_DELIMITER = ',',
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'test_tbl1.csv';
```

Notice

Here, `hadoop_namenode_hostname` and `hadoop_namenode_port` refer to the hostname and port of the HDFS NameNode and must be replaced with actual values.

If Kerberos authentication is enabled in the target HDFS environment:
Notice

If Kerberos authentication is enabled, you need to deploy the keytab file (the user's authentication key file) and the krb5conf file on the corresponding OBServer node.

```sql
CREATE EXTERNAL TABLE partsupp (
  PS_PARTKEY    INTEGER,
  PS_SUPPKEY    INTEGER,
  PS_AVAILQTY   INTEGER,
  PS_SUPPLYCOST DECIMAL(15,2),
  PS_COMMENT    VARCHAR(199)
)
LOCATION = 'hdfs://localhost:8020/tpch_csv?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'partsupp.tbl';
```

Query the data in the external table.
```sql
SELECT * FROM test_tbl1_csv_oracle;
```

The query result is as follows:

```
+----+----------+------------+
| id | name     | c_date     |
+----+----------+------------+
|  1 | Emma     | 2021-09-01 |
|  2 | William  | 2021-09-02 |
|  3 | Olivia   | 2021-09-03 |
+----+----------+------------+
3 rows in set
```
Example 3
Assume that an ORC file named `data.orc` is stored in the `/home/admin/orc_test/` directory.

Create an external table in ORC format.

```sql
obclient> CREATE EXTERNAL TABLE ext_table_orc (
    id   NUMBER,
    name VARCHAR2(50)
  )
  LOCATION = '/home/admin/orc_test/'
  FORMAT = ( TYPE = 'ORC' )
  PATTERN = 'data.orc';
```

Note
You can specify `FORMAT = ( TYPE = 'ORC' )` to let the external table directly read ORC files. Make sure that the actual format of the external data file matches the declared format; otherwise, parsing will fail.
Example 4
Assume that there is a CSV file in a certain file path on HDFS. The content of the file is as follows:
```shell
$ hdfs dfs -cat /hadoop_ha_test/test_simple.csv
1,lili,19
2,alic,20
3,solvi,21
```

Enable the JNI-related configuration items on the OBServer node.
For more information about the JNI configuration, see Deploy the OceanBase Database Java SDK environment.
Create an external table on HDFS.

```sql
obclient> CREATE EXTERNAL TABLE test_ha (
    id     INT,
    r_name VARCHAR(100),
    age    INT
  )
  LOCATION = 'hdfs://mycluster/hadoop_ha_test?principal=ha/xxx@xxx.com&keytab=/path/to/ha.keytab&krb5conf=/path/to/krb5conf_file&configs=dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=localhost1:port#dfs.namenode.rpc-address.mycluster.nn2=localhost2:port#dfs.ha.automatic-failover.enabled.mycluster=true#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'test_simple.csv';
```

Query the data in the external table.
```sql
obclient> SELECT * FROM test_ha;
```

The query result is as follows:

```
+----+--------+-------+
| ID | R_NAME | AGE   |
+----+--------+-------+
|  1 | lili   |    19 |
|  2 | alic   |    20 |
|  3 | solvi  |    21 |
+----+--------+-------+
3 rows in set
```