This statement is used to create a new external table in the database.
External tables are a key feature of a database management system. Typically, tables in a database are stored in the database's storage space, while the data of external tables is stored in an external storage service.
When creating an external table, you define the file path and file format of the data. Users can then read the data from the external storage service through the external table. External tables are read-only: they can be used in query statements, but DML operations cannot be performed on them, and they do not support defining constraints or creating indexes.
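For instance, an external table can be queried and joined like an internal table, while any DML against it fails. A minimal sketch, assuming a hypothetical external table `ext_orders` and an internal table `customers`:

```sql
-- Read-only access: querying and joining work as usual.
SELECT c.name, o.order_total
  FROM customers c
  JOIN ext_orders o ON o.customer_id = c.id;

-- DML is rejected on external tables, e.g.:
-- INSERT INTO ext_orders VALUES (1, 42, 99.9);  -- fails with an error
```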
Syntax
```sql
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = {'file_path' | @location_name['/path']}
FORMAT = (format_type_options)
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = '<regex_pattern>']
[AUTO_REFRESH = 'xxx'];

column_definition_list:
    column_definition [, column_definition ...]

column_definition:
    column_name column_type [AS expr]

format_type_options:
    type_csv_option
  | type_parquet_option
  | type_orc_option

type_csv_option:
    TYPE = 'CSV'
    LINE_DELIMITER = '<string>' | <expr>
    FIELD_DELIMITER = '<string>' | <expr>
    ESCAPE = '<character>' | <expr>
    FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
    ENCODING = 'charset'
    NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
    SKIP_HEADER = <int>
    SKIP_BLANK_LINES = {TRUE | FALSE}
    TRIM_SPACE = {TRUE | FALSE}
    EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
    IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
    COMPRESSION = {GZIP | ZSTD | DEFLATE | NONE | SNAPPY_BLOCK}
    PARALLEL_PARSE_ON_SINGLE_FILE = {TRUE | FALSE}
    PARALLEL_PARSE_FILE_SIZE_THRESHOLD = <int>
    MAX_ROW_LENGTH = <int>

type_parquet_option:
    TYPE = 'PARQUET'

type_orc_option:
    TYPE = 'ORC'
```
Parameters
| Parameter | Description |
|---|---|
| table_name | The name of the external table to be created. |
| column_name | The name of the column in the external table. By default, the data columns in the file are automatically mapped to the columns defined in the external table in sequence. |
| column_type | Defines the column type of the external table, but cannot define constraints (e.g., DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, FOREIGN KEY, etc.). |
| AS expr | Used to manually specify column mappings. When the order of columns in the file does not match the order of columns defined in the external table, you can use the pseudo column metadata$filecol{N} to specify the mapping between the external table's columns and the Nth column in the file. For example, c2 INT AS (metadata$filecol4) indicates that the c2 column in the external table corresponds to the 4th column in the file. Note that if any manual column mapping is specified, automatic mapping is disabled, and all columns must be manually defined with mappings (see the sketch after this table). |
| LOCATION | Specifies the path where the external table files are stored. Typically, the data files of an external table are stored in a dedicated directory, which may contain subdirectories. When the table is created, the external table automatically collects all files in this directory. For the valid path formats, see file_path below. |
| FORMAT = (format_type_options) | Specifies the properties of the external file format. Use TYPE to specify the file format as CSV, PARQUET, or ORC. The TYPE parameter cannot be empty. For more information, see format_type_options below. |
| PATTERN | Specifies a regular expression pattern to filter files in the LOCATION directory. For each file path in the LOCATION directory, if it matches the pattern, the external table will access the file; otherwise, it will skip the file. If this parameter is not specified, the external table will access all files in the LOCATION directory by default. The external table will save the list of files in the LOCATION directory that match the PATTERN in the database system table. When the external table scans, it will access the external files based on this list. |
| PARTITION_TYPE = USER_SPECIFIED | Specify this option when you need to manually add and delete partitions instead of letting the external table manage partitions automatically. |
| AUTO_REFRESH = 'xxx' | Specifies whether to enable automatic refresh for the external table. |
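The following is a minimal sketch of manual column mapping (the table name, file path, and file layout are hypothetical). Because both columns use `metadata$filecol{N}` pseudo columns, automatic mapping is disabled and every column carries an explicit mapping:

```sql
-- Hypothetical file layout: column 1 holds a name, column 2 an ID.
-- Map the table's c1 to the 2nd file column and c2 to the 1st.
CREATE EXTERNAL TABLE ext_mapped (
    c1 INT          AS (metadata$filecol2),
    c2 VARCHAR2(50) AS (metadata$filecol1)
  )
  LOCATION = '/home/admin/test/'
  FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',')
  PATTERN = 'mapped.csv';
```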
file_path
The following describes the file path formats for external tables.

- If the file is stored locally, the `LOCATION` parameter is in the `LOCATION = '[file://] local_file_path'` format, where `local_file_path` can be a relative or absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database. The `secure_file_priv` parameter specifies the file path that the OBServer node has access to, and `local_file_path` can only be a subpath of the `secure_file_priv` path.

- If the file is stored remotely, the `LOCATION` parameter is in one of the following formats.

  **Notice**

  When you use an object storage path, the parameters of the object storage path are separated by the `&` symbol. Make sure that the parameter values contain only uppercase and lowercase letters, digits, the characters `/-_$+=`, and wildcards. If you enter other characters, the configuration may fail.

  - If the file is stored on Alibaba Cloud OSS or AWS S3, the `LOCATION` parameter is in the `LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'` format, where `$ACCESS_ID`, `$ACCESS_KEY`, and `$HOST` are the access information required to access Alibaba Cloud OSS, AWS S3, or an object storage service compatible with the S3 protocol, and `s3_region` specifies the region when you use S3. This sensitive access information is stored in the system tables of the database in encrypted form.

  - If the file is stored on HDFS, the `LOCATION` parameter is in one of the following formats:

    - If you access the cluster by using the IP address of a single NameNode (NN), the `LOCATION` parameter is in the `LOCATION = 'hdfs://localhost:port/PATH'` format, where `localhost` specifies the IP address of HDFS, `port` specifies the port number of HDFS, and `PATH` specifies the file path in HDFS.

    - If you access the cluster by using Kerberos authentication, the `LOCATION` parameter is in the `LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'` format, where:

      - `principal`: specifies the user for login authentication.
      - `keytab`: specifies the path of the key file for user authentication.
      - `krb5conf`: specifies the path of the description file for the Kerberos environment.
      - `configs`: specifies additional HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment it usually has a value that must be configured. For example, `dfs.data.transfer.protection=authentication,privacy` sets the data transmission protection level to `authentication` and `privacy`.

    - If you access the cluster by using the logical name service of Hadoop high availability (HA), the `LOCATION` parameter is in the `LOCATION = 'hdfs://nameserviceID/PATH'` format, where `nameserviceID` specifies the ID of the logical name service of the Hadoop HA cluster, and `PATH` specifies the file path.

      **Note**

      Make sure that the client side of the OBServer node contains the `nameservice` definition and failover strategy of the HA cluster.

    - If you access the HA cluster by using Kerberos authentication, the `LOCATION` parameter is in the `LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'` format, where:

      - `principal`: specifies the user for login authentication. Set this parameter to the `principal` of a non-primary NN.
      - `keytab` and `krb5conf`: the same as in the single-node NN configuration.
      - `configs`: specifies additional HDFS configurations. Among the configurations that can be set, only the HA configurations and the security configurations related to Kerberos matter here:
        - `dfs.data.transfer.protection=${string}`: specifies the `dfs.data.transfer.protection` configuration of the cluster.
        - `dfs.nameservices=${nameservice id}`: specifies the `nameservice` (alias) of the current HA cluster.
        - `dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}`: specifies the list of NameNode IDs of the HA cluster.
        - `dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}`: specifies the address of `namenode1` for the client to route to.
        - `dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}`: specifies the address of `namenode2` for the client to route to.
        - `dfs.ha.automatic-failover.enabled.${nameservice id}=true`: specifies that the HA cluster automatically obtains a valid NameNode to respond to the requests it receives.
        - `dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`: specifies the logic class that the HA cluster uses to switch between primary and standby nodes. You can also customize the package and upload the logic required by the HA cluster.

      **Notice**

      Some HA configurations are bound to the `namespace`. For example, in Example 4 below, `mycluster` is the `namespace`. Make sure that the related configurations are set.
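As an illustration of the remote format above, a minimal Alibaba Cloud OSS / S3 example might look like the following (a sketch; the access ID, key, host, region, bucket, and path are placeholders you must replace):

```sql
-- Placeholders only: $ACCESS_ID, $ACCESS_KEY, $HOST, the region, and the
-- bucket path must be replaced with real values for your object storage.
CREATE EXTERNAL TABLE ext_s3_events (
    event_id INT,
    payload  VARCHAR2(4000)
  )
  LOCATION = 's3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/mybucket/events/'
  FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',');
```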
format_type_options
- `TYPE = 'CSV'`: specifies the format of the external file as CSV. It also includes the following fields:

  - `LINE_DELIMITER`: specifies the line delimiter of the CSV file.
  - `FIELD_DELIMITER`: specifies the field delimiter of the CSV file.
  - `ESCAPE`: specifies the escape character of the CSV file. It must be a single byte.
  - `FIELD_OPTIONALLY_ENCLOSED_BY`: specifies the character used to enclose field values in the CSV file. The default value is empty.

    **Notice**

    When the external table data file contains `NULL` values (actual `NULL` values, not the string "NULL"), you must explicitly configure the `FIELD_OPTIONALLY_ENCLOSED_BY` parameter, and its value cannot be empty.

  - `ENCODING`: specifies the character set encoding of the file. For all character sets supported in Oracle mode, see Character set. If not specified, the default value is `UTF8MB4`.
  - `NULL_IF`: specifies the strings to be treated as `NULL`. The default value is empty.
  - `SKIP_HEADER`: skips the file header; specifies the number of rows to skip.
  - `SKIP_BLANK_LINES`: specifies whether to skip blank lines. The default value is `FALSE`, indicating that blank lines are not skipped.
  - `TRIM_SPACE`: specifies whether to remove leading and trailing spaces from fields in the file. The default value is `FALSE`, indicating that they are not removed.
  - `EMPTY_FIELD_AS_NULL`: specifies whether to treat empty strings as `NULL`. The default value is `FALSE`, indicating that empty strings are not treated as `NULL`.
  - `IGNORE_LAST_EMPTY_COLUMN`: specifies whether to ignore a trailing empty field at the end of a line in the file. The default value is `TRUE`, indicating that the trailing empty field is ignored.
  - `COMPRESSION`: optional. Specifies the compression format of the file. Valid values:
    - `GZIP`/`DEFLATE`: GZIP-compressed file.
    - `ZSTD`: ZSTD-compressed file.
    - `NONE`: the file is not compressed (default value).
    - `SNAPPY_BLOCK`: Hadoop-Snappy-compressed file.
  - `PARALLEL_PARSE_ON_SINGLE_FILE`: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is `TRUE`.
  - `PARALLEL_PARSE_FILE_SIZE_THRESHOLD`: specifies the threshold file size above which a file may be parsed in parallel, in bytes. The default value is 256 MB.
  - `MAX_ROW_LENGTH`: specifies the maximum length of a single row of data, in bytes. The default value is 2 MB. It is used in the boundary determination stage of parallel parsing of a single CSV file and does not affect the accuracy of data reading.

- `TYPE = 'PARQUET'`: specifies the format of the external file as `PARQUET`.
- `TYPE = 'ORC'`: specifies the format of the external file as `ORC`.
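Putting several of these options together, a table over a gzip-compressed CSV file with a one-row header might look like the following (a sketch; the table name, path, and option values are illustrative, not defaults):

```sql
-- Sketch: a headered, gzip-compressed CSV (names and path are hypothetical).
CREATE EXTERNAL TABLE ext_headered_csv (
    id   INT,
    note VARCHAR2(200)
  )
  LOCATION = '/home/admin/test/'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1          -- the first row is a header; skip it
    NULL_IF = ('NULL')       -- treat the string 'NULL' as NULL
    TRIM_SPACE = TRUE        -- strip leading/trailing spaces from fields
    COMPRESSION = GZIP       -- the data files are gzip-compressed
  )
  PATTERN = 'headered.*\.csv\.gz';
```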
Considerations
- When an external file is deleted, the external table ignores files that are in its file list but no longer exist in the LOCATION directory.
- When an external file is modified, the external table accesses the latest data in the file. If an external file is modified concurrently with a query on the external table, unexpected results may occur, so avoid modifying external files while querying the external table.
- When new files are added to the external directory, the external table accesses only the files recorded in its file list. To include the new files, you need to update the external table's file list, as sketched below.
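A minimal sketch of such an update, assuming the `ALTER EXTERNAL TABLE ... REFRESH` statement is available in your version and reusing the `contacts` table from Example 1:

```sql
-- Rescan the LOCATION directory and update the stored file list
-- so that newly added files become visible to queries.
ALTER EXTERNAL TABLE contacts REFRESH;
```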
Examples
Example 1
Prepare the data. First, set the `secure_file_priv` path to `/home/admin/` and place the CSV file `extdata.csv` that holds the external table data in the `/home/admin/test` directory of the current OBServer node.

Here is an example of setting the global secure file path:

```shell
obclient> SET GLOBAL secure_file_priv = "/home/admin/";
Query OK, 0 rows affected

obclient> \q
Bye
```

**Note**

Since `secure_file_priv` is a `GLOBAL` variable, you need to execute `\q` to exit and reconnect for the new value to take effect.

The content of the CSV file is as follows:

```
1,'Dave','Smith','dsmith@outlook.com','friend',32
2,'Xena','Johnson','xjonson@outlook.com','contact',45
3,'Fred','Jackon','fjackson@outlook.com','co-worker',19
4,'Alma','Tyler','atyler@outlook.com','friend',53
```

After logging in to the database as a user of the user tenant, create an external table named `contacts`.

```sql
obclient> CREATE EXTERNAL TABLE contacts (
    id        INT,
    firstname VARCHAR(100),
    lastname  VARCHAR(100),
    email     VARCHAR(255),
    category  CHAR(30),
    age       NUMBER
  )
  LOCATION = '/home/admin/test/'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = ''''
  )
  PATTERN = 'extdata.csv';
```

Query the data in the `contacts` external table.

```sql
obclient> SELECT * FROM contacts;
+------+-----------+----------+----------------------+-----------+------+
| ID   | FIRSTNAME | LASTNAME | EMAIL                | CATEGORY  | AGE  |
+------+-----------+----------+----------------------+-----------+------+
|    1 | Dave      | Smith    | dsmith@outlook.com   | friend    |   32 |
|    2 | Xena      | Johnson  | xjonson@outlook.com  | contact   |   45 |
|    3 | Fred      | Jackon   | fjackson@outlook.com | co-worker |   19 |
|    4 | Alma      | Tyler    | atyler@outlook.com   | friend    |   53 |
+------+-----------+----------+----------------------+-----------+------+
4 rows in set
```

Create an external table named `test_tbl1_csv` whose `FORMAT` option includes the `COMPRESSION`, `PARALLEL_PARSE_ON_SINGLE_FILE`, `PARALLEL_PARSE_FILE_SIZE_THRESHOLD`, and `MAX_ROW_LENGTH` parameters.

```sql
obclient> CREATE EXTERNAL TABLE test_tbl1_csv (
    col1 INT,
    col2 VARCHAR(100),
    col3 VARCHAR(100),
    col4 VARCHAR(255),
    col5 CHAR(30),
    col6 NUMBER
  )
  LOCATION = '/home/admin/test/'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = ''''
    COMPRESSION = SNAPPY_BLOCK
    PARALLEL_PARSE_ON_SINGLE_FILE = TRUE
    PARALLEL_PARSE_FILE_SIZE_THRESHOLD = 1024
    MAX_ROW_LENGTH = 1024
  )
  PATTERN = 'extdata.csv';
```
Example 2
Assume that a CSV file exists in a certain file path on HDFS, with the following content:

```shell
$hdfs dfs -cat /user/test_tbl1.csv
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'
```

Create an HDFS external table.

- If Kerberos authentication is not enabled in the target HDFS environment:

  ```sql
  CREATE EXTERNAL TABLE test_tbl1_csv_oracle (
      id INT,
      name VARCHAR(50),
      c_date DATE
    )
    LOCATION = 'hdfs://${hadoop_namenode_hostname}:${hadoop_namenode_port}/user'
    FORMAT = (
      TYPE = 'CSV',
      FIELD_DELIMITER = ',',
      FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    )
    PATTERN = 'test_tbl1.csv';
  ```

  **Notice**

  Here, `hadoop_namenode_hostname` and `hadoop_namenode_port` are the hostname and port of the HDFS NameNode and must be replaced with actual values.

- If Kerberos authentication is enabled in the target HDFS environment:

  **Notice**

  If Kerberos authentication is enabled, you need to deploy the corresponding keytab (user authentication key file) and krb5conf file on the OBServer node.

  ```sql
  CREATE EXTERNAL TABLE partsupp (
      PS_PARTKEY    INTEGER,
      PS_SUPPKEY    INTEGER,
      PS_AVAILQTY   INTEGER,
      PS_SUPPLYCOST DECIMAL(15,2),
      PS_COMMENT    VARCHAR(199)
    )
    LOCATION = 'hdfs://localhost:8020/tpch_csv?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
    FORMAT = (
      TYPE = 'CSV'
      FIELD_DELIMITER = '|'
      FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    )
    PATTERN = 'partsupp.tbl';
  ```

Query the data in the external table.

```sql
SELECT * FROM test_tbl1_csv_oracle;
```

The query result is as follows:

```shell
+----+----------+------------+
| id | name     | c_date     |
+----+----------+------------+
|  1 | Emma     | 2021-09-01 |
|  2 | William  | 2021-09-02 |
|  3 | Olivia   | 2021-09-03 |
+----+----------+------------+
3 rows in set
```
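As a quick sanity check on the Kerberos-authenticated `partsupp` table, an aggregate query works the same way (a sketch; the count depends entirely on the rows in `partsupp.tbl`):

```sql
SELECT COUNT(*) FROM partsupp;
```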
Example 3
Assume that an ORC file named `data.orc` is stored in the `/home/admin/orc_test/` directory.

Create an ORC external table.

```sql
obclient> CREATE EXTERNAL TABLE ext_table_orc (
    id NUMBER,
    name VARCHAR2(50)
  )
  LOCATION = '/home/admin/orc_test/'
  FORMAT = (
    TYPE = 'ORC'
  )
  PATTERN = 'data.orc';
```

**Note**

By specifying `FORMAT = ( TYPE = 'ORC' )`, the external table can directly read the ORC data file. Ensure that the actual format of the external data file matches the declared format to avoid parsing failures.
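Once created, the ORC-backed table can be queried like any internal table; for example (a sketch, as the output depends on the contents of `data.orc`):

```sql
SELECT id, name FROM ext_table_orc WHERE id IS NOT NULL;
```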
Example 4
Assume that a CSV file exists in a certain file path on HDFS, with the following content:

```shell
$hdfs dfs -cat /hadoop_ha_test/test_simple.csv
1,lili,19
2,alic,20
3,solvi,21
```

Enable the OBServer-related JNI configuration items.

For more information about the JNI configuration, see Deploy the OceanBase Database JAVA SDK environment.

Create an HDFS external table.

```sql
obclient> CREATE EXTERNAL TABLE test_ha (
    id INT,
    r_name VARCHAR(100),
    age INT
  )
  LOCATION = 'hdfs://mycluster/hadoop_ha_test?principal=ha/xxx@xxx.com&keytab=/path/to/ha.keytab&krb5conf=/path/to/krb5conf_file&configs=dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=localhost1:port#dfs.namenode.rpc-address.mycluster.nn2=localhost2:port#dfs.ha.automatic-failover.enabled.mycluster=true#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'test_simple.csv';
```

Query the data in the external table.

```sql
obclient> SELECT * FROM test_ha;
```

The query result is as follows:

```shell
+----+--------+-------+
| ID | R_NAME | AGE   |
+----+--------+-------+
|  1 | lili   |    19 |
|  2 | alic   |    20 |
|  3 | solvi  |    21 |
+----+--------+-------+
3 rows in set
```