Purpose
This statement is used to create an external table in a database.
An external table is a key feature of a database management system. Usually, tables in a database are stored in the storage space of the database. In contrast, data of an external table is stored in an external storage service.
When you create an external table, you need to specify the file path and file format of the data. After the external table is created, you can use it to read data from the external storage service. An external table is read-only. You can use it in a query statement, but you cannot perform DML operations on it. External tables do not support constraints and indexes.
Syntax
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'file_name'
FORMAT = (format_type_options)
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = '<regex_pattern>']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS expr]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
type_orc_option:
TYPE = 'ORC'
Parameter description
| Parameter | Description |
|---|---|
| table_name | The name of the external table to be created. |
| column_name | The name of the column of the external table. By default, the data columns in the file are automatically mapped to the columns defined in the external table. |
| column_type | The column type of the external table. You cannot define constraints by using this parameter. Valid values are DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY. |
| AS expr | Specifies the manual column mapping. When the column order in the file is different from the column order defined in the external table, you can use the pseudo-column metadata$filecol{N} to specify the mapping between the columns in the external table and the Nth column in the file. For example, c2 INT AS (metadata$filecol4) specifies that the c2 column in the external table corresponds to the fourth column in the file. Note that if manual column mapping is specified, automatic mapping is disabled and all columns must be manually mapped. |
| LOCATION = 'file_name' | The path where the external table file is stored. Usually, the data file of an external table is stored in a separate directory, which can contain subdirectories. When the external table is created, it automatically collects all files in the directory. For more information, see file_name. |
| FORMAT = (format_type_options) | The attributes of the external file format. The TYPE keyword is used to specify the CSV or PARQUET file format. The TYPE parameter cannot be empty. For more information, see format_type_options. |
| PATTERN | A regular pattern string used to filter files in the LOCATION directory. For each file in the directory specified by LOCATION, if the file path matches the pattern string, the external table accesses the file; otherwise, the external table skips the file. If this parameter is not specified, all files in the LOCATION directory are accessible by default. The external table stores the list of files that meet the PATTERN requirement in the database system table. During a scan, the external table accesses external files based on this list. |
| PARTITION_TYPE = USER_SPECIFIED | If you want to manually add and delete partitions instead of having the external table automatically manage partitions, specify the PARTITION_TYPE = USER_SPECIFIED field. |
| AUTO_REFRESH = 'xxx' | Specifies automatic refresh for the external table. Valid values are:
|
file_name
The path for storing the external file can be in the following formats:
If the file is stored locally, the
LOCATIONvalue is in theLOCATION = '[file://] local_file_path'format. Here,local_file_pathcan be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.secure_file_privspecifies the file path that the OBServer node has the privilege to access.local_file_pathcan be only a subpath ofsecure_file_priv.If the file is stored remotely, the
LOCATIONvalue is in the following format:If the file is stored in Alibaba Cloud OSS or AWS S3, the
LOCATIONvalue is in theLOCATION = '{oss\|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'format. Here,$ACCESS_ID,$ACCESS_KEY, and$HOSTare the access information that you must configure to access Alibaba Cloud OSS, AWS S3, or an object storage service that is compatible with the S3 protocol.s3_regionspecifies the region information that you select when you use S3. These sensitive access information is stored in the system tables of the database in an encrypted manner.If the file is stored in HDFS, the
LOCATIONvalue is in theLOCATION = hdfs://localhost:port/PATHformat. Here,localhostspecifies the address of HDFS,portspecifies the port number of HDFS, andPATHspecifies the directory path in HDFS.The format with Kerberos authentication is
LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'. Here:principal: specifies the login authentication user.keytab: specifies the key file for user authentication.krb5conf: specifies the description file for the Kerberos environment of the user.configs: specifies additional HDFS parameters. By default, this parameter is empty. However, in a Kerberos environment, this parameter is usually not empty and needs to be configured. For example,dfs.data.transfer.protection=authentication,privacyspecifies that the data transfer protection level isauthenticationandprivacy.
Notice
When you use an object storage path, the parameters in the object storage path are separated with the
&symbol. Make sure that the parameter values contain only uppercase and lowercase English letters, numbers,/-_$+=, and wildcards. If you specify other characters, the settings may fail.
format_type_options
TYPE = 'CSV': specifies the external file format as CSV. It also includes the following fields:LINE_DELIMITER: specifies the line delimiter for the CSV file.FIELD_DELIMITER: specifies the field delimiter for the CSV file.ESCAPE: specifies the escape character for the CSV file, which can be only 1 byte.FIELD_OPTIONALLY_ENCLOSED_BY: specifies the characters that enclose the field values in the CSV file. The default value is empty.Notice
If the external data file contains
NULLvalues (non-string NULL values, not "NULL"), you must explicitly configure theFIELD_OPTIONALLY_ENCLOSED_BYparameter, and the value of this parameter cannot be empty.ENCODING: specifies the character set encoding for the file. For more information about all character sets supported in Oracle mode, see Character sets. If not specified, the default value is UTF8MB4.NULL_IF: specifies the strings to be treated asNULL. The default value is empty.SKIP_HEADER: specifies to skip the file header and the number of lines to skip.SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value isFALSE, indicating that blank lines are not skipped.TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. The default value isFALSE, indicating that leading and trailing spaces are not removed.EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings asNULL. The default value isFALSE, indicating that empty strings are not treated asNULL.IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the empty field at the end of a line if the line ends with an empty field (when a column separator appears before the row separator). The default value isTRUE, indicating that the last empty field is ignored.Note
For V4.3.5,
IGNORE_LAST_EMPTY_COLUMNis supported starting from V4.3.5 BP2.
TYPE = 'PARQUET': specifies the external file format asPARQUET.TYPE = 'ORC': specifies the external file format asORC.
Considerations
- If an external file is deleted, the file will no longer be in the list of external files accessed by the external table. In this case, the external table will ignore the missing file.
- If an external file is modified, the external table accesses the latest data of the external file. If the modification of the external file and the query of the external table are performed concurrently, unexpected results may occur. Therefore, avoid modifying the external file while querying the external table.
- If a new file is added to the external directory, the external table only accesses the files in the file list. If you want to add the new file to the file list of the external table, you must perform the operation to update the external table file list.
Examples
Example 1
Prepare the data. First, set the
secure_file_privpath to/home/admin/. Place the CSV fileextdata.csvcorresponding to the external table data in the/home/admin/testdirectory on the current OBServer node.Here is an example of setting the global secure path.
obclient> SET GLOBAL secure_file_priv = "" Query OK, 0 rows affected obclient> \q ByeNote
Since
secure_file_privis aGLOBALvariable, you must execute\qto exit and make it take effect.The content of the CSV file is as follows:
1,'Dave','Smith','dsmith@outlook.com','friend',32 2,'Xena','Johnson','xjonson@outlook.com','contact',45 3,'Fred','Jackon','fjackson@outlook.com','co-worker',19 4,'Alma','Tyler','atyler@outlook.com','friend',53After logging in to the user tenant, create the external table
contacts.obclient> CREATE EXTERNAL TABLE contacts ( id INT, firstname VARCHAR(100), lastname VARCHAR(100), email VARCHAR(255), category CHAR(30), age NUMBER ) LOCATION = '/home/admin/test/' FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' FIELD_OPTIONALLY_ENCLOSED_BY ='''' )PATTERN ='extdata.csv';Query the data in the external table
contacts.obclient> SELECT * FROM contacts; +------+-----------+----------+----------------------+--------------------------------+------+ | ID | FIRSTNAME | LASTNAME | EMAIL | CATEGORY | AGE | +------+-----------+----------+----------------------+--------------------------------+------+ | 1 | Dave | Smith | dsmith@outlook.com | friend | 32 | | 2 | Xena | Johnson | xjonson@outlook.com | contact | 45 | | 3 | Fred | Jackon | fjackson@outlook.com | co-worker | 19 | | 4 | Alma | Tyler | atyler@outlook.com | friend | 53 | +------+-----------+----------+----------------------+--------------------------------+------+ 4 rows in set
Example 2
Assume that a CSV file exists at a specific path on HDFS. The content of the CSV file is as follows:
$hdfs dfs -cat /user/test_tbl1.csv 1,'Emma','2021-09-01' 2,'William','2021-09-02' 3,'Olivia','2021-09-03'Create an HDFS external table.
If Kerberos authentication is not enabled in the target HDFS environment
CREATE EXTERNAL TABLE test_tbl1_csv_oracle ( id INT, name VARCHAR(50), c_date DATE ) LOCATION = 'hdfs://${hadoop_namenode_hostname}:${hadoop_namenode_port}/user' FORMAT = ( TYPE = 'CSV', FIELD_DELIMITER = ',', FIELD_OPTIONALLY_ENCLOSED_BY = '"' ) PATTERN = 'test_tbl1.csv';Notice
Here,
hadoop_namenode_hostnameandhadoop_namenode_portrefer to the hostname and port of the HDFS node. You must replace these with actual values.If Kerberos authentication is enabled in the target HDFS environment
Notice
When Kerberos authentication is enabled, the OBServer node must be configured with a keytab file (user authentication key file) and a krb5conf file.
CREATE EXTERNAL TABLE partsupp ( PS_PARTKEY INTEGER , PS_SUPPKEY INTEGER , PS_AVAILQTY INTEGER , PS_SUPPLYCOST DECIMAL(15,2) , PS_COMMENT VARCHAR(199) ) LOCATION = 'hdfs://localhost:8020/tpch_csv?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx' FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = '|' FIELD_OPTIONALLY_ENCLOSED_BY ='"' ) PATTERN = 'partsupp.tbl';Query the external table data.
select * from test_tbl1_csv_oracle;The return result is as follows:
+----+----------+------------+ | id | name | c_date | +----+----------+------------+ | 1 | Emma | 2021-09-01 | | 2 | William | 2021-09-02 | | 3 | Olivia | 2021-09-03 | +----+----------+------------+ 3 rows in set
Example 3
Assume that an ORC-formatted data file named
data.orcis stored in the/home/admin/orc_test/directory.Create an ORC-formatted external table.
obclient> CREATE EXTERNAL TABLE ext_table_orc ( id NUMBER, name VARCHAR2(50) ) LOCATION = '/home/admin/orc_test/' FORMAT = ( TYPE = 'ORC' ) PATTERN = 'data.orc';Note
By specifying
FORMAT = ( TYPE = 'ORC' ), the external table can directly read the ORC-formatted data file. Make sure that the actual format of the external data file matches the declared format, otherwise parsing will fail.