You can execute the CREATE EXTERNAL TABLE statement to create an external table. When you create an external table, you must specify the path and format of the data files so that the system can read data from the external files.
Privilege requirements
To create an external table, the current user must have the CREATE TABLE privilege. For information about how to view the privileges of the current user, see View user privileges. If you do not have the CREATE TABLE privilege, contact the administrator to grant you the privilege. For information about how to grant privileges to a user, see Directly grant privileges.
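If the administrator grants the privilege directly, the statement might look like the following sketch. The user name test_user is hypothetical:

```sql
-- Executed by a user with administrative privileges.
-- "test_user" is a hypothetical user name.
GRANT CREATE TABLE TO test_user;
```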
Create an external table
The following statement is used to create an external table:
CREATE EXTERNAL TABLE table_name
( col_name col_type [AS (metadata$filecol{N})]
[ , col_name col_type [AS (metadata$filecol{N})] ]
[ , ... ] )
LOCATION = '<string>'
FORMAT = (
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = { TRUE | FALSE }
TRIM_SPACE = { TRUE | FALSE }
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
COMPRESSION = {GZIP | ZSTD | DEFLATE | NONE | SNAPPY_BLOCK}
PARALLEL_PARSE_ON_SINGLE_FILE = {TRUE | FALSE}
PARALLEL_PARSE_FILE_SIZE_THRESHOLD = <int>
MAX_ROW_LENGTH = <int>
)
[ PATTERN = '<regex_pattern>' ]
The related parameters are described below:
col_name col_type [AS (metadata$filecol{N})]: used to define a column. Here, AS (metadata$filecol{N}) specifies that the column is mapped manually. The external table supports the same types of columns as regular tables. For more information about the data types supported by OceanBase Database in Oracle mode, see Built-in data types.
By default, the data columns in an external file are automatically mapped to the columns defined in the external table in sequence, meaning that the first column of the external table is automatically mapped to the first column of the external file.
For example, in the following statement, the C1 column of the ext_t1 table is automatically mapped to the first column of the external file, and the C2 column of the ext_t1 table is automatically mapped to the second column of the external file.

CREATE EXTERNAL TABLE ext_t1 ( C1 int, C2 int )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

If the order of columns in the external file is inconsistent with the order of columns defined in the external table, you can specify which column in the external file corresponds to which column in the external table by using a pseudo column of the metadata$filecol{N} type. Note that file columns are numbered starting from 1.

For example, in the following statement, the C1 column of the ext_t2 table corresponds to the second column of the external file, and the C2 column of the ext_t2 table corresponds to the fourth column of the external file.

CREATE EXTERNAL TABLE ext_t2 ( C1 int AS (metadata$filecol2), C2 int AS (metadata$filecol4) )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

Notice
If you manually define column mapping, the automatic mapping feature is disabled. In this case, all columns must be manually mapped.
LOCATION = '<string>': used to specify the path where external files are stored. Typically, data files of an external table are stored in a dedicated directory, which may contain subdirectories. When you create an external table, all files in this directory are automatically collected by the external table. The following formats are supported:
Local path format: LOCATION = '[file://] local_file_path'

local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.

Notice
The local_file_path parameter specifies a directory, not a file. If you want to specify one specific file, specify the parent directory of the file in LOCATION, and then set the PATTERN attribute to match the file.

For the local path format, if you configure the secure_file_priv system variable to specify the paths that OceanBase Database can access, local_file_path must be a subpath of the directory specified by the secure_file_priv variable.

The tenant-level secure_file_priv system variable specifies the file paths that OceanBase Database can access during import and export operations. For more information about secure_file_priv, see secure_file_priv.
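Putting the pieces together, a local-path external table might be created as in the following sketch. The directory /home/admin/data/ and the secure_file_priv setting are assumptions for illustration:

```sql
-- Assumes secure_file_priv has been set to '/home/admin/' (hypothetical),
-- so the LOCATION below is a subpath of it.
CREATE EXTERNAL TABLE ext_local (
  C1 int,
  C2 int
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```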
Remote path format:

Notice
When you specify an object storage path, parameters are separated by the & symbol. Only uppercase and lowercase English letters, digits, the characters /-_$+=, and wildcards are supported as parameter values. If you enter other characters, the configuration may fail.

If the file is stored on Alibaba Cloud OSS or AWS S3, the format is LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'. In this format, $ACCESS_ID, $ACCESS_KEY, and $HOST are the access parameters that are required to access Alibaba Cloud OSS, AWS S3, or an object storage service that is compatible with the S3 protocol. s3_region specifies the region. These sensitive access parameters are stored in the system tables of the database in an encrypted manner.

If the file is stored on HDFS, you can specify the path in one of the following formats:
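For example, an S3 path might be specified as in the following sketch. The region and bucket path are hypothetical, and the $-prefixed placeholders must be replaced with real values:

```sql
CREATE EXTERNAL TABLE ext_s3 (
  C1 int,
  C2 int
)
LOCATION = 's3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/my_bucket/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);
```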
If you specify the address of a single NameNode (NN) to access the HDFS cluster, you can specify the path in the LOCATION = 'hdfs://localhost:port/PATH' format. Here, localhost specifies the address of the HDFS cluster, port specifies the port number of the HDFS cluster, and PATH specifies the file path in the HDFS cluster.

If you enable Kerberos authentication, you can specify the path in the LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx' format. Here:

principal: specifies the username for login authentication.
keytab: specifies the path of the user authentication key file.
krb5conf: specifies the path of the Kerberos configuration file.
configs: specifies additional HDFS configurations. The default value is empty. However, in a Kerberos environment, this parameter usually has a value, and you need to set it as required. Example: dfs.data.transfer.protection=authentication,privacy, which sets the data transfer protection level to authentication and privacy.
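A Kerberos-enabled HDFS path might look like the following sketch. The NameNode address, principal, and key file paths are all hypothetical:

```sql
CREATE EXTERNAL TABLE ext_hdfs (
  C1 int,
  C2 int
)
LOCATION = 'hdfs://192.168.0.1:8020/user/data?principal=hdfs_user&keytab=/etc/security/user.keytab&krb5conf=/etc/krb5.conf&configs=dfs.data.transfer.protection=authentication'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```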
If you enable Hadoop high availability (HA), you can specify the path in the LOCATION = 'hdfs://nameserviceID/PATH' format. Here, nameserviceID specifies the ID of the logical naming service of Hadoop HA, and PATH specifies the file path.

Note
Ensure that the configuration of the OBServer node contains the nameservice definition and failover strategy of the HA cluster.

For Kerberos authentication, the format is as follows:

LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'

where:

principal: specifies the user for login authentication. Set it to the principal of a non-primary NameNode.
keytab and krb5conf: set them the same as for a single-NameNode deployment.
configs: specifies additional HDFS configuration items. You can set only HA-related and security-related configuration items:
dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice alias of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs in the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the address of the namenode1 NameNode to facilitate client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the address of the namenode2 NameNode to facilitate client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies whether the HA cluster automatically obtains an available NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for primary/standby switching in the HA cluster. You can also customize and upload the failover logic required by the HA cluster.
Notice
HA-related configuration items are bound to the namespace. For example, in sample three, mycluster is the namespace of the HA cluster. Make sure that the related configuration items are set correctly.
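Without Kerberos, an HA path only needs the nameservice ID. In the following sketch, mycluster is a hypothetical nameservice ID that must already be defined in the OBServer node's HDFS configuration:

```sql
CREATE EXTERNAL TABLE ext_hdfs_ha (
  C1 int,
  C2 int
)
LOCATION = 'hdfs://mycluster/user/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```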
FORMAT = ( TYPE = 'CSV'...): specifies the external file format as CSV. The parameters are as follows:

TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the file.
FIELD_DELIMITER: specifies the column delimiter of the file.
ESCAPE: specifies the escape character of the file.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the symbol used to enclose field values in the file. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' indicates that values are enclosed in double quotation marks. If not specified, the default value is empty.

Notice
If the external data file contains NULL values (actual NULL values, not the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and the value of this parameter cannot be empty.

ENCODING: specifies the character set encoding of the file. For information about all character sets supported in Oracle mode, see Character sets. If not specified, the default value is UTF8MB4.
NULL_IF: specifies the strings to be treated as NULL. If not specified, the default value is empty.
SKIP_HEADER: specifies the number of rows to skip from the file header. If not specified, the file header is not skipped.
SKIP_BLANK_LINES: specifies whether to skip blank lines. If not specified, the default value is FALSE.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. If not specified, the default value is FALSE.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. If not specified, the default value is FALSE.
IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last field of a line when the line ends with an empty field (that is, when the line delimiter is directly preceded by a column delimiter). The default value is TRUE, indicating that the last empty field is ignored.
COMPRESSION: optional. Specifies the compression format of the file. Valid values:
GZIP/DEFLATE: specifies a GZIP-compressed file.
ZSTD: specifies a ZSTD-compressed file.
NONE: specifies an uncompressed file. This is the default value.
SNAPPY_BLOCK: specifies a Hadoop-Snappy-compressed file.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold, in bytes, above which a single file is parsed in parallel. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length, in bytes, of a single row of data. The default value is 2 MB. This parameter is used in the boundary determination stage of parallel parsing of a single CSV file and does not affect the accuracy of data reading.
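Several of the CSV options above can be combined in one FORMAT clause, as in the following sketch. The table name, directory, and file layout are hypothetical:

```sql
-- A CSV file with one header row, quoted fields, and GZIP compression.
CREATE EXTERNAL TABLE ext_csv (
  ID   int,
  NAME VARCHAR2(30)
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('NULL')
  COMPRESSION = GZIP
);
```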
FORMAT = ( TYPE = 'PARQUET'...): specifies the external file format as PARQUET.

Note
Starting from OceanBase Database V4.6.0, the following features are supported for external tables in the Parquet/ORC file formats:
- Reading complex data types, including Array, Map, and JSON. The data types of the external table columns must be compatible with those supported by OceanBase Database. For more information, see Overview of built-in data types.
- Reading Bloom filters to filter data based on query predicates, thereby reducing unnecessary I/O and accelerating queries.
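A Parquet external table is declared in the same way, with only the FORMAT type changed. The path below is hypothetical, and the column list must be compatible with the schema of the Parquet files:

```sql
CREATE EXTERNAL TABLE ext_parquet (
  ID   int,
  NAME VARCHAR2(30)
)
LOCATION = '/home/admin/parquet_data/'
FORMAT = (
  TYPE = 'PARQUET'
);
```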
PATTERN: specifies the regular expression pattern used to filter files in the LOCATION directory. For each file in the LOCATION directory, if the file matches the pattern, the external table can access it; otherwise, the file is skipped. If this parameter is not specified, all files in the LOCATION directory are accessible.
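For example, the following sketch limits a hypothetical external table to the files ending in .csv under the LOCATION directory:

```sql
CREATE EXTERNAL TABLE ext_pattern (
  C1 int,
  C2 int
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = '.*\.csv';
```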
Assume that the /home/admin/oceanbase/ directory on the local machine contains a data.csv file with the following content.
1,"lin",98
2,"hei",90
3,"ali",95
Connect to the Oracle tenant of the cluster as the tenant administrator on the OBServer node using the local Unix socket.
Here is an example:
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********

For more information about how to connect to OceanBase Database using the local Unix socket, see secure_file_priv.
Configure the database to be able to access the /home/admin/oceanbase/ directory:

SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";

After executing the command, you need to restart the session for the change to take effect.
Reconnect to the database and create an external table named ext_t3:

CREATE EXTERNAL TABLE ext_t3 (
  ID NUMBER(32),
  NAME VARCHAR2(30),
  SCORE NUMBER(32)
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can use the SHOW CREATE TABLE statement to view the table definition, just like with a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TABLE | CREATE TABLE |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXT_T3 | CREATE EXTERNAL TABLE "EXT_T3" (
"ID" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL1),
"NAME" VARCHAR2(30) GENERATED ALWAYS AS (METADATA$FILECOL2),
"SCORE" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)COMPRESS FOR ARCHIVE REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When you query the external table, the system reads the external file directly through the external table driver, parses it according to the file format, converts the data into internal data types of OceanBase Database, and returns the data rows. Here is an example of querying the newly created external table ext_t3.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| ID | NAME | SCORE |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
In addition, you can perform combined queries with external tables and regular tables. Assume that the current database contains a regular table named info with the following data:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| li | female | 8 |
+------+--------+------+
3 rows in set
Here is an example of combining the external table ext_t3 with the regular table info.
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| li | female | 8 |
+------+--------+------+
2 rows in set
For more information about query operations, see Read data.
Considerations
You can query external tables only. You cannot perform DML operations on external tables.
If the external files accessed by an external table have been deleted, a query does not return an error; it returns an empty result instead.
If the external storage system is unavailable, an error is returned when you query an external table.
What to do next
When you create an external table, the system saves the list of files that match the PATTERN in the specified path in the LOCATION clause to the system table of OceanBase Database. When you scan the external table, the system accesses the external files based on the list. If other files are added to the external directory, you must update the external table files to add the new files to the file list of the external table. For more information, see Manage external files.
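The exact refresh syntax is described in Manage external files; a typical statement is sketched below for the ext_t3 table and should be verified against that topic:

```sql
-- Rescan the LOCATION directory and update the external table's file list.
ALTER EXTERNAL TABLE ext_t3 REFRESH;
```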
After you create an external table, you can delete it when it is no longer needed. The statement for deleting an external table is the same as that for deleting a regular table: execute the DROP TABLE statement. For more information, see Delete a table.
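For example, the following statement deletes the ext_t3 external table created above:

```sql
DROP TABLE ext_t3;
```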
