Use the CREATE EXTERNAL TABLE statement to create an external table. When you create an external table, you need to specify the path and format of the data files so that data can be read from the external files.
Privilege requirements
To create an external table, the current user must have the CREATE TABLE privilege. For more information about how to view the privileges of the current user, see View user privileges. If you do not have the CREATE TABLE privilege, contact your administrator to grant you the privilege. For more information about how to grant privileges to a user, see Directly grant privileges.
Create an external table
The following example shows how to create an external table:
CREATE EXTERNAL TABLE table_name
( col_name col_type [AS (metadata$filecol{N})]
[ , col_name col_type [AS (metadata$filecol{N})] ]
[ , ... ] )
LOCATION = '<string>'
FORMAT = (
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = { TRUE | FALSE }
TRIM_SPACE = { TRUE | FALSE }
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
IGNORE_LAST_EMPTY_COLUMN = { TRUE | FALSE }
COMPRESSION = { GZIP | ZSTD | DEFLATE | NONE | SNAPPY_BLOCK }
PARALLEL_PARSE_ON_SINGLE_FILE = { TRUE | FALSE }
PARALLEL_PARSE_FILE_SIZE_THRESHOLD = <int>
MAX_ROW_LENGTH = <int>
)
[ PATTERN = '<regex_pattern>' ]
The following information describes the parameters in the preceding statement.
col_name col_type [AS (metadata$filecol{N})]: defines a column. AS (metadata$filecol{N}) specifies that the column in the external table maps to the Nth column in the external file. The column types supported by an external table are the same as those supported by a regular table. For more information about the data types supported by OceanBase Database in Oracle mode, see Data types.
By default, the data columns in the external file are automatically mapped to the columns in the external table in order. That is, the first column in the external file is mapped to the first column in the external table.
For example, in the following statement, the C1 column in the ext_t1 table is automatically mapped to the first column in the external file, and the C2 column is automatically mapped to the second column in the external file.
CREATE EXTERNAL TABLE ext_t1
( C1 int,
  C2 int )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);
If the order of columns in the external file differs from the order of columns in the external table, you can use a pseudocolumn in the metadata$filecol{N} format to map a column in the external table to the Nth column in the external file. The columns in the external file are numbered starting from 1.
For example, in the following statement, C1 int AS (metadata$filecol2) maps the C1 column in the ext_t2 table to the second column in the external file, and C2 int AS (metadata$filecol4) maps the C2 column to the fourth column in the external file.
CREATE EXTERNAL TABLE ext_t2
( C1 int AS (metadata$filecol2),
  C2 int AS (metadata$filecol4) )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);
Notice
If you manually define the mapping for any column, automatic column mapping is disabled. You must then manually define the mapping for all columns.
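To make the mapping rule concrete, the following Python sketch illustrates the idea outside the database. It is a conceptual illustration only, not OceanBase code; the sample row and the column-to-index mapping are made up to mirror the ext_t2 example above.

```python
# Conceptual sketch of explicit external-table column mapping (not OceanBase
# internals). A CSV line is split on the field delimiter, and each table
# column picks the Nth file column, with file columns numbered from 1 --
# mirroring C1 AS (metadata$filecol2) in the ext_t2 example.

def map_row(fields, mapping):
    """fields: parsed file columns; mapping: table column -> 1-based file index."""
    return {col: fields[idx - 1] for col, idx in mapping.items()}

row = "10|alpha|20|beta|30".split("|")

# ext_t2-style mapping: C1 -> 2nd file column, C2 -> 4th file column.
print(map_row(row, {"C1": 2, "C2": 4}))  # {'C1': 'alpha', 'C2': 'beta'}
```

Note the 1-based numbering: metadata$filecol1 refers to the first column in the file, so the sketch subtracts 1 when indexing the Python list.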
LOCATION = '<string>': specifies the path where the external files are stored. Usually, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, the system automatically collects all files in the directory. The following two formats are supported:
Local location format:
LOCATION = '[file://] local_file_path'
local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.
Notice
local_file_path must be a directory, not a file. To specify a single file, set LOCATION to the directory that contains the file, and then use the PATTERN parameter to specify the file.
For the local location format, when you configure the secure_file_priv system variable to specify the file path that OceanBase Database can access, the value of secure_file_priv must be a directory that contains the local_file_path directory. In other words, local_file_path must be a subdirectory of the secure_file_priv directory.
The tenant-level system variable secure_file_priv specifies the file path that OceanBase Database can access when you import or export data to or from a file. For more information about secure_file_priv, see secure_file_priv.
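The subdirectory requirement can be sketched as a simple path-containment check. The following Python snippet is a conceptual illustration only (it is not how OceanBase validates the path internally), and the sample paths are hypothetical:

```python
import os

# Conceptual check: local_file_path must be equal to, or a subdirectory of,
# the directory configured in secure_file_priv.
def is_allowed(secure_file_priv, local_file_path):
    priv = os.path.abspath(secure_file_priv)
    path = os.path.abspath(local_file_path)
    # commonpath collapses to priv only when path lies under priv.
    return os.path.commonpath([priv, path]) == priv

# Hypothetical paths, modeled on the example later in this topic.
print(is_allowed("/home/admin/oceanbase", "/home/admin/oceanbase/data"))  # True
print(is_allowed("/home/admin/oceanbase", "/tmp/data"))                   # False
```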
Remote location format:
Notice
In an object storage path, parameters are separated by the & symbol. Make sure that the parameter values contain only uppercase and lowercase letters, digits, the characters /-_$+=, and wildcards. If the values contain other characters, the configuration may fail.
If the file is stored in Alibaba Cloud OSS or AWS S3, the format is:
LOCATION = '{oss | s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'
where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access information required for accessing Alibaba Cloud OSS, AWS S3, or an object storage service that is compatible with the S3 protocol, and s3_region is the region selected when you use S3. This sensitive access information is stored in a system table of the database in encrypted form.
If the file is stored in HDFS, the format is as follows:
If you access the cluster based on the address of a single-node NameNode (NN), the format is:
LOCATION = 'hdfs://localhost:port/PATH'
where localhost specifies the address of HDFS, port specifies the port number of HDFS, and PATH specifies the file path in HDFS.
If you access the cluster with Kerberos authentication, the format is:
LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'
where:
principal: specifies the user for login authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configurations. The default value is an empty string. However, in a Kerberos environment, this parameter usually needs to be set. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection levels to authentication and privacy.
If you access the cluster based on the logical naming service of Hadoop high availability (HA), the format is:
LOCATION = 'hdfs://nameserviceID/PATH'
where nameserviceID specifies the ID of the logical naming service of Hadoop HA, and PATH specifies the file path.
Note
Make sure that the OBServer node that acts as the client is configured with the nameservice definition and failover strategy of the HA cluster.
The format with Kerberos authentication is:
LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
where:
principal: specifies the user for Kerberos authentication. Set it to the principal of a non-primary NameNode (NN).
keytab and krb5conf: same as in the single-node NN case.
configs: specifies additional HDFS configurations. You can set only HA configurations and security configurations.
dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice alias of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs in the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the address of namenode1 to facilitate client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the address of namenode2 to facilitate client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies that the HA cluster automatically routes a request to an available NameNode.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for primary/standby switching in the HA cluster. You can also customize and upload the logic required by the HA cluster.
Notice
HA configurations are bound to the nameservice; for example, a nameservice ID such as mycluster. Make sure that you configure the related parameters correctly.
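Because the configs value is just a #-separated list of key=value pairs, it can be assembled programmatically. The following Python sketch is a hypothetical helper, not part of OceanBase; the nameservice ID, NameNode names, host addresses, and Kerberos paths are all made up for illustration:

```python
# Hypothetical helper (illustration only): assemble the '#'-separated configs
# value for a Kerberos-authenticated HA LOCATION string, following the
# key=value layout documented above.
def build_ha_configs(ns, namenodes, addresses):
    pairs = [
        ("dfs.nameservices", ns),
        (f"dfs.ha.namenodes.{ns}", ",".join(namenodes)),
    ]
    for nn, addr in zip(namenodes, addresses):
        pairs.append((f"dfs.namenode.rpc-address.{ns}.{nn}", addr))
    pairs.append((f"dfs.ha.automatic-failover.enabled.{ns}", "true"))
    pairs.append((
        f"dfs.client.failover.proxy.provider.{ns}",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    ))
    return "#".join(f"{k}={v}" for k, v in pairs)

# Made-up cluster details for illustration.
configs = build_ha_configs("mycluster", ["nn1", "nn2"],
                           ["node1.example.com:8020", "node2.example.com:8020"])
location = ("hdfs://mycluster/user/data?principal=obuser@EXAMPLE.COM"
            "&keytab=/etc/ob.keytab&krb5conf=/etc/krb5.conf&configs=" + configs)
print(location)
```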
FORMAT = ( TYPE = 'CSV' ... ): specifies that the external file format is CSV.
TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the file.
FIELD_DELIMITER: specifies the column delimiter of the file.
ESCAPE: specifies the escape character of the file.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the symbol for enclosing field values. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' indicates that values are enclosed in double quotation marks. If you do not specify this parameter, the default value is empty.
Notice
If the external file contains NULL values (actual NULL values, not the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.
ENCODING: specifies the character set encoding format of the file. For information about all character sets supported in Oracle mode, see Character sets. If you do not specify this parameter, the default value is UTF8MB4.
NULL_IF: specifies the strings to be treated as NULL. If you do not specify this parameter, the default value is empty.
SKIP_HEADER: specifies the number of header rows to skip. If you do not specify this parameter, the default value is 0, indicating that no header rows are skipped.
SKIP_BLANK_LINES: specifies whether to skip blank lines. If you do not specify this parameter, the default value is FALSE.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields. If you do not specify this parameter, the default value is FALSE.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. If you do not specify this parameter, the default value is FALSE.
IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore a trailing empty field at the end of a line. The default value is TRUE, indicating that the trailing empty field is ignored.
COMPRESSION: optional. Specifies the compression format of the file. Valid values:
GZIP/DEFLATE: a GZIP-compressed file.
ZSTD: a ZSTD-compressed file.
NONE: an uncompressed file. This is the default value.
SNAPPY_BLOCK: a Hadoop-Snappy-compressed file.
Note
The SNAPPY_BLOCK compression format is supported starting from V4.4.2 BP1.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If you do not specify this parameter, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold for parallel parsing, in bytes. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length of a single row, in bytes. The default value is 2 MB. This parameter is used in the boundary determination stage of parallel parsing of a single CSV file and does not affect the correctness of data reading.
Note
The PARALLEL_PARSE_ON_SINGLE_FILE, PARALLEL_PARSE_FILE_SIZE_THRESHOLD, and MAX_ROW_LENGTH parameters in the FORMAT option are supported starting from V4.4.2 BP1.
FORMAT = ( TYPE = 'PARQUET' ... ): specifies that the external file format is PARQUET.
PATTERN: specifies a regular expression pattern that filters files in the LOCATION directory. The external table can access a file in the LOCATION directory only if the file name matches the pattern; otherwise, the file is skipped. If you do not specify this parameter, the external table can access all files in the LOCATION directory by default.
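The effect of PATTERN can be sketched in a few lines of Python. This is a conceptual illustration with a made-up file list; in particular, whether OceanBase anchors the match against the whole file name is not specified here, so the sketch uses an unanchored regular-expression search:

```python
import re

# Conceptual sketch of PATTERN filtering (hypothetical file list): each file
# name under LOCATION is matched against the regular expression, and only
# matching files are visible to the external table.
files = ["data.csv", "data_2024.csv", "readme.txt", "data.csv.bak"]

def visible(files, pattern):
    return [f for f in files if re.search(pattern, f)]

print(visible(files, r'data.*\.csv$'))  # ['data.csv', 'data_2024.csv']
print(visible(files, r'\.txt$'))        # ['readme.txt']
```

The `$` anchor is what excludes `data.csv.bak` in the first call; without it, any file name merely containing `.csv` would match.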
Assume that the /home/admin/oceanbase/ directory on your local machine contains a data.csv file, and the content of the file is as follows.
1,"lin",98
2,"hei",90
3,"ali",95
Connect to the Oracle tenant of the cluster by using a local Unix socket as the tenant administrator on an OBServer node.
Here is an example:
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********
For more information about how to connect to OceanBase Database by using a local Unix socket, see secure_file_priv.
Configure the path /home/admin/oceanbase/ that the database can access.
SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";
After the command is executed, you must restart the session for the modification to take effect.
Connect to the database again and create the external table ext_t3.
CREATE EXTERNAL TABLE ext_t3 (
  ID NUMBER(32),
  NAME VARCHAR2(30),
  SCORE NUMBER(32)
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can execute the SHOW CREATE TABLE statement to view the table definition, just like a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TABLE | CREATE TABLE |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXT_T3 | CREATE EXTERNAL TABLE "EXT_T3" (
"ID" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL1),
"NAME" VARCHAR2(30) GENERATED ALWAYS AS (METADATA$FILECOL2),
"SCORE" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)COMPRESS FOR ARCHIVE REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When you query an external table, the system reads the external file directly through the driver layer of the external table, parses the file according to the file format, converts the data into internal data types of OceanBase Database, and returns the data rows. Here is an example of querying the external table ext_t3 that you just created.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| ID | NAME | SCORE |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
In addition, you can perform combined queries between an external table and a regular table. Assume that the current database contains a regular table named info with the following data:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| li | female | 8 |
+------+--------+------+
3 rows in set
Here is an example of performing a combined query between the external table ext_t3 and the regular table info.
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| li | female | 8 |
+------+--------+------+
2 rows in set
For more information about query operations, see Read data.
Considerations
You can query external tables but cannot perform DML operations on them.
If the external file accessed by an external table is deleted, no error is returned when you query the external table. Instead, an empty result is returned.
If the external storage system is unavailable, an error is returned when you query an external table.
What to do next
When you create an external table, OceanBase Database saves the list of files that match the PATTERN in the specified path of the LOCATION parameter to a system table. When you scan the external table, it accesses the external files based on the list. If other files are added to the external directory, you must update the external table files to add the new files to the list. For more information, see Manage external files.
After you create an external table, you can also drop it. The statement for dropping an external table is the same as that for dropping a regular table. You can execute the DROP TABLE statement to drop an external table. For more information, see Drop a table.
