You can execute the CREATE EXTERNAL TABLE statement to create an external table. When you create an external table, you must specify the path and format of the data files so that the database can read data from the external files.
Permissions
To create an external table, the current user must have the CREATE permission. For more information about how to view the permissions of the current user, see View user permissions.
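For example, in MySQL mode, you can run a statement like the following to check the privileges of the current user (a minimal sketch; the output depends on your deployment):
-- List the privileges granted to the current user; look for CREATE.
SHOW GRANTS;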
Create an external table
The SQL statement for creating an external table is as follows:
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'location_path'
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS metadata$filecol{N} | external$filepos{N}]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
COMPRESSION = {GZIP | ZSTD | DEFLATE | NONE | SNAPPY_BLOCK}
PARALLEL_PARSE_ON_SINGLE_FILE = {TRUE | FALSE}
PARALLEL_PARSE_FILE_SIZE_THRESHOLD = <int>
MAX_ROW_LENGTH = <int>
type_parquet_option:
TYPE = 'PARQUET'
[COLUMN_INDEX_TYPE = {'POSITION' | 'NAME'}]
type_orc_option:
TYPE = 'ORC'
[COLUMN_INDEX_TYPE = 'POSITION']
properties_type_options:
type_odps_option
type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}
The following list describes the related parameters.
col_name col_type [AS (metadata$filecol{N} | external$filepos{N})]: defines a column. AS (metadata$filecol{N}) manually defines the column mapping, and AS (external$filepos{N}) indexes columns by position. The column types supported by external tables are the same as those supported by regular tables. For more information about the data types supported by OceanBase Database in MySQL mode, see Data types.

By default, the data columns in an external file are automatically mapped, in order, to the columns defined in the external table. That is, the first column of the external table corresponds to the first column of the external file.

For example, in the following statement, the C1 column of the external table ext_t1 is automatically mapped to the first column of the external file, and the C2 column is automatically mapped to the second column of the external file.

CREATE EXTERNAL TABLE ext_t1 (
  C1 int,
  C2 int
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

When the order of columns in an external file differs from the order of columns defined in the external table, you can specify the column mapping in the following two ways:
- Use the metadata$filecol{N} syntax: This syntax applies to CSV files. It maps an external table column to the Nth column of the external file. The file columns are numbered starting from 1.
- Use the external$filepos{N} syntax: This syntax applies to Parquet and ORC files. It indexes columns by position. You must also specify the COLUMN_INDEX_TYPE = 'POSITION' parameter.
For example, in the following statement, C1 int AS (metadata$filecol2) specifies that the C1 column of the external table ext_t2 corresponds to the second column of the external file, and C2 int AS (metadata$filecol4) specifies that the C2 column corresponds to the fourth column of the external file.

CREATE EXTERNAL TABLE ext_t2 (
  C1 int AS (metadata$filecol2),
  C2 int AS (metadata$filecol4)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

For Parquet and ORC files, use the external$filepos{N} syntax:

CREATE EXTERNAL TABLE ext_t3 (
  c1 INT AS (external$filepos1),
  c2 INT AS (external$filepos2),
  c3 VARCHAR(50) AS (external$filepos3),
  c4 DATE AS (external$filepos4)
)
LOCATION = 'path/to/data/'
FORMAT = (TYPE = 'PARQUET' COLUMN_INDEX_TYPE = 'POSITION')
PATTERN = "000000_0$";
Notice

- If you manually define the column mapping, the automatic column mapping feature is disabled, and all columns must be manually defined.
- When you use the external$filepos{N} syntax, you must specify the COLUMN_INDEX_TYPE = 'POSITION' parameter.

LOCATION = '<string>': specifies the path where the external files are stored. Usually, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, the table automatically collects all files in the directory. The following two formats are supported:
Local location format: LOCATION = '[file://] local_file_path'

local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.

Notice

local_file_path must be a directory, not a file. If you want to specify a single file, specify the parent directory of the file in LOCATION, and select the file by setting the PATTERN attribute.

In the local location format, when you configure the secure_file_priv system variable to specify the file path that OceanBase Database can access, secure_file_priv must be a parent directory of local_file_path; that is, local_file_path must be a subdirectory of secure_file_priv. The secure_file_priv system variable specifies the path that OceanBase Database can access when it imports data from or exports data to a file. For more information about secure_file_priv, see secure_file_priv.
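Here is a minimal sketch of checking the current setting before you create a local external table:

-- View the directory that OceanBase Database is allowed to access for file operations.
SHOW GLOBAL VARIABLES LIKE 'secure_file_priv';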
Remote location format:

Notice

When you use an object storage path, the parameters in the path are separated by the & symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, the characters /-_$+=, and wildcards. If you enter other characters, the settings may fail.

When the files are stored in Alibaba Cloud OSS, AWS S3, or an object storage service compatible with the S3 protocol, the format is LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path', where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access information required to access Alibaba Cloud OSS, AWS S3, or the S3-compatible object storage service, and s3_region is the region selected when you use S3. This sensitive access information is stored in encrypted form in a system table of the database.

When the files are stored in HDFS, the following formats are supported:
Single-node NameNode (NN) address-based access format: LOCATION = 'hdfs://localhost:port/PATH', where localhost specifies the address of HDFS, port specifies the port number of HDFS, and PATH specifies the file path in HDFS.

Kerberos authentication-based format: LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx', where:

- principal: specifies the user for login authentication.
- keytab: specifies the path of the key file for user authentication.
- krb5conf: specifies the path of the description file for the Kerberos environment.
- configs: specifies additional HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment, it usually needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection levels to authentication and privacy.
For a Hadoop HA cluster, the format is LOCATION = 'hdfs://nameserviceID/PATH', where nameserviceID specifies the logical nameservice ID of the Hadoop HA cluster, and PATH specifies the file path.

Note

Make sure that the OBServer node, which acts as the HDFS client, is configured with the nameservice definition and failover strategy of the HA cluster.

For a Kerberos-authenticated HA cluster, the format is LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', where:

- principal: specifies the login user. Set it to the principal of a non-NameNode node.
- keytab and krb5conf: the same as for a single-node NameNode.
- configs: specifies additional HDFS configurations. Only HA and security configurations can be set.
  - dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
  - dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the HA cluster.
  - dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the IDs of the NameNodes in the HA cluster.
  - dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the address of namenode1 to facilitate client routing.
  - dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the address of namenode2 to facilitate client routing.
  - dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies that the HA cluster automatically obtains a valid NameNode to respond to requests.
  - dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for failover between the primary and standby nodes in the HA cluster. You can also customize and upload the failover logic required by the HA cluster.
Notice

Some HA configurations are bound to the nameservice. For example, in the preceding sample, the HA configurations are bound to the nameservice ID (such as mycluster). Make sure that the HA configurations are set properly.
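To put the remote location format together, here is a minimal sketch that creates an external table over CSV files in S3-compatible object storage. The table name ext_orders, the bucket path, and the us-east-1 region are illustrative placeholders, and $ACCESS_ID, $ACCESS_KEY, and $HOST stand for your real access information:

-- Hypothetical table over CSV files stored in an S3 bucket directory.
CREATE EXTERNAL TABLE ext_orders (
  o_id int,
  o_price decimal(10,2)
)
LOCATION = 's3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/sales/orders/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);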
FORMAT = (TYPE = 'CSV' ...): specifies the external file format as CSV.

- TYPE: specifies the type of the external file.
- LINE_DELIMITER: specifies the line delimiter. The default value is LINE_DELIMITER = '\n'.
- FIELD_DELIMITER: specifies the column delimiter. The default value is FIELD_DELIMITER = '\t'.
- ESCAPE: specifies the escape character. For example, ESCAPE = '*' specifies the asterisk (*) as the escape character in place of the default escape character (\). The default value is ESCAPE = '\'.
- FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character that encloses field values. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' specifies that values are enclosed in double quotation marks. The default value is empty.

  Notice

  If the data file of the external table contains NULL values (actual NULL values rather than the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.

- ENCODING: specifies the character set encoding of the file. For information about all character sets supported in MySQL mode, see Character sets. The default value is UTF8MB4.
- NULL_IF: specifies the strings that are treated as NULL. The default value is empty.
- SKIP_HEADER: specifies the number of header lines to skip. By default, the header is not skipped.
- SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE.
- TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields. The default value is FALSE.
- EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE.
- IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last empty field at the end of a line. The default value is TRUE, which means that the last empty field is ignored.
- COMPRESSION: optional. Specifies the compression format of the file. Valid values:
  - GZIP/DEFLATE: a GZIP-compressed file.
  - ZSTD: a ZSTD-compressed file.
  - NONE: the file is not compressed. This is the default value.
  - SNAPPY_BLOCK: a Hadoop-Snappy-compressed file.
- PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. The default value is TRUE.
- PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold, in bytes, for parallel parsing. The default value is 256 MB.
- MAX_ROW_LENGTH: specifies the maximum length, in bytes, of a single row of data. The default value is 2 MB. This parameter is used in the boundary determination stage of parallel parsing of a single CSV file and does not affect the correctness of data reading.
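Here is a minimal sketch that combines several of these CSV options. The table name ext_csv_demo and the directory are placeholders; the sketch assumes the directory holds GZIP-compressed CSV files whose first line is a header and that use the string NULL for missing values:

-- Skip one header line, unquote double-quoted fields,
-- treat the string 'NULL' as NULL, and read GZIP-compressed files.
CREATE EXTERNAL TABLE ext_csv_demo (
  id int,
  name varchar(20),
  score int
)
LOCATION = 'file:///home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('NULL')
  COMPRESSION = GZIP
);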
FORMAT = (TYPE = 'PARQUET' ...): specifies the external file format as Parquet.

- TYPE: specifies the external file type as PARQUET.
- COLUMN_INDEX_TYPE: specifies the column index type. The supported values are 'POSITION' and 'NAME', which index columns by position and by name, respectively. This parameter is required when you use the external$filepos{N} syntax.

FORMAT = (TYPE = 'ORC' ...): specifies the external file format as ORC.

- TYPE: specifies the external file type as ORC.
- COLUMN_INDEX_TYPE: specifies the column index type. The supported value is 'POSITION', which indexes columns by position. This parameter is required when you use the external$filepos{N} syntax.
Note
Starting from OceanBase Database V4.6.0, the following features are supported for external tables in the Parquet and ORC file formats:

- Reading complex data types such as Array, Map, and JSON. The data types of the external table must be consistent with those supported by OceanBase Database. For more information, see Data types.
- Reading Bloom filters, which filter data based on query predicates to reduce unnecessary I/O and accelerate queries.
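For example, here is a minimal sketch that maps Parquet columns by name instead of by position. The table name ext_parquet_demo and the path are placeholders, and the sketch assumes the Parquet files contain columns named c1 and c2:

-- Columns are matched to Parquet columns with the same names.
CREATE EXTERNAL TABLE ext_parquet_demo (
  c1 INT,
  c2 VARCHAR(50)
)
LOCATION = 'file:///home/admin/parquet_data/'
FORMAT = (TYPE = 'PARQUET' COLUMN_INDEX_TYPE = 'NAME');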
TYPE = 'ODPS': specifies the external data source type as ODPS. The following parameters are also supported:

- ACCESSID: specifies the AccessKey ID of the Alibaba Cloud account for identity authentication.
- ACCESSKEY: specifies the AccessKey secret that corresponds to the AccessKey ID for identity authentication.
- ENDPOINT: specifies the endpoint of the ODPS service.
- PROJECT_NAME: specifies the name of the ODPS project to be accessed.
- SCHEMA_NAME: specifies the name of the schema in ODPS. This parameter is optional.
- TABLE_NAME: specifies the name of the target table in ODPS.
- QUOTA_NAME: specifies the name of the quota to be used. This parameter is optional.
- COMPRESSION_CODE: specifies the compression format of the data source. The supported values are ZLIB, ZSTD, LZ4, and ODPS_LZ4. If this parameter is not specified, compression is not enabled.
- API_MODE: specifies the API mode for calling ODPS. Valid values:
  - tunnel_api (default):
    - No special network configuration is required: this option is suitable for all deployment scenarios. OceanBase Database and MaxCompute do not need to be deployed in the same VPC (Virtual Private Cloud).
    - No additional MaxCompute permissions are required: only the AccessKey ID and AccessKey secret are required for authentication, and no MaxCompute Storage API permissions are needed.
    - Applicable scenarios:
      - OceanBase Database and MaxCompute are not deployed in the same VPC.
      - The MaxCompute Storage API is not enabled.
      - The latency requirements for data transmission are not high.
  - storage_api:
    - Network dependency: OceanBase Database and MaxCompute must be deployed in the same VPC to achieve low-latency, high-throughput data transmission.
    - Permission dependency: you must enable the Storage API in MaxCompute and make sure that the AccessKey has the corresponding permissions.
    - Applicable scenarios:
      - OceanBase Database and MaxCompute are deployed in the same VPC.
      - The MaxCompute Storage API is enabled.
      - The data volume is extremely large or the real-time requirements are high.
- SPLIT: specifies the unit for task splitting when storage_api is used. Valid values: byte and row. If the data size of each row in a table varies greatly, set this parameter to byte; otherwise, set it to row.
PATTERN: specifies a regular expression pattern to filter files in the LOCATION directory. For each file in the LOCATION directory, the external table can access the file if it matches the pattern; otherwise, the file is skipped. If this parameter is not specified, the external table can access all files in the LOCATION directory by default.
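For example, the following sketch restricts an external table to the files whose names end with csv in a directory. The table name ext_csv_only and the path are placeholders:

-- Only file names matching the regular expression csv$ are accessible.
CREATE EXTERNAL TABLE ext_csv_only (
  id int,
  name varchar(20)
)
LOCATION = 'file:///home/admin/mixed_files/'
FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',')
PATTERN = 'csv$';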
Example of creating an external table in the CSV format

Assume that a data.csv file is stored in the /home/admin/oceanbase/ directory on the local machine. The content of the file is as follows:

1,"lin",98
2,"hei",90
3,"ali",95
Connect to the MySQL tenant of the cluster by using the local Unix socket on the OBServer node. Here is an example:

obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********

For more information about how to connect to OceanBase Database by using the local Unix socket, see secure_file_priv.
Configure the database to allow access to the /home/admin/oceanbase/ directory.

SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";

After the command is executed, you must start a new session for the modification to take effect.
Connect to the database again and create an external table named ext_t3.

CREATE EXTERNAL TABLE ext_t3 (
  id int,
  name char(10),
  score int
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can use the SHOW CREATE TABLE statement to view the table definition, just like a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ext_t3 | CREATE EXTERNAL TABLE `ext_t3` (
`id` int(11) GENERATED ALWAYS AS (metadata$filecol1),
`name` char(10) GENERATED ALWAYS AS (metadata$filecol2),
`score` int(11) GENERATED ALWAYS AS (metadata$filecol3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When you query an external table, the system reads the external files through the driver layer of the external table, parses the files in the specified format, converts the data to internal data types of OceanBase Database, and returns the data rows. Here is an example of querying the external table ext_t3.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| id | name | score |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
In addition, you can join an external table with a regular table. Assume that the current database contains a regular table named info, and the data in the table is as follows:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| ali  | female |    8 |
+------+--------+------+
3 rows in set
Here is an example of joining the external table ext_t3 with the regular table info.
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| ali  | female |    8 |
+------+--------+------+
2 rows in set
For more information about how to query data, see Read data.
Create an ODPS Java SDK external table
To create an ODPS Java SDK external table, OceanBase Database must be deployed in an environment that supports the Java SDK. For more information about how to deploy the Java SDK environment, see Deploy the OceanBase Database Java SDK environment.
Here is an example of creating an ODPS Java SDK external table:
CREATE EXTERNAL TABLE lineitem(
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber BIGINT,
l_quantity DECIMAL(15,2),
l_extendedprice DECIMAL(15,2),
l_discount DECIMAL(15,2),
l_tax DECIMAL(15,2),
l_returnflag CHAR(1),
l_linestatus CHAR(1),
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct CHAR(25),
l_shipmode CHAR(10),
l_comment VARCHAR(44))
PROPERTIES = (
TYPE = 'ODPS',
ACCESSID = '***********',
ACCESSKEY = '***********',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'bigdata_public_dataset',
SCHEMA_NAME = 'tpch_10g',
TABLE_NAME = 'lineitem',
QUOTA_NAME = '',
COMPRESSION_CODE = '',
API_MODE = "storage_api",
SPLIT = "byte"
);
Considerations for using external tables
External tables are read-only. You can query them, but you cannot perform DML operations such as INSERT, UPDATE, or DELETE on them.
If the external files accessed by an external table are deleted, the system does not report an error; instead, it returns an empty result set.
Because the files accessed by an external table are managed by an external storage system, querying the external table returns an error if the external storage is unavailable.
What to do next
When you create an external table, the system saves the list of files that match the PATTERN in the specified path of the LOCATION parameter in the system table of OceanBase Database. When you scan the external table, the system accesses the external files based on the list. If new files are added to the external directory, you must update the external table file list to add the new files. For more information, see Manage external files.
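For example, after new files are added to the directory, you can refresh the file list of the external table with a statement like the following sketch; see Manage external files for the exact syntax:

-- Rescan LOCATION and update the file list stored in the system table.
ALTER EXTERNAL TABLE ext_t3 REFRESH;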
After you create an external table, you can also drop it. The statement for dropping an external table is the same as that for dropping a regular table. You can use the DROP TABLE statement to drop an external table. For more information, see Drop a table.
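Here is an example:

DROP TABLE ext_t3;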
