You can execute the CREATE EXTERNAL TABLE statement to create an external table. When you create an external table, you must specify the path and format of the data files so that data can be read from the external files.
Privilege requirements
To create an external table, the current user must have the CREATE privilege. For information about how to view the privileges of the current user, see View user privileges.
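For example, a minimal sketch of granting and then checking the CREATE privilege, assuming a hypothetical database named test and a hypothetical user named ext_user:
-- Grant the CREATE privilege on the test database to ext_user (hypothetical names).
GRANT CREATE ON test.* TO ext_user;
-- View the privileges granted to ext_user.
SHOW GRANTS FOR ext_user;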
Create an external table
The following example shows the SQL statement for creating an external table:
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'file_name'
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS (metadata$filecol{N})]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
type_orc_option:
TYPE = 'ORC'
properties_type_options:
type_odps_option
type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>'
ACCESSKEY = '<string>'
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}
The following table describes the parameters in the preceding statement.
col_name col_type [AS (metadata$filecol{N})]: specifies the column definition. AS (metadata$filecol{N}) manually specifies the column mapping. The column types supported by an external table are the same as those supported by a regular table. For more information about the data types supported by OceanBase Database in MySQL mode, see Data types.
By default, the data columns in the external file are automatically mapped to the columns defined in the external table in sequence. That is, the first column of the external table is automatically mapped to the first column of the external file.
For example, in the following statement, the C1 column of the ext_t1 external table is automatically mapped to the first column of the external file, and the C2 column is automatically mapped to the second column of the external file.
CREATE EXTERNAL TABLE ext_t1 (
C1 int,
C2 int
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = '|'
);
If the column order in the external file is different from that in the external table, you can use a pseudo column in the format of metadata$filecol{N} to map a column in the external table to the Nth column in the external file. In this case, the columns in the file are numbered starting from 1.
For example, in the following statement, C1 int AS (metadata$filecol2) indicates that the C1 column of the ext_t2 external table is mapped to the second column in the external file, and C2 int AS (metadata$filecol4) indicates that the C2 column of the ext_t2 external table is mapped to the fourth column in the external file.
CREATE EXTERNAL TABLE ext_t2 (
C1 int AS (metadata$filecol2),
C2 int AS (metadata$filecol4)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = '|'
);
Notice
If you manually define column mappings, the automatic column mapping feature will be disabled, and all columns must be manually defined.
LOCATION = 'file_name': specifies the path where the external files are stored. Usually, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, the table automatically collects all files in the directory. The following two formats are supported:
Local Location format:
LOCATION = '[file://]local_file_path'
local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.
Notice
local_file_path must be a directory, not a file. If you want to specify a single file, specify the parent directory of the file in LOCATION and use the PATTERN attribute to select the file.
When you use the Local Location format and configure the system variable secure_file_priv to specify the file path that OceanBase Database can access, secure_file_priv must be the parent directory of local_file_path, that is, local_file_path must be a subdirectory of secure_file_priv.
The system variable secure_file_priv specifies the path that OceanBase Database can access when importing or exporting data to or from a file. For more information about secure_file_priv, see secure_file_priv.
Remote Location format:
Notice
When you use an object storage path, the parameters of the object storage path are separated by the & symbol. Make sure that your input parameter values contain only uppercase and lowercase letters, digits, the characters /-_$+=, and wildcards. If you enter other characters, the configuration may fail.
If the file is stored in OSS or S3, the format is LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'. In this format, $ACCESS_ID, $ACCESS_KEY, and $HOST are the access information required to access Alibaba Cloud OSS, AWS S3, or an object storage service compatible with the S3 protocol. s3_region is the region selected when you use S3. This sensitive access information is stored in the system tables of the database in encrypted form.
If the file is stored in HDFS, the following formats are supported:
Single-node NameNode (NN) access format:
LOCATION = 'hdfs://localhost:port/PATH', where localhost specifies the address of HDFS, port specifies the port number of HDFS, and PATH specifies the file path in HDFS.
Kerberos authentication format:
LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx', where:
principal: specifies the user for login and authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy specifies authentication and privacy as the data transmission protection levels.
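For example, a hypothetical Kerberos-authenticated external table might look like the following sketch; the host name, port, file path, principal, and key file paths are placeholders that must match your own HDFS and Kerberos environment:
CREATE EXTERNAL TABLE ext_hdfs (
c1 int,
c2 varchar(20)
)
LOCATION = 'hdfs://nn01.example.com:9000/data/ext/?principal=hdfs/nn01.example.com@EXAMPLE.COM&keytab=/etc/security/keytabs/hdfs.keytab&krb5conf=/etc/krb5.conf&configs=dfs.data.transfer.protection=authentication'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
);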
Hadoop High Availability (HA) logical naming service access format:
LOCATION = 'hdfs://nameserviceID/PATH', where nameserviceID specifies the ID of the Hadoop HA logical naming service, and PATH specifies the file path.
Note
Make sure that the OBServer node that acts as the client is configured with the nameservice definition and failover strategy of the HA cluster.
For Kerberos authentication, the format is as follows:
LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', where:
principal: specifies the login user. Set it to the principal of the non-primary NameNode (NN).
keytab and krb5conf: set them in the same way as for a single-node NN.
configs: specifies additional HDFS configurations. You can set only HA configurations and security configurations.
dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs in the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the address of namenode1 to facilitate client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the address of namenode2 to facilitate client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies that the HA cluster automatically obtains a valid NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for switching between the primary and standby nodes in the HA cluster. You can also customize and upload the logic required by the HA cluster.
Notice
HA configurations are bound to the nameservice. For example, in the following sample, mycluster is the nameservice. Make sure to configure the related parameters properly.
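The following hypothetical sample shows such an HA LOCATION, where mycluster is the nameservice and nn1 and nn2 are the NameNode IDs; all addresses, paths, and principals are placeholders for illustration only:
CREATE EXTERNAL TABLE ext_hdfs_ha (
c1 int,
c2 varchar(20)
)
LOCATION = 'hdfs://mycluster/data/ext/?principal=hdfs/nn1.example.com@EXAMPLE.COM&keytab=/etc/security/keytabs/hdfs.keytab&krb5conf=/etc/krb5.conf&configs=dfs.data.transfer.protection=authentication#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=nn1.example.com:8020#dfs.namenode.rpc-address.mycluster.nn2=nn2.example.com:8020#dfs.ha.automatic-failover.enabled.mycluster=true#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
);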
TYPE = 'CSV': specifies the external file format as CSV.
TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the file. The default value is LINE_DELIMITER = '\n' if this parameter is not specified.
FIELD_DELIMITER: specifies the column delimiter of the file. The default value is FIELD_DELIMITER = '\t' if this parameter is not specified.
ESCAPE: specifies the escape character of the file. For example, ESCAPE = '*' specifies the asterisk (*) as the escape character, replacing the default escape character (\). The default value is ESCAPE = '\' if this parameter is not specified.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character used to enclose field values in the file. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' specifies that values are enclosed in double quotation marks. The default value is an empty string if this parameter is not specified.
Notice
If the external file contains NULL values (non-string NULL, that is, not "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and the value of this parameter cannot be empty.
ENCODING: specifies the character set encoding format of the file. For information about all character sets supported in MySQL mode, see Character sets. The default value is UTF8MB4 if this parameter is not specified.
NULL_IF: specifies the strings that are treated as NULL. The default value is an empty string if this parameter is not specified.
SKIP_HEADER: specifies the number of header rows to skip. By default, no header rows are skipped if this parameter is not specified.
SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE if this parameter is not specified.
TRIM_SPACE: specifies whether to remove leading and trailing spaces in fields. The default value is FALSE if this parameter is not specified.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE if this parameter is not specified.
IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last empty field at the end of a line. The default value is TRUE, which indicates that the last empty field is ignored.
Note
For OceanBase Database V4.3.5, the IGNORE_LAST_EMPTY_COLUMN parameter is supported starting from V4.3.5 BP2.
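To illustrate several of these options together, the following sketch (with a placeholder OSS path and hypothetical table and column names) skips one header row, trims leading and trailing spaces, and treats the string NULL as a NULL value:
CREATE EXTERNAL TABLE ext_csv_opts (
id int,
name varchar(50)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/sales_data/'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
SKIP_HEADER = 1
TRIM_SPACE = TRUE
NULL_IF = ('NULL')
);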
TYPE = 'PARQUET': specifies the external file format as PARQUET.
TYPE = 'ORC': specifies the external file format as ORC.
TYPE = 'ODPS': specifies the external file format as ODPS. The following fields are also supported:
ACCESSID: specifies the AccessKey ID of the Alibaba Cloud account for identity authentication.
ACCESSKEY: specifies the AccessKey secret corresponding to the AccessKey ID for identity verification.
ENDPOINT: specifies the endpoint of the ODPS service.
PROJECT_NAME: specifies the name of the target ODPS project.
SCHEMA_NAME: optional. Specifies the name of the schema in ODPS.
TABLE_NAME: specifies the name of the target table in ODPS.
QUOTA_NAME: optional. Specifies the quota to be used.
COMPRESSION_CODE: optional. Specifies the compression format of the data source. Valid values: ZLIB, ZSTD, LZ4, and ODPS_LZ4. If you do not specify this parameter, compression is not enabled.
API_MODE: specifies the API mode for calling the ODPS API. Valid values:
Note
For OceanBase Database V4.3.5, the API_MODE and SPLIT parameters are supported starting from V4.3.5 BP3.
tunnel_api (default):
No special network configuration is required: this option is applicable to all deployment scenarios, and OceanBase Database does not need to be deployed in the same VPC as MaxCompute.
No additional MaxCompute permissions are required: only the AccessID and AccessKey are needed for authentication, and the MaxCompute Storage API permission does not need to be enabled.
Applicable scenarios:
- OceanBase Database is not deployed in the same VPC as MaxCompute.
- The MaxCompute Storage API permission is not enabled.
- The latency requirements for data transmission are not high.
storage_api:
Network dependency: OceanBase Database and MaxCompute must be deployed in the same VPC to achieve low-latency, high-throughput data transmission.
Permission dependency: The Storage API permission must be enabled in MaxCompute, and the AccessKey must have the corresponding permissions.
Applicable scenarios:
- OceanBase Database and MaxCompute are deployed in the same VPC.
- The MaxCompute Storage API permission is enabled.
- The data volume is extremely large or the real-time requirement is high.
SPLIT: specifies whether to split tasks by byte or by row when the storage_api mode is used. If the data size of each row in a table varies significantly, set SPLIT to byte. Otherwise, set it to row.
PATTERN: specifies the regular expression pattern used to filter files in the LOCATION directory. For each file in the LOCATION directory, if the file name matches the pattern, the external table can access the file; otherwise, the file is skipped. If you do not specify this parameter, all files in the LOCATION directory are accessible.
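For example, the following sketch restricts an external table to files whose names end with .csv; the OSS path, credentials, and table and column names are placeholders:
CREATE EXTERNAL TABLE ext_only_csv (
c1 int,
c2 varchar(20)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/sales_data/'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
)
PATTERN = '.*\.csv';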
The following example shows how to create an external table that reads a local CSV file. Assume that the data.csv file is stored in the /home/admin/oceanbase/ directory on the local machine. The content of the file is as follows:
1,"lin",98
2,"hei",90
3,"ali",95
Connect to the MySQL tenant of the cluster by using a local Unix socket on the OBServer node as the tenant administrator.
The following example shows how to connect to the MySQL tenant:
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********
For more information about how to connect to OceanBase Database through a local Unix socket, see secure_file_priv.
Configure the /home/admin/oceanbase/ directory as the accessible path for the database.
SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";
After the command is executed successfully, you must restart the session for the modification to take effect.
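Optionally, after you reconnect in the next step, you can confirm that the new value is in effect. A minimal check, assuming the setting above:
SHOW GLOBAL VARIABLES LIKE 'secure_file_priv';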
Connect to the database again and create the ext_t3 external table.
CREATE EXTERNAL TABLE ext_t3 (
id int,
name char(10),
score int
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can use the SHOW CREATE TABLE statement to view the table definition, just like with a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ext_t3 | CREATE EXTERNAL TABLE `ext_t3` (
`id` int(11) GENERATED ALWAYS AS (metadata$filecol1),
`name` char(10) GENERATED ALWAYS AS (metadata$filecol2),
`score` int(11) GENERATED ALWAYS AS (metadata$filecol3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When querying an external table, the system directly reads the external file through the external table driver, parses the file according to its format, converts the data into internal data types of OceanBase Database, and returns the data rows. The following example shows how to query the ext_t3 external table.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| id | name | score |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
In addition, you can perform combined queries between an external table and a regular table. Assume that there is a regular table named info in the current database, and the table contains the following data:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| li | female | 8 |
+------+--------+------+
3 rows in set
The following example shows how to perform a combined query between the ext_t3 external table and the info regular table.
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| li | female | 8 |
+------+--------+------+
2 rows in set
For more information about how to query data, see Read data.
Create an ODPS Java SDK external table
To create an ODPS Java SDK external table, you must use OceanBase Database in an environment where the Java SDK is installed. For more information about how to deploy the Java SDK environment, see Deploy the Java SDK environment.
Here is an example of creating an ODPS Java SDK external table:
CREATE EXTERNAL TABLE lineitem(
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber BIGINT,
l_quantity DECIMAL(15,2),
l_extendedprice DECIMAL(15,2),
l_discount DECIMAL(15,2),
l_tax DECIMAL(15,2),
l_returnflag CHAR(1),
l_linestatus CHAR(1),
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct CHAR(25),
l_shipmode CHAR(10),
l_comment VARCHAR(44))
PROPERTIES = (
TYPE = 'ODPS'
ACCESSID = '***********'
ACCESSKEY = '***********'
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'bigdata_public_dataset',
SCHEMA_NAME = 'tpch_10g',
TABLE_NAME = 'lineitem',
QUOTA_NAME = '',
COMPRESSION_CODE = '',
API_MODE = "storage_api",
SPLIT = "byte"
);
Considerations
You can only query an external table. You cannot perform DML operations on it.
When you query an external table, if the external files that the external table accesses have been deleted, the system does not return an error. Instead, the query returns an empty result.
If the external storage system that manages the external files is unavailable, an error will be returned when you query the external table.
What to do next
When you create an external table, the system scans the path specified by the LOCATION parameter and saves the list of files that match the PATTERN in a system table of OceanBase Database. When the external table is scanned, it accesses the external files based on this list. If new files are added to the external directory, you need to update the external table file list so that these files are visible to the external table. For more information, see Manage external files.
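As a sketch of that refresh step, assuming the ALTER EXTERNAL TABLE ... REFRESH statement described in Manage external files applies to your version, updating the file list of the ext_t3 table might look like this:
ALTER EXTERNAL TABLE ext_t3 REFRESH;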
After you create an external table, you can also delete it. The statement for deleting an external table is the same as that for deleting a regular table. You can use the DROP TABLE statement to delete an external table. For more information, see Delete a table.
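For example, to delete the ext_t3 external table created in this topic:
DROP TABLE ext_t3;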
