Purpose
This statement is used to create an external table in a database.
An external table is a key feature of a database management system. Generally, the data of an ordinary table is stored in the storage space of the database, whereas an external table accesses data stored in external storage services.
When you create an external table, you must specify the file path and file format of the data. After the external table is created, you can use it to read data from the external storage service. An external table is read-only. You can use an external table in a query statement, but you cannot perform DML operations on an external table. An external table does not support constraints or indexes.
Syntax
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'file_name'
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS expr]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
type_orc_option:
TYPE = 'ORC'
properties_type_options:
type_odps_option
type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}
Parameters
| Parameter | Description |
|---|---|
| table_name | The name of the external table to be created. |
| column_name | The column name of the external table. By default, data columns in the file are automatically mapped to columns in the external table definition in order. |
| column_type | The column type of the external table. You cannot define constraints such as DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY by using this parameter. |
| AS expr | Specifies the column mapping manually. If the order of columns in the file differs from the order of columns in the external table definition, you can use the pseudo-column metadata$filecol{N} to map the Nth column in the file to the corresponding column in the external table. For example, c2 INT AS (metadata$filecol4) specifies that the c2 column in the external table corresponds to the fourth column in the file. Note that once you specify manual column mappings, automatic mapping is no longer available and all columns must be mapped manually (see the sketch after this table). |
| LOCATION = 'file_name' | The path where the external table file is stored. Usually, the data file of an external table is stored in a separate directory, which can contain subdirectories. When the external table is created, it will automatically collect all files in the directory. For more information, see file_name. |
| FORMAT = (format_type_options) | Specifies the attributes of the external file. Use TYPE to specify the file format: CSV, PARQUET, or ORC. TYPE cannot be empty. For more information, see format_type_options. |
| PROPERTIES = (properties_type_options) | Specifies the attributes of the external data source. Use TYPE to specify ODPS. TYPE cannot be empty. For more information, see properties_type_options. |
| PATTERN | Specifies a regular pattern string for filtering files in the LOCATION directory. For each file path in the LOCATION directory, if the file path matches the pattern string, the external table will access the file; otherwise, the external table will skip the file. If this parameter is not specified, all files in the LOCATION directory are accessible by default. The external table will store a list of files that match the LOCATION specified by PATTERN in the database system table. During a scan, the external table will access external files based on this list. |
| PARTITION_TYPE = USER_SPECIFIED | If you want to manually add and remove partitions instead of having the external table automatically manage partitions, specify the PARTITION_TYPE = USER_SPECIFIED field. |
| AUTO_REFRESH = 'xxx' | Specifies whether to enable automatic refresh for the external table. |
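To make the AS expr mapping concrete, the following is a minimal sketch of manual column mapping. The table name, file layout, and column names are hypothetical, and the local LOCATION path assumes it falls under secure_file_priv.

```sql
-- Hypothetical CSV layout: file column 1 = id, 2 = name, 3 = score, 4 = age.
-- Because manual mapping is used, every column must be mapped explicitly.
CREATE EXTERNAL TABLE ext_people (
    id    INT            AS (metadata$filecol1),
    name  VARCHAR(100)   AS (metadata$filecol2),
    age   INT            AS (metadata$filecol4),  -- out of order relative to the file
    score DECIMAL(10,2)  AS (metadata$filecol3)
  )
  LOCATION = '/home/admin/test'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
  )
  PATTERN = 'people.csv';
```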
file_name
The path for storing the external file can be in the following formats:
- When the file is stored locally, the `LOCATION` value is in the `LOCATION = '[file://] local_file_path'` format, where `local_file_path` can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database. `secure_file_priv` specifies the file path that the OBServer node has the privilege to access, and `local_file_path` can be only a subpath of `secure_file_priv`.
- When the file is stored remotely, the `LOCATION` value is in the following format:
  - When the file is stored in Alibaba Cloud OSS or AWS S3, the `LOCATION` value is in the `LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'` format. Here, `$ACCESS_ID`, `$ACCESS_KEY`, and `$HOST` are the access information required for accessing Alibaba Cloud OSS, AWS S3, or an object storage system that is compatible with the S3 protocol, and `s3_region` specifies the region information when you use S3. This sensitive access information is stored in the system tables of the database in an encrypted manner.
  - When the file is stored in HDFS, the `LOCATION` value is in the `LOCATION = 'hdfs://localhost:port/PATH'` format. Here, `localhost` specifies the address of HDFS, `port` specifies the port number of HDFS, and `PATH` specifies the directory path in HDFS.

    The format with Kerberos authentication is as follows: `LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'`. Here:
    - `principal`: specifies the login user.
    - `keytab`: specifies the key file for user authentication.
    - `krb5conf`: specifies the description file of the Kerberos environment used by the user.
    - `configs`: specifies additional HDFS parameters. By default, this parameter is left empty. However, if the Kerberos environment is used, this parameter usually has a value and needs to be configured. For example, you can set `dfs.data.transfer.protection=authentication,privacy` to specify that the data transmission protection level is `authentication` and `privacy`.
Notice
When you use an object storage path, separate the parameters of the object storage path with the `&` symbol. Make sure that the parameter values contain only uppercase and lowercase English letters, digits, the characters `/-_$+=`, and wildcard characters. If you specify other characters, the settings may fail.
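As an illustration of the remote LOCATION format, the following sketch creates an external table over an Alibaba Cloud OSS path. The table definition, directory, and the $ACCESS_ID, $ACCESS_KEY, and $HOST values are placeholders to be replaced with real values.

```sql
-- Placeholder credentials and host; for S3, append ":s3_region" after the host.
CREATE EXTERNAL TABLE orders_ext (
    order_id INT,
    amount   DECIMAL(10,2)
  )
  LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/orders_dir/'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
  )
  PATTERN = '.*csv';
```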
format_type_options
- `TYPE = 'CSV'`: specifies the format of the external file as CSV. It also includes the following fields:
  - `LINE_DELIMITER`: specifies the line delimiter for the CSV file. The default value is `LINE_DELIMITER='\n'`.
  - `FIELD_DELIMITER`: specifies the field delimiter for the CSV file. The default value is `FIELD_DELIMITER='\t'`.
  - `ESCAPE`: specifies the escape character for the CSV file. It can be only one byte. The default value is `ESCAPE='\'`.
  - `FIELD_OPTIONALLY_ENCLOSED_BY`: specifies the characters that enclose the field values in the CSV file. By default, this parameter is left empty.

    Notice

    If the external data file contains NULL values (actual NULL values, not the string "NULL"), you must explicitly configure the `FIELD_OPTIONALLY_ENCLOSED_BY` parameter, and the value of this parameter cannot be empty.
  - `ENCODING`: specifies the character set for the file. For the character sets supported in MySQL mode, see Character sets. If this parameter is not specified, the default value UTF8MB4 is used.
  - `NULL_IF`: specifies the strings to be treated as NULL values. By default, this parameter is left empty.
  - `SKIP_HEADER`: specifies to skip the file header and the number of lines to skip.
  - `SKIP_BLANK_LINES`: specifies whether to skip blank lines. The default value is `FALSE`, which specifies not to skip blank lines.
  - `TRIM_SPACE`: specifies whether to remove leading and trailing spaces from fields in the file. The default value is `FALSE`, which specifies not to remove them.
  - `EMPTY_FIELD_AS_NULL`: specifies whether to treat empty strings as NULL values. The default value is `FALSE`, which specifies not to treat empty strings as NULL values.
  - `IGNORE_LAST_EMPTY_COLUMN`: specifies whether to ignore the empty field at the end of a line when a field delimiter appears immediately before the line delimiter. The default value is `TRUE`, which specifies that the last empty field is ignored.

    Note

    `IGNORE_LAST_EMPTY_COLUMN` is supported starting from OceanBase Database V4.3.5 BP2.
- `TYPE = 'PARQUET'`: specifies the format of the external file as `PARQUET`.
- `TYPE = 'ORC'`: specifies the format of the external file as `ORC`.
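As a combined illustration of these CSV options, the sketch below defines a hypothetical table that skips a one-line header, treats the strings 'NULL' and 'null' as NULL values, trims surrounding spaces, and skips blank lines. The table name, columns, and paths are assumptions.

```sql
CREATE EXTERNAL TABLE sales_ext (
    sale_date DATE,
    region    VARCHAR(50),
    revenue   DECIMAL(12,2)
  )
  LOCATION = '/home/admin/sales'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    SKIP_HEADER = 1              -- skip the first (header) line
    NULL_IF = ('NULL', 'null')   -- strings to be read as NULL
    TRIM_SPACE = TRUE            -- strip leading and trailing spaces
    SKIP_BLANK_LINES = TRUE      -- ignore empty lines
  )
  PATTERN = 'sales.*csv';
```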
properties_type_options
- `TYPE = 'ODPS'`: specifies that the external data source is in the `ODPS` format. It also contains the following fields:
  - `ACCESSID`: specifies the AccessKey ID of an Alibaba Cloud account for identity authentication.
  - `ACCESSKEY`: specifies the AccessKey Secret that corresponds to the AccessKey ID, for identity verification.
  - `ENDPOINT`: specifies the connection address of the ODPS service.
  - `PROJECT_NAME`: specifies the name of the target ODPS project.
  - `SCHEMA_NAME`: optional. Specifies the schema name in ODPS.
  - `TABLE_NAME`: specifies the name of the target table in ODPS.
  - `QUOTA_NAME`: optional. Specifies the quota to be used.
  - `COMPRESSION_CODE`: optional. Specifies the compression format of the data source. Valid values: `ZLIB`, `ZSTD`, `LZ4`, and `ODPS_LZ4`. If this parameter is not specified, no compression is performed.
  - `API_MODE`: specifies the API mode for accessing ODPS. Valid values:

    Note

    For OceanBase Database V4.3.5, `API_MODE` and `SPLIT` are supported starting from V4.3.5 BP3.

    - `tunnel_api` (default value):
      - No special network configuration is required: applicable to all deployment scenarios, regardless of whether OceanBase Database and MaxCompute are in the same VPC (Virtual Private Cloud).
      - No additional MaxCompute permissions are required: only the AccessKey ID and AccessKey Secret are needed for authentication, without the need to enable the MaxCompute Storage API.
      - Applicable scenarios:
        - OceanBase Database and MaxCompute are not deployed in the same VPC.
        - The MaxCompute Storage API is not enabled.
        - Data transmission does not have strict latency requirements.
    - `storage_api`:
      - Network dependency: requires OceanBase Database and MaxCompute to be deployed in the same VPC to ensure low-latency, high-throughput data transfer.
      - Permission dependency: requires the Storage API to be enabled in MaxCompute, and the AccessKey must have the corresponding permissions.
      - Applicable scenarios:
        - OceanBase Database and MaxCompute are in the same VPC.
        - The MaxCompute Storage API is enabled.
        - Extremely large data volumes or high real-time requirements.
  - `SPLIT`: specifies the method for task splitting when `storage_api` is used, distributing tasks to threads by `byte` or by `row`. If the byte sizes of rows in a table vary significantly, set `SPLIT` to `byte`. Otherwise, set it to `row`.
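Putting these properties together, the ODPS-related PROPERTIES clause of a CREATE EXTERNAL TABLE statement might look like the following sketch. Only the clause is shown; all values are placeholders, and the full statement also needs the table name and column definitions described in the Syntax section.

```sql
-- Fragment: the PROPERTIES clause only, with placeholder values.
PROPERTIES = (
    TYPE = 'ODPS',
    ACCESSID = '<your_access_id>',      -- AccessKey ID
    ACCESSKEY = '<your_access_key>',    -- AccessKey Secret
    ENDPOINT = '<odps_endpoint>',       -- connection address of the ODPS service
    PROJECT_NAME = '<project>',
    SCHEMA_NAME = '<schema>',           -- optional
    TABLE_NAME = '<table>',
    API_MODE = "tunnel_api"             -- or "storage_api" when in the same VPC with the Storage API enabled
)
```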
Considerations
- When an external file is deleted, the file will no longer be in the list of external files accessed by the external table. In this case, the external table will ignore the missing file.
- When an external file is modified, the external table accesses the latest data of the external file. If the external file is modified and the external table queries are performed in parallel, the results may not be as expected. Therefore, avoid modifying the external file while the external table is being queried.
- When a new file is added to the external directory, the external table does not read it automatically, because it accesses only the files recorded in its file list. To make the new file visible, you must update the external file list (see the sketch after this list).
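In OceanBase Database, the file list can typically be refreshed with an ALTER EXTERNAL TABLE ... REFRESH statement, as sketched below for a hypothetical table named my_ext_tbl; confirm the exact syntax for your version.

```sql
-- Rescan the LOCATION directory and update the stored file list
-- so that newly added files become visible to subsequent queries.
ALTER EXTERNAL TABLE my_ext_tbl REFRESH;
```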
Example
Example 1
Prepare data. First, set the `secure_file_priv` path to `/home/admin/` and place the CSV file `extdata.csv` that contains the external table data in the `/home/admin/test` directory of the current local OBServer node.

Here is an example of setting the global security path:

```shell
obclient> SET GLOBAL secure_file_priv = "/home/admin/";
Query OK, 0 rows affected

obclient> \q
Bye
```

Note

Since `secure_file_priv` is a `GLOBAL` variable, you must execute `\q` to exit and reconnect for the setting to take effect.

The content of the CSV file is as follows:

```shell
1,'Dave','Smith','dsmith@outlook.com','friend',32
2,'Xena','Johnson','xjonson@outlook.com','contact',45
3,'Fred','Jackon','fjackson@outlook.com','co-worker',19
4,'Alma','Tyler','atyler@outlook.com','friend',53
```

Log in to the database as a user tenant and create an external table named `contacts`.

```shell
obclient> CREATE EXTERNAL TABLE contacts (
    id        INT,
    firstname VARCHAR(100),
    lastname  VARCHAR(100),
    email     VARCHAR(255),
    category  CHAR(30),
    age       NUMBER
  )
  LOCATION = '/home/admin/test'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'extdata.csv';
```

Query the data in the `contacts` external table.

```shell
obclient> SELECT * FROM contacts;
+----+-----------+----------+----------------------+-----------+------+
| id | firstname | lastname | email                | category  | age  |
+----+-----------+----------+----------------------+-----------+------+
|  1 | Dave      | Smith    | dsmith@outlook.com   | friend    |   32 |
|  2 | Xena      | Johnson  | xjonson@outlook.com  | contact   |   45 |
|  3 | Fred      | Jackon   | fjackson@outlook.com | co-worker |   19 |
|  4 | Alma      | Tyler    | atyler@outlook.com   | friend    |   53 |
+----+-----------+----------+----------------------+-----------+------+
4 rows in set
```
Example 2
Assume that a CSV file exists at a specific path on HDFS. The content of the file is as follows:
```shell
$ hdfs dfs -cat /user/test_tbl1.csv
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'
```

Create an HDFS external table.
If Kerberos authentication is not enabled in the HDFS environment
```sql
CREATE EXTERNAL TABLE test_tbl1_csv_mysql (
    id INT,
    name VARCHAR(50),
    c_date DATE
  )
  LOCATION = 'hdfs://${hadoop namenode hostname}:${hadoop namenode port}/user'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'test_tbl1.csv';
```

Notice

Here, `hadoop namenode hostname` and `hadoop namenode port` refer to the host name and port of the HDFS NameNode. You must replace them with the actual values.

If Kerberos authentication is enabled in the HDFS environment
Environment preparation

Since the current HDFS integration uses the JNI framework, the environment must support deploying a Java environment. If the OBServer is deployed on multiple nodes, each OBServer node must be configured with the following settings; you cannot configure only one node.

Step 1: Deploy the Java environment

You can download the latest version of OpenJDK 8 from openjdk.

Step 2: Decompress the installation package

After decompressing it, you will find the following files:

```shell
$ ls $JAVA_HOME
ASSEMBLY_EXCEPTION  bin  include  jre  lib  LICENSE  man  NOTICE  release  sample  src.zip  THIRD_PARTY_README
```

Step 3: Deploy the libhdfs.so dynamic library

You can download the required dependency from https://mirrors.aliyun.com/oceanbase/. After downloading, run the following command to install it:

```shell
sudo rpm -Uvh devdeps-hdfs-sdk-3.3.6-xxxxx.xxx.xxx.rpm
```

Run the following command to check whether the installation was successful. The installation must include the `libhdfs.so` and `libhdfs.so.0.0.0` files, and the corresponding soft link must be working.

```shell
$ ll /usr/local/oceanbase/deps/devel/lib
total 376
lrwxrwxrwx 1 root root     16 Dec 24 19:49 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x 1 root root 384632 Dec 24 19:09 libhdfs.so.0.0.0
```

Step 4: Deploy the path of the required jar package

Download the jar package: access the Aliyun OceanBase Mirror to download the jar package.

Notice

- For OceanBase Database V4.3.5 BP1 and earlier: download the devdeps-java-extensions RPM installation package of version 1.0.0.
- For OceanBase Database V4.3.5 BP2 and later: download the devdeps-java-extensions RPM installation package of version 1.0.1.
- For OceanBase Database V4.4.0: download the devdeps-java-extensions RPM installation package of version 1.0.1.

Install the jar package: after downloading the jar package, run the following command to decompress and install it:

```shell
sudo rpm -Uvh devdeps-java-extensions-x.x.x-xxxxxxxxxxxx.xxx.xxxxxx.rpm
```

Check whether the installation was successful:

```shell
$ ll /home/admin/oceanbase/jni_packages
total 52756
drwxr-sr-x 4 root root     4096 Dec 24 20:25 hadoop
drwxr-xr-x 3 root root     4096 Dec 24 20:25 lib
-rw-r--r-- 1 root root 54008720 Dec 24 19:52 oceanbase-odps-connector-jar-with-dependencies.jar
```
Start the OBServer
Before using the HDFS external table, you must configure the OBServer node. The configuration is as follows:

Note

Due to the limitations of the JNI-related configurations supported by the OBServer, you must restart the OBServer for the changes to take effect. All the following configurations are cluster-level settings. Configure them once; there is no need to configure each node separately.

1. Execute the following command as the sys tenant to enable the Java environment for the OBServer:

   ```sql
   alter system set ob_enable_java_env=true;
   ```

2. Execute the following command to set the Java home path for the OBServer:

   ```sql
   alter system set ob_java_home="/home/user/openjdk/jdk8u422-b05";
   ```

   Note

   The path is the installation path of OpenJDK.

3. Set the relevant configurations for the Java environment startup:

   Notice

   Changes to this configuration require a restart of the OBServer to take effect. Since the current HDFS integration involves directly copying the HDFS data stream to the C++ memory heap, you can appropriately reduce the `-Xmx2048m` and `-Xms2048m` settings.

   ```sql
   alter system set ob_java_opts="-Djdk.lang.processReaperUseDefaultStackSize=true -XX:+HeapDumpOnOutOfMemoryError -Xmx2048m -Xms2048m -Xloggc:/home/user/jvmlogs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/user/jvmlogs/heapdumps/ -XX:+UseG1GC -XX:-CriticalJNINatives";
   ```

   Note

   Here, `/home/user/jvmlogs` and `/home/user/jvmlogs/heapdumps` are the paths for generating log files. You must manually create these directories.

4. Set the available LD_LIBRARY_PATH:

   Note

   The JNI SDK supported by the OBServer is an extended feature. Therefore, it is loaded dynamically, and you must configure the path of the dynamic library by setting LD_LIBRARY_PATH. This configuration must be applied to all OBServer nodes.

   ```shell
   $ vim ~/.bashrc
   export LD_LIBRARY_PATH=/home/user/openjdk/jdk8u422-b05/jre/lib/amd64/server:/usr/local/oceanbase/deps/devel/lib
   ```
Create an HDFS external table
Notice
When Kerberos authentication is configured, you must deploy the keytab file and the krb5conf file on the OBServer node.
```sql
CREATE EXTERNAL TABLE partsupp (
    PS_PARTKEY    INTEGER,
    PS_SUPPKEY    INTEGER,
    PS_AVAILQTY   INTEGER,
    PS_SUPPLYCOST DECIMAL(15,2),
    PS_COMMENT    VARCHAR(199)
  )
  LOCATION = 'hdfs://localhost:8020/tpch_csv?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = '|'
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  )
  PATTERN = 'partsupp.tbl';
```
Query the data in the external table.
```sql
select * from test_tbl1_csv_mysql;
```

The return result is as follows:

```shell
+----+----------+------------+
| id | name     | c_date     |
+----+----------+------------+
|  1 | Emma     | 2021-09-01 |
|  2 | William  | 2021-09-02 |
|  3 | Olivia   | 2021-09-03 |
+----+----------+------------+
3 rows in set
```