Purpose
The CREATE EXTERNAL TABLE statement creates a new external table in the database.
External tables are a key feature of database management systems. Regular tables store their data in the database's own storage space, whereas external tables keep their data in an external storage service.
When creating an external table, you need to define the file path and file format for the data. Once created, users can query the external table to read data from the external storage service. External tables are read-only and can be used in query statements, but they do not support DML operations. External tables also do not support defining constraints or creating indexes.
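For orientation, here is a minimal sketch of that workflow. The directory, file, table, and column names are hypothetical; complete, runnable examples appear in the Examples section below.

-- Define the columns, the file path, and the file format of the external data.
CREATE EXTERNAL TABLE events_ext (
  event_id   INT,
  event_name VARCHAR(100)
)
LOCATION = '/home/admin/ext_events'   -- directory that holds the data files
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = 'events.csv';

-- External tables are read-only: they can be queried, but DML is not supported.
SELECT * FROM events_ext;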
Privilege requirements
To execute the CREATE EXTERNAL TABLE statement, the current user must have the CREATE privilege. For more information about OceanBase Database privileges, see Privilege types in MySQL-compatible mode.

When you create an external table by referencing a Location object, you must also have the READ privilege on that object. Here is an example:

Grant the user001 user the READ privilege on the test_location_local object:

GRANT READ ON LOCATION test_location_local TO user001 WITH GRANT OPTION;
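For illustration, a minimal sketch of how user001 could then reference that location object in the LOCATION clause (the @location_name['/path'] form from the syntax below); the table name, columns, and subdirectory are hypothetical.

CREATE EXTERNAL TABLE t_loc_ext (
  c1 INT,
  c2 VARCHAR(50)
)
LOCATION = @test_location_local['/data']   -- requires the READ privilege granted above
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = '.*\.csv';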
Syntax
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = {'file_path' | @location_name['/path']}
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS metadata$filecol{N}]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
type_orc_option:
TYPE = 'ORC'
properties_type_options:
type_odps_option
type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>'
ACCESSKEY = '<string>'
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}
Parameters
| Parameter | Description |
|---|---|
| table_name | The name of the external table to be created. |
| column_name | The name of a column in the external table. By default, the data columns in the file are mapped to the columns defined in the external table in sequence. |
| column_type | The data type of the external table's columns. Constraints such as DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY cannot be defined. |
| AS metadata$filecol{N} | Manually specifies the column mapping. When the order of columns in the file does not match the order of columns defined in the external table, you can use the pseudo column metadata$filecol{N} to map a column of the external table to the Nth column in the file. For example, c2 INT AS (metadata$filecol4) indicates that the c2 column of the external table corresponds to the 4th column in the file. Note that once manual column mapping is specified, automatic mapping is disabled and all columns must be mapped manually, as shown in the sketch after this table. |
| LOCATION | Specifies the path where the external table files are stored. Typically, the data files of an external table are stored in a separate directory, which may contain subdirectories. When the table is created, the external table automatically collects all files in this directory. For the supported path formats, see file_path below. |
| FORMAT = (format_type_options) | Specifies the properties of the external file format. Use TYPE to set the file format to CSV, PARQUET, or ORC. The TYPE parameter cannot be empty. For more information, see format_type_options. |
| PROPERTIES = (properties_type_options) | Specifies the properties of the external data source. Use TYPE to set the source type to ODPS. The TYPE parameter cannot be empty. For more information, see properties_type_options. |
| PATTERN | Specifies a regular expression pattern that filters files in the LOCATION directory. The external table accesses a file only if its path matches the pattern; otherwise, the file is skipped. If this parameter is not specified, the external table accesses all files in the LOCATION directory. When the LOCATION path is specified, the list of files that match PATTERN is saved in a database system table, and external table scans access the external files based on this list. |
| PARTITION_TYPE = USER_SPECIFIED | Specify this parameter if you want to add and delete partitions manually instead of letting the external table manage partitions automatically. |
| AUTO_REFRESH = 'xxx' | Specifies whether to enable automatic refresh for the external table. |
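As referenced in the AS metadata$filecol{N} row above, here is a minimal sketch of manual column mapping; the table, directory, and file names are hypothetical, and every column is mapped explicitly because specifying any manual mapping disables automatic mapping.

CREATE EXTERNAL TABLE orders_ext (
  order_id   INT           AS (metadata$filecol1),   -- 1st column in the file
  order_date DATE          AS (metadata$filecol2),   -- 2nd column in the file
  amount     DECIMAL(10,2) AS (metadata$filecol3)    -- 3rd column in the file
)
LOCATION = '/home/admin/ext_orders'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = 'orders.csv';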
file_path
The file path of an external table is specified in the following formats:
- If the file is stored locally, the LOCATION clause is specified in the LOCATION = '[file://] local_file_path' format. In this case, local_file_path can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database. The secure_file_priv parameter specifies the file path that the OBServer node can access, and local_file_path must be a subpath of the path specified by secure_file_priv.

- If the file is stored remotely, the LOCATION clause is specified in the following formats:

  Notice

  When you use an object storage path, the parameters of the object storage path are separated by the & symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, /-_$+=, and wildcards. If you enter other characters, the settings may fail.

  - If the file is stored on Alibaba Cloud OSS or AWS S3, the LOCATION clause is specified in the LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path' format. In this case, $ACCESS_ID, $ACCESS_KEY, and $HOST specify the access information required to access Alibaba Cloud OSS, AWS S3, or an object storage service that is compatible with the S3 protocol, and s3_region specifies the region information when you use S3. These sensitive access parameters are stored in the system tables of the database in an encrypted manner. A sketch that follows this format appears at the end of this section.

  - If the file is stored on HDFS, the LOCATION clause is specified in the following formats:

    - If you access the cluster by using the address of a single-node NameNode (NN), the LOCATION clause is specified in the LOCATION = hdfs://localhost:port/PATH format. In this case, localhost specifies the address of HDFS, port specifies the port number of HDFS, and PATH specifies the file path in HDFS.

    - If you access the cluster by using Kerberos authentication, the LOCATION clause is specified in the LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx' format. In this case:

      - principal: specifies the user for login authentication.
      - keytab: specifies the path of the key file for user authentication.
      - krb5conf: specifies the path of the description file for the Kerberos environment.
      - configs: specifies additional HDFS configuration items. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transfer protection level to authentication and privacy.

    - If you access the cluster by using Hadoop high availability (HA), the LOCATION clause is specified in the LOCATION = hdfs://nameserviceID/PATH format. In this case, nameserviceID specifies the ID of the logical naming service of the Hadoop HA cluster, and PATH specifies the file path.

      Note

      Make sure that the client configuration on the OBServer node contains the nameservice definition and failover strategy of the HA cluster.

    - If you access the HA cluster by using Kerberos authentication, the LOCATION clause is specified in the LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider' format. In this case:

      - principal: specifies the user for login authentication. In this case, the principal of a non-primary NN must be specified.
      - keytab and krb5conf: the same as in the single-node NN case.
      - configs: specifies additional HDFS configuration items. When multiple configuration items are specified, they cover the HA and security configurations:
        - dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
        - dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the current HA cluster.
        - dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs in the HA cluster.
        - dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the RPC address of namenode1 for client routing.
        - dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the RPC address of namenode2 for client routing.
        - dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies that the HA cluster automatically obtains a valid NameNode to respond to requests.
        - dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for switching between primary and standby nodes in the HA cluster. You can also customize and upload the logic required by the HA cluster.

      Notice

      Some HA-related configuration items are bound to the nameservice, such as mycluster in the HA example below. Make sure that the related configuration items are set correctly.
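To make the object storage format concrete, here is a minimal sketch of an external table over CSV files in Alibaba Cloud OSS, following the oss:// format described above; the access ID, access key, host, bucket path, table name, and columns are all placeholders, not working values.

CREATE EXTERNAL TABLE sales_oss_ext (
  id     INT,
  amount DECIMAL(10,2)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/sales_bucket/data'   -- replace with real access information
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = '.*\.csv';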
format_type_options
- TYPE = 'CSV': specifies the external file format as CSV. It also includes the following fields:

  - LINE_DELIMITER: specifies the line delimiter of the CSV file. The default value is LINE_DELIMITER = '\n'.
  - FIELD_DELIMITER: specifies the column delimiter of the CSV file. The default value is FIELD_DELIMITER = '\t'.
  - ESCAPE: specifies the escape character of the CSV file. It must be a single byte. The default value is ESCAPE = '\'.
  - FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character used to enclose field values in the CSV file. The default value is empty.

    Notice

    When the external table data file contains NULL values (an actual NULL, not the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.

  - ENCODING: specifies the character set encoding of the file. For the character sets supported in MySQL-compatible mode, see Character set. If this parameter is not specified, the default value is UTF8MB4.
  - NULL_IF: specifies the strings to be treated as NULL. The default value is empty.
  - SKIP_HEADER: specifies the number of header rows to skip.
  - SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE, indicating that blank lines are not skipped.
  - TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. The default value is FALSE, indicating that leading and trailing spaces are not removed.
  - EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE, indicating that empty strings are not treated as NULL.
  - IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last empty field at the end of a line in the file. The default value is TRUE, indicating that the last empty field is ignored.

- TYPE = 'PARQUET': specifies the external file format as PARQUET.

- TYPE = 'ORC': specifies the external file format as ORC.
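For illustration, a minimal sketch that combines several of the CSV options above; the table, directory, and file names are hypothetical.

CREATE EXTERNAL TABLE staff_ext (
  id   INT,
  name VARCHAR(100)
)
LOCATION = '/home/admin/ext_staff'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  ENCODING = 'utf8mb4'
  NULL_IF = ('NULL')            -- treat the literal string NULL as a NULL value
  SKIP_HEADER = 1               -- skip the header row
  SKIP_BLANK_LINES = TRUE
  TRIM_SPACE = TRUE
)
PATTERN = 'staff.csv';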
properties_type_options
- TYPE = 'ODPS': specifies the external data source type as ODPS. It also includes the following fields:

  - ACCESSID: specifies the AccessKey ID of the Alibaba Cloud account for identity authentication.
  - ACCESSKEY: specifies the AccessKey secret corresponding to the AccessKey ID for identity verification.
  - ENDPOINT: specifies the connection address of the ODPS service.
  - PROJECT_NAME: specifies the name of the target ODPS project to access.
  - SCHEMA_NAME: optional. Specifies the schema name in ODPS.
  - TABLE_NAME: specifies the name of the target table in ODPS.
  - QUOTA_NAME: optional. Specifies the quota to use.
  - COMPRESSION_CODE: optional. Specifies the compression format of the data source. Supported values are ZLIB, ZSTD, LZ4, and ODPS_LZ4. If this parameter is not specified, compression is not enabled.
  - API_MODE: specifies the API mode for calling ODPS. Valid values are as follows:

    - tunnel_api (default):

      - No special network configuration is required. This mode is applicable to all deployment scenarios; OceanBase Database does not need to be deployed in the same VPC as MaxCompute.
      - No additional MaxCompute Storage API permissions are required. Authentication can be completed by providing the AccessID and AccessKey.
      - Applicable scenarios:
        - OceanBase Database and MaxCompute are not deployed in the same VPC.
        - MaxCompute Storage API is not enabled.
        - Latency requirements for data transmission are not strict.

    - storage_api:

      - Network dependency: OceanBase Database and MaxCompute must be deployed in the same VPC to achieve low-latency, high-throughput data transmission.
      - Permission dependency: Storage API permissions must be enabled in MaxCompute, and the AccessKey must have the corresponding permissions.
      - Applicable scenarios:
        - OceanBase Database and MaxCompute are deployed in the same VPC.
        - MaxCompute Storage API is enabled.
        - The data volume is extremely large or real-time requirements are high.

  - SPLIT: specifies the unit for task splitting when storage_api is used. Valid values are byte and row. If the data size of each row in a table varies significantly, set SPLIT to byte. Otherwise, set it to row.
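For illustration, a minimal sketch of the PROPERTIES clause for an ODPS source, following the field list above; every value is a placeholder that must be replaced with your own MaxCompute access information, and the surrounding CREATE EXTERNAL TABLE statement is omitted.

PROPERTIES = (
  TYPE = 'ODPS',
  ACCESSID = '<access_id>',
  ACCESSKEY = '<access_key>',
  ENDPOINT = '<odps_endpoint>',
  PROJECT_NAME = '<project_name>',
  TABLE_NAME = '<table_name>',
  API_MODE = "tunnel_api"        -- default mode; no VPC or Storage API requirements
)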
Considerations
- When an external file is deleted, the external table ignores files that are in the file list but no longer exist.
- When an external file is modified, the external table accesses the latest data in the external file. If the external file is modified while the external table is being queried, unexpected results may occur. Therefore, avoid modifying external files while querying the external table.
- When a new file is added to the external directory, the external table only accesses the files already recorded in the file list. If you want the external table to see the new file, you need to update the external table's file list, as shown in the sketch below.
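A minimal sketch of such an update, assuming your OceanBase Database version supports the ALTER EXTERNAL TABLE ... REFRESH statement (see the ALTER EXTERNAL TABLE topic), using the contacts table from Example 1 below:

ALTER EXTERNAL TABLE contacts REFRESH;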
Examples
Example 1
Prepare the data. First, set the secure_file_priv path to /home/admin/ and place the CSV file extdata.csv that contains the external table data in the /home/admin/test directory of the current OBServer node.

Here is an example of setting the global secure file path:

obclient> SET GLOBAL secure_file_priv = "/home/admin/";
Query OK, 0 rows affected

obclient> \q
Bye

Note

Since secure_file_priv is a GLOBAL variable, you need to execute \q to exit the current session for the setting to take effect.

The content of the CSV file is as follows:

1,'Dave','Smith','dsmith@outlook.com','friend',32
2,'Xena','Johnson','xjonson@outlook.com','contact',45
3,'Fred','Jackon','fjackson@outlook.com','co-worker',19
4,'Alma','Tyler','atyler@outlook.com','friend',53

After logging in to the database as a user of a user tenant, create the external table contacts.

obclient> CREATE EXTERNAL TABLE contacts (
    id INT,
    firstname VARCHAR(100),
    lastname VARCHAR(100),
    email VARCHAR(255),
    category CHAR(30),
    age NUMBER
  )
  LOCATION = '/home/admin/test'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'extdata.csv';

Query the data in the external table contacts.

obclient> SELECT * FROM contacts;
+----+-----------+----------+----------------------+-----------+------+
| id | firstname | lastname | email                | category  | age  |
+----+-----------+----------+----------------------+-----------+------+
|  1 | Dave      | Smith    | dsmith@outlook.com   | friend    |   32 |
|  2 | Xena      | Johnson  | xjonson@outlook.com  | contact   |   45 |
|  3 | Fred      | Jackon   | fjackson@outlook.com | co-worker |   19 |
|  4 | Alma      | Tyler    | atyler@outlook.com   | friend    |   53 |
+----+-----------+----------+----------------------+-----------+------+
4 rows in set
Example 2
Assume that there is a CSV file under a certain file path on HDFS, and the content of the file is as follows:
$ hdfs dfs -cat /user/test_tbl1.csv
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'

Create an HDFS external table.
If Kerberos authentication is not enabled in the target HDFS environment:
CREATE EXTERNAL TABLE test_tbl1_csv_mysql (
    id INT,
    name VARCHAR(50),
    c_date DATE
  )
  LOCATION = 'hdfs://${hadoop namenode hostname}:${hadoop namenode port}/user'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'test_tbl1.csv';

Notice

Here, hadoop namenode hostname and hadoop namenode port refer to the host name and port of the HDFS NameNode and must be replaced with actual values.

If Kerberos authentication is enabled in the target HDFS environment:

Environment preparation

Since the current HDFS integration uses the JNI framework, the environment must support deploying a Java environment. If OBServer is deployed across multiple nodes, each of those OBServer nodes must be configured with the following settings; it is not sufficient to configure only one node.
Step 1: Deploy and configure the Java environment
You can download the latest version of OpenJDK 8 from openjdk.
Step 2: Decompress the installation package
After decompression, you will see the following files:
$ ls $JAVA_HOME
ASSEMBLY_EXCEPTION  bin  include  jre  lib  LICENSE  man  NOTICE  release  sample  src.zip  THIRD_PARTY_README

Step 3: Deploy and configure the HDFS.so dynamic library
You can download the required dependencies from the following address: https://mirrors.aliyun.com/oceanbase/ After downloading, install it using the following command:
sudo rpm -Uvh devdeps-hdfs-sdk-3.3.6-xxxxx.xxx.xxx.rpm

Execute the following command to check whether the installation meets expectations. The installation should include the libhdfs.so and libhdfs.so.0.0.0 files, and the corresponding symbolic links should be valid.

$ ll /usr/local/oceanbase/deps/devel/lib
total 376
lrwxrwxrwx 1 root root     16 Dec 24 19:49 libhdfs.so -> libhdfs.so.0.0.0
-rwxr-xr-x 1 root root 384632 Dec 24 19:09 libhdfs.so.0.0.0

Step 4: Deploy and configure the path for the required JAR package:
Download the JAR package: Visit OceanBase Database Mirror to download the JAR package.
Notice
- For OceanBase Database V4.3.5 BP1 and earlier versions: Please download the 1.0.0 version of the devdeps-java-extensions RPM installation package.
- For OceanBase Database V4.3.5 BP2 and later versions: Please download the 1.0.1 version of the devdeps-java-extensions RPM installation package.
- For OceanBase Database V4.4.0: Please download the 1.0.1 version of the devdeps-java-extensions RPM installation package.
- For OceanBase Database V4.4.1: Please download the 1.0.2 version of the devdeps-java-extensions RPM installation package.
Install the JAR package: After downloading the required JAR package, execute the following command to decompress and install it:
sudo rpm -Uvh devdeps-java-extensions-x.x.x-xxxxxxxxxxxx.xxx.xxxxxx.rpm

Check whether the installation meets expectations:

$ ll /home/admin/oceanbase/jni_packages
total 52756
drwxr-sr-x 4 root root     4096 Dec 24 20:25 hadoop
drwxr-xr-x 3 root root     4096 Dec 24 20:25 lib
-rw-r--r-- 1 root root 54008720 Dec 24 19:52 oceanbase-odps-connector-jar-with-dependencies.jar
Start the OBServer
Before using the HDFS external table, you need to configure the corresponding OBServer node. The configuration steps are as follows:
Note
Since the current OBServer does not support flexible and immediate configuration changes for JNI-related settings, any changes to Java environment variables require a restart of the OBServer to take effect. All the following configurations are cluster-wide and need to be set only once, without needing to configure each node individually.
Execute the following command as the sys tenant to enable the Java environment supported by OBServer:
alter system set ob_enable_java_env=true;

Execute the following command to set the Java home path for OBServer:

alter system set ob_java_home="/home/user/openjdk/jdk8u422-b05";

Note
The path is derived from the OpenJDK Java installation path.
Set the relevant configuration items for Java environment startup:
Notice
Changes to this configuration item require a restart of the OBServer to take effect. Since the current HDFS integration copies data directly from the HDFS data stream to the C++ memory heap, you can appropriately reduce the -Xmx2048m and -Xms2048m settings.

alter system set ob_java_opts="-Djdk.lang.processReaperUseDefaultStackSize=true -XX:+HeapDumpOnOutOfMemoryError -Xmx2048m -Xms2048m -Xloggc:/home/user/jvmlogs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/user/jvmlogs/heapdumps/ -XX:+UseG1GC -XX:-CriticalJNINatives";

Note
Here, /home/user/jvmlogs and /home/user/jvmlogs/heapdumps are the paths where the corresponding log folders are generated, and they must be created manually.

Set the available LD_LIBRARY_PATH path:
Note
Since the JNI SDK supported by OBServer is an extended feature, it is dynamically loaded using dynamic libraries. Therefore, you need to configure the corresponding dynamic library path by setting LD_LIBRARY_PATH. This configuration must be applied to all OBServer nodes to be effective.
$ vim ~/.bashrc
export LD_LIBRARY_PATH=/home/user/openjdk/jdk8u422-b05/jre/lib/amd64/server:/usr/local/oceanbase/deps/devel/lib
Create the HDFS external table
Notice
If Kerberos authentication is configured, the corresponding OBServer node must be deployed with the keytab (user authentication key file) and krb5conf file.
CREATE EXTERNAL TABLE partsupp (
    PS_PARTKEY INTEGER,
    PS_SUPPKEY INTEGER,
    PS_AVAILQTY INTEGER,
    PS_SUPPLYCOST DECIMAL(15,2),
    PS_COMMENT VARCHAR(199)
  )
  LOCATION = 'hdfs://localhost:8020/tpch_csv?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = '|'
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  )
  PATTERN = 'partsupp.tbl';
Query the external table data.
select * from test_tbl1_csv_mysql;

The returned result is as follows:

+----+----------+------------+
| id | name     | c_date     |
+----+----------+------------+
|  1 | Emma     | 2021-09-01 |
|  2 | William  | 2021-09-02 |
|  3 | Olivia   | 2021-09-03 |
+----+----------+------------+
3 rows in set
Example 3
Assume that there is a CSV file in a certain path on HDFS, and the file content is as follows:
$ hdfs dfs -cat /hadoop_ha_test/test_simple.csv
1,lili,19
2,alic,20
3,solvi,21

Enable the OBServer-related JNI configuration items.

For more information about the JNI configuration, see the environment preparation steps under "If Kerberos authentication is enabled in the target HDFS environment" in Example 2. For more information about how to deploy the OceanBase Database Java SDK environment, see Deploy the OceanBase Database Java SDK environment.
Create an HDFS external table.
obclient> CREATE EXTERNAL TABLE test_ha (
    id INT,
    r_name VARCHAR(100),
    age INT
  )
  LOCATION = 'hdfs://mycluster/hadoop_ha_test?principal=ha/xxx@xxx.com&keytab=/path/to/ha.keytab&krb5conf=/path/to/krb5conf_file&configs=dfs.data.transfer.protection=integrity#dfs.nameservices=mycluster#dfs.ha.namenodes.mycluster=nn1,nn2#dfs.namenode.rpc-address.mycluster.nn1=localhost1:port#dfs.namenode.rpc-address.mycluster.nn2=localhost2:port#dfs.ha.automatic-failover.enabled.mycluster=true#dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
  FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '\''
  )
  PATTERN = 'test_simple.csv';

Query data from the external table.

obclient> SELECT * FROM test_ha;

The returned result is as follows:

+----+--------+------+
| id | r_name | age  |
+----+--------+------+
|  1 | lili   |   19 |
|  2 | alic   |   20 |
|  3 | solvi  |   21 |
+----+--------+------+
3 rows in set