This topic describes how to create an external table by using SQL statements, and provides an overview, the prerequisites, considerations, and examples for creating an external table.
Overview
An external table is a logical table object that stores data in an external storage service instead of in the database.
For more information about external tables, see Overview.
Prerequisites
Before you create an external table, make sure that the following conditions are met:
You have deployed an OceanBase cluster and created a MySQL tenant. For more information, see Overview.
You have connected to a MySQL tenant of OceanBase Database. For more information, see Overview.
You have created a database. For more information, see Create a database.
You have the CREATE privilege. For information about how to view the privileges of the current user, see View user privileges. If you do not have the privilege, contact the administrator to grant it to you. For information about how to directly grant privileges, see Directly grant privileges.
Considerations
External tables support only query operations and do not support DML operations.
If the external file accessed by an external table is deleted, the system does not return an error but instead returns an empty row.
If the external storage system is unavailable, queries on external tables will return an error.
Because external table data is stored in an external data source, queries may involve network and file system overhead, which can affect query performance. When you create an external table, select an appropriate data source and optimization strategy to improve query efficiency.
Use the CLI to create an external table
Use the CREATE EXTERNAL TABLE statement to create an external table.
Define the name of the external table
When you create an external table, you must specify its name. To avoid confusion and ambiguity, we recommend that you use specific naming rules, such as a dedicated prefix or suffix, to distinguish external tables from regular tables. For example, you can add _csv as a suffix to the name of an external table.
Here is an example:
When you create an external table that stores information about students, you can name it students_csv.
CREATE EXTERNAL TABLE students_csv external_options
Notice
Because the other attributes of the external table are not specified, the preceding SQL statement cannot be executed.
Define columns
You cannot define constraints on the columns of an external table, including DEFAULT, NOT NULL, UNIQUE, CHECK, PRIMARY KEY, and FOREIGN KEY constraints.
The column types supported by external tables are the same as those supported by regular tables. For more information about the data types supported by OceanBase Database in MySQL mode, see Data types.
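For illustration, a column definition without any constraints might look like the following sketch (the table name, columns, and path are invented for this example):

```sql
-- Illustrative only: the columns carry no constraints, because external
-- tables do not support DEFAULT, NOT NULL, UNIQUE, CHECK,
-- PRIMARY KEY, or FOREIGN KEY.
CREATE EXTERNAL TABLE orders_csv (
  order_id   INT,
  customer   VARCHAR(100),
  amount     DECIMAL(10, 2),
  order_date DATE
)
LOCATION = '/home/admin/orders'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' );
```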
Define LOCATION
The LOCATION option specifies the path where the external table files are stored. Usually, the data files of an external table are stored in a separate directory, which can contain subdirectories. When you create an external table, it automatically collects all files in the directory.
OceanBase Database supports the following two path formats:
Local location format:
LOCATION = '[file://] local_file_path'
Notice
For scenarios that use the local location format, you must set the system variable secure_file_priv to configure the accessible path. For more information, see secure_file_priv.
Remote location format:
Notice
When you use an object storage path, the parameters in the path are separated by the & symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, the characters / - _ $ + =, and wildcards. If you enter other characters, the configuration may fail.
LOCATION = '{oss|S3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'
where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access credentials required for accessing Alibaba Cloud OSS, AWS S3, or object storage compatible with the S3 protocol, and s3_region is the region selected when using S3. This sensitive access information is stored in the system tables of the database in encrypted form.
When the files are stored on HDFS, the format is as follows:
Format for accessing the cluster based on the address of a single-node NameNode (NN):
LOCATION = 'hdfs://localhost:port/PATH'
where localhost is the address of HDFS, port is the port number of HDFS, and PATH is the file path in HDFS.
Format with Kerberos authentication:
LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'
where:
principal: the user for login authentication.
keytab: the path of the key file for user authentication.
krb5conf: the path of the description file for the Kerberos environment.
configs: extra HDFS configurations. By default, this parameter is empty. However, in a Kerberos environment, this parameter usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection level to authentication and privacy.
Format for accessing the cluster based on Hadoop High Availability (HA):
LOCATION = 'hdfs://nameserviceID/PATH'
where nameserviceID is the logical nameservice ID of Hadoop HA, and PATH is the file path.
Note
Make sure that the OBServer node that acts as the client has the nameservice definition and failover strategy for the HA cluster.
Format with Kerberos authentication:
LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
where:
principal: the user for login authentication. Set this parameter to the principal of a non-primary NN.
keytab and krb5conf: same as in the single-node NN setting.
configs: extra HDFS configurations. If you need to set multiple configurations, only HA configurations and security configurations are relevant:
dfs.data.transfer.protection=${string}: aligns with the dfs.data.transfer.protection configuration of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice (alias) of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs of the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the RPC address of namenode1 for client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the RPC address of namenode2 for client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: enables the HA cluster to automatically obtain an available NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for switching between primary and standby nodes in the HA cluster. You can also customize and upload the required logic for the HA cluster.
Notice
HA-related parameters are bound to the nameservice. For example, if mycluster is used as the nameservice ID, make sure that the relevant parameters are set accordingly.
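The path formats above can be summarized with a few illustrative values (the hosts, credentials, ports, and paths below are placeholders, not working endpoints):

```sql
-- Local file system (requires secure_file_priv to cover the path):
LOCATION = 'file:///home/admin/data'
-- S3-compatible object storage, following the format described above:
LOCATION = 'S3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/mybucket/data'
-- HDFS with a single NameNode:
LOCATION = 'hdfs://namenode-host:8020/user/data'
-- HDFS with Hadoop HA, addressed by the logical nameservice ID:
LOCATION = 'hdfs://mycluster/user/data'
```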
Define the FORMAT clause
FORMAT = ( TYPE = 'CSV' ... ) specifies that the external file is of the CSV type. The following parameters are available:
TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the CSV file. The default value is LINE_DELIMITER = '\n'.
FIELD_DELIMITER: specifies the field delimiter of the CSV file. The default value is FIELD_DELIMITER = '\t'.
ESCAPE: specifies the escape character of the CSV file. It must be one byte. The default value is ESCAPE = '\'.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the character used to enclose field values in the CSV file. The default value is empty.
Notice
If the external file contains NULL values (the non-string NULL, not "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.
ENCODING: specifies the character set encoding format of the file. For information about the character sets supported in MySQL mode, see Character sets. If this parameter is not specified, the default value is UTF8MB4.
NULL_IF: specifies the string to be treated as NULL. The default value is empty.
SKIP_HEADER: specifies the number of lines to skip from the beginning of the file.
SKIP_BLANK_LINES: specifies whether to skip blank lines. The default value is FALSE, indicating that blank lines are not skipped.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. The default value is FALSE, indicating that leading and trailing spaces are not removed.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. The default value is FALSE, indicating that empty strings are not treated as NULL.
COMPRESSION: optional. Specifies the compression format of the file. Valid values are:
GZIP/DEFLATE: GZIP-compressed file.
ZSTD: ZSTD-compressed file.
NONE: the file is not compressed (default value).
SNAPPY_BLOCK: Hadoop-Snappy-compressed file.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold for parallel parsing, in bytes. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length of a single row of data, in bytes. The default value is 2 MB. This parameter is used during the boundary determination phase of parallel parsing of a single CSV file and does not affect the accuracy of data reading.
FORMAT = ( TYPE = 'PARQUET' ... ) specifies that the external file is of the PARQUET type.
FORMAT = ( TYPE = 'ORC' ... ) specifies that the external file is of the ORC type.
Note
- Starting from OceanBase Database V4.6.0, external tables in the Parquet or ORC file format support reading complex data types such as Array, Map, and JSON. The data types used must be consistent with those supported by OceanBase Database. For more information, see Overview of data types.
- OceanBase Database supports reading Bloom filters from external tables in the Parquet or ORC file format. Bloom filters filter data based on query predicates to reduce unnecessary I/O operations and accelerate queries.
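To illustrate how these parameters combine, a FORMAT clause for a GZIP-compressed, comma-separated file with one header row might look like the following sketch (the parameter values are examples chosen for illustration, not recommendations):

```sql
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  LINE_DELIMITER = '\n'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  EMPTY_FIELD_AS_NULL = TRUE
  COMPRESSION = 'GZIP'
)
```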
(Optional) Define the PATTERN clause
The PATTERN option specifies a regular expression pattern to filter files in the LOCATION directory. For each file path in the LOCATION directory, the external table accesses the file if its path matches the pattern and skips it otherwise. If this option is not specified, all files in the LOCATION directory are accessible by default. The external table saves the list of files in the LOCATION directory that match the PATTERN in a database system table. When the external table is scanned, the external files are accessed based on this list.
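For example, a pattern that restricts an external table to the .csv files under LOCATION could be written as follows (the table definition and paths are illustrative):

```sql
CREATE EXTERNAL TABLE logs_csv (
  ts  DATETIME,
  msg VARCHAR(255)
)
LOCATION = '/home/admin/logs'
FORMAT = ( TYPE = 'CSV' FIELD_DELIMITER = ',' )
PATTERN = '.*\.csv';
```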
(Optional) Define partitions for the external table
Automatically define partitions for the external table
The external table calculates and adds partitions based on the expressions defined for the partitioning key. When querying, you can specify the value or range of the partitioning key. In this case, partition pruning is performed, and the external table only reads files in the specified partition.
Manually define partitions for the external table
If you need to manually add and delete partitions instead of letting the external table manage them automatically, you must specify the PARTITION_TYPE = USER_SPECIFIED field.
Example
Notice
The IP address in the example is masked. Replace it with the actual IP address of your server.
This example shows how to create an external table in the MySQL mode of OceanBase Database based on an external file stored on the local server. The steps are as follows:
Prepare an external file.
Run the following command to create a file named test_tbl1.csv in the /home/admin directory on the server where the OBServer node is located.
[admin@xxx /home/admin]# vi test_tbl1.csv
The content of the file is as follows:
1,'Emma','2021-09-01'
2,'William','2021-09-02'
3,'Olivia','2021-09-03'
Set the import file path.
Notice
For security reasons, you can modify the secure_file_priv system variable only by using a local socket connection. For more information, see secure_file_priv.
Run the following command to log in to the server where the OBServer node is located.
ssh admin@10.10.10.1
Run the following command to connect to the mysql001 tenant by using a local Unix socket connection.
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@mysql001 -p******
Run the following SQL statement to set the import path to /home/admin.
SET GLOBAL secure_file_priv = "/home/admin";
Reconnect to the mysql001 tenant.
Here is an example:
obclient -h10.10.10.1 -P2881 -uroot@mysql001 -p****** -A -Dtest
Run the following SQL statement to create an external table named test_tbl1_csv.
CREATE EXTERNAL TABLE test_tbl1_csv (
  id INT,
  name VARCHAR(50),
  c_date DATE
)
LOCATION = '/home/admin'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '\''
)
PATTERN = 'test_tbl1.csv';
Run the following SQL statement to query the data in the test_tbl1_csv external table.
SELECT * FROM test_tbl1_csv;
The return result is as follows:
+------+---------+------------+
| id   | name    | c_date     |
+------+---------+------------+
|    1 | Emma    | 2021-09-01 |
|    2 | William | 2021-09-02 |
|    3 | Olivia  | 2021-09-03 |
+------+---------+------------+
3 rows in set
Example 2
Example of creating an HDFS external table
With Kerberos:
CREATE EXTERNAL TABLE ext_data (
id INT,
name VARCHAR(50),
c_date DATE
)
LOCATION = 'hdfs://localhost:8020/user?principal=principal_str&keytab=/path/to/keytab&krb5conf=/path/to/krb5conf_file&configs=xxx=xxx#xxx=xxx'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='"'
)
PATTERN = 'data.csv';
Where:
principal: specifies the user for login authentication.
keytab: specifies the path of the key file for user authentication.
krb5conf: specifies the path of the description file for the Kerberos environment.
configs: specifies additional HDFS configuration items. By default, it is empty. However, in a Kerberos environment, this configuration item usually has a value and needs to be configured. For example, dfs.data.transfer.protection=authentication,privacy sets the data transmission protection level to authentication and privacy.
Without Kerberos:
CREATE EXTERNAL TABLE ext_data (
id INT,
name VARCHAR(50),
c_date DATE
)
LOCATION = 'hdfs://localhost:8020/user'
FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
FIELD_OPTIONALLY_ENCLOSED_BY ='\''
)
PATTERN = 'test_tbl1.csv';
References
- You can use the same method as deleting a regular table to delete an external table. For more information about how to delete a table, see Delete a table.
- For more information about how to view and update external files, see Manage external files.
