You can create an external table by using the CREATE EXTERNAL TABLE statement. When you create an external table, you must specify the path and the format of the data file so that the database can read data from the external file.
Privilege requirements
To create an external table, the current user must have the CREATE privilege. For more information about how to view the privileges of the current user, see View user privileges.
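For example, an administrator can grant the CREATE privilege by using a standard MySQL-style GRANT statement, and the current user can then verify it. The database name `test` and user name `ext_user` below are placeholders for illustration:

```sql
-- As an administrator, grant the CREATE privilege on database `test`
-- to user `ext_user` (hypothetical names):
GRANT CREATE ON test.* TO 'ext_user';

-- As the current user, view the granted privileges:
SHOW GRANTS;
```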
Create an external table
The SQL statement for creating an external table is as follows:
CREATE EXTERNAL TABLE table_name (column_definition_list)
LOCATION = 'file_name'
{FORMAT = (format_type_options)
| PROPERTIES = (properties_type_options)}
[PARTITION BY (column_name [, column_name ...])]
[PARTITION_TYPE = USER_SPECIFIED]
[PATTERN = 'regex_pattern']
[AUTO_REFRESH = 'xxx'];
column_definition_list:
column_definition [, column_definition ...]
column_definition:
column_name column_type [AS (metadata$filecol{N} | external$filepos{N})]
format_type_options:
type_csv_option
| type_parquet_option
| type_orc_option
type_csv_option:
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = {TRUE | FALSE}
TRIM_SPACE = {TRUE | FALSE}
EMPTY_FIELD_AS_NULL = {TRUE | FALSE}
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
type_parquet_option:
TYPE = 'PARQUET'
[COLUMN_INDEX_TYPE = {'POSITION' | 'NAME'}]
type_orc_option:
TYPE = 'ORC'
[COLUMN_INDEX_TYPE = 'POSITION']
properties_type_options:
type_odps_option
type_odps_option:
TYPE = 'ODPS'
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>',
API_MODE = {"tunnel_api" | "storage_api"},
SPLIT = {"byte" | "row"}
The parameters are described as follows:
col_name col_type [AS (metadata$filecol{N} | external$filepos{N})]: defines the columns. AS (metadata$filecol{N}) manually defines the column mapping, and AS (external$filepos{N}) indexes columns by position. External tables support the same column types as regular tables. For more information about the data types supported by OceanBase Database in MySQL-compatible mode, see Overview of data types.
By default, columns in the external file are automatically mapped to those in the external table in sequence. That is, the first column in the external table corresponds to the first column in the external file.
For example, in the following statement, the C1 column in the ext_t1 external table is automatically mapped to the first column in the external file, and the C2 column is automatically mapped to the second column.

CREATE EXTERNAL TABLE ext_t1 (
  C1 int,
  C2 int
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

When the column order in the external file differs from that in the external table, you can specify the column mapping in either of the following ways:

Use the metadata$filecol{N} syntax: applies to the CSV format. It specifies that a column in the external table corresponds to the Nth column in the external file. Columns in the file are numbered starting from 1.

Use the external$filepos{N} syntax: applies to the Parquet and ORC formats. It indexes columns by position. You must use it together with the COLUMN_INDEX_TYPE = 'POSITION' parameter.

For example, in the following statement, C1 int AS (metadata$filecol2) specifies that the C1 column in the ext_t2 external table corresponds to the second column in the external file, and C2 int AS (metadata$filecol4) specifies that the C2 column corresponds to the fourth column.

CREATE EXTERNAL TABLE ext_t2 (
  C1 int AS (metadata$filecol2),
  C2 int AS (metadata$filecol4)
)
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

For the Parquet and ORC formats, you can use the external$filepos{N} syntax:

CREATE EXTERNAL TABLE ext_t3 (
  c1 INT AS (external$filepos1),
  c2 INT AS (external$filepos2),
  c3 VARCHAR(50) AS (external$filepos3),
  c4 DATE AS (external$filepos4)
)
LOCATION = 'path/to/data/'
FORMAT = (TYPE = 'PARQUET' COLUMN_INDEX_TYPE = 'POSITION')
PATTERN = "000000_0$";

Notice
If you manually define the column mapping, automatic column mapping is disabled and all columns must be mapped manually.
Notice
When you use the external$filepos{N} syntax, you must set the COLUMN_INDEX_TYPE = 'POSITION' parameter.

LOCATION = 'file_name': specifies the path where the external files are stored. In most cases, the data files of an external table are stored in a dedicated directory, which can contain subdirectories. When you create an external table, the database automatically collects all files in the directory. Two formats are supported:
Local location format: LOCATION = '[file://] local_file_path'

local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.

Notice
The local_file_path parameter must specify a directory rather than a file. To specify a single file, set the LOCATION parameter to the directory that contains the file and set the PATTERN parameter to the file name.

For scenarios that use the local location format, if you set the secure_file_priv variable to a path that OceanBase Database can access, the value must be the parent directory of local_file_path. In other words, local_file_path must be a subdirectory of the secure_file_priv path.

The secure_file_priv variable specifies the path that OceanBase Database can access when you import data from a file or export data to a file. For more information about the secure_file_priv variable, see secure_file_priv.
Remote location format:

Notice
When you use object storage paths, parameters in the path are separated by the & character. Ensure that parameter values contain only uppercase and lowercase letters, digits, the characters /-_$+=, and wildcards. Other characters may cause the setting to fail.

For files on OSS or S3, the format is LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path', where $ACCESS_ID, $ACCESS_KEY, and $HOST are the access parameters required by Alibaba Cloud OSS, AWS S3, or S3-compatible object storage, and s3_region is the region selected when S3 is used. These sensitive access parameters are stored in the system tables of the database in encrypted form.

For files on HDFS, the following formats are supported:
Single NameNode (NN) access: LOCATION = 'hdfs://localhost:port/PATH', where localhost is the HDFS address, port is the HDFS port, and PATH is the file path in HDFS.

With Kerberos authentication: LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx', where:

principal: the login user for authentication.
keytab: the path to the keytab file for user authentication.
krb5conf: the path to the Kerberos configuration file.
configs: extra HDFS configuration parameters. The default value is empty. For Kerberos environments, this parameter is usually set, for example, dfs.data.transfer.protection=authentication,privacy, to specify the data transfer protection level.
Hadoop HA (high availability) logical nameservice access: LOCATION = 'hdfs://nameserviceID/PATH', where nameserviceID is the logical nameservice ID of Hadoop HA and PATH is the file path.

Note
Ensure that the OBServer client configuration includes the nameservice definition and failover strategy for the HA cluster.

With Kerberos authentication: LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider', where:

principal: the login user for authentication. Set it to the principal of the non-primary NN.
keytab and krb5conf: same as for single-NN access.
configs: additional HDFS configuration options. Multiple options can be set, mainly related to HA and security:
dfs.data.transfer.protection=${string}: the data transfer protection configuration, aligned with the cluster's dfs.data.transfer.protection setting.
dfs.nameservices=${nameservice id}: the nameservice (alias) of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: the list of NameNode IDs in the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: the address of namenode1, used for client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: the address of namenode2, used for client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: enables automatic failover so that the HA cluster can automatically select an available NameNode for service.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: the logic class for HA cluster failover. You can also package and upload a custom logic class if needed.

Notice
Some HA configuration parameters are bound to the nameservice (for example, mycluster in the example). Configure the related parameters together.
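Putting the pieces together, the following sketch shows an external table over an HDFS directory. The host name, port, path, and table columns are placeholders for illustration:

```sql
-- Hypothetical external table reading CSV files from an HDFS directory.
-- Replace nn_host, the port, and the path with the values of your cluster.
CREATE EXTERNAL TABLE ext_hdfs (
  c1 int,
  c2 varchar(50)
)
LOCATION = 'hdfs://nn_host:8020/user/data/orders/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```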
FORMAT = (TYPE = 'CSV'...): defines the external file format as CSV.

TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the file. If this parameter is not specified, the default value LINE_DELIMITER = '\n' takes effect.
FIELD_DELIMITER: specifies the field delimiter of the file. If this parameter is not specified, the default value FIELD_DELIMITER = '\t' takes effect.
ESCAPE: specifies the escape character of the file. For example, ESCAPE = '*' indicates that the asterisk (*) replaces the default escape character (\). If this parameter is not specified, the default value ESCAPE = '\' takes effect.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the characters that enclose the field values in the file. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' indicates that values are enclosed in double quotation marks. If this parameter is not specified, the default value is empty.

Notice
When the external data file contains NULL values (actual NULL values rather than the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and its value cannot be empty.

ENCODING: specifies the character set encoding used by the file. For more information about the character sets supported in MySQL-compatible mode, see Character sets. If this parameter is not specified, the default value UTF8MB4 takes effect.
NULL_IF: specifies the strings that are treated as NULL values. If this parameter is not specified, the default value takes effect.
SKIP_HEADER: specifies the number of lines to skip in the file header. If this parameter is not specified, the file header is not skipped by default.
SKIP_BLANK_LINES: specifies whether to skip blank lines. If this parameter is not specified, the default value FALSE takes effect.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. If this parameter is not specified, the default value FALSE takes effect.
EMPTY_FIELD_AS_NULL: specifies whether empty strings are treated as NULL values. If this parameter is not specified, the default value FALSE takes effect.
IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore a trailing empty field, that is, a field delimiter immediately followed by the line delimiter. The default value is TRUE, which means that the last empty field is ignored.

FORMAT = (TYPE = 'PARQUET'...): defines the external file format as Parquet.

TYPE: specifies the external file type as PARQUET.
COLUMN_INDEX_TYPE: specifies the column index type. The supported values are 'POSITION' and 'NAME', for position-based and name-based column indexing. This parameter is required when you use the external$filepos{N} syntax.
FORMAT = (TYPE = 'ORC'...): defines the external file format as ORC.

TYPE: specifies the external file type as ORC.
COLUMN_INDEX_TYPE: specifies the column index type. Currently, only 'POSITION' is supported, for position-based column indexing. This parameter is required when you use the external$filepos{N} syntax.
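As an illustration, the following sketch combines several of the CSV options described above: skipping one header line, treating the strings 'NULL' and '\N' as NULL values, and trimming leading and trailing spaces. The table name, file name, and directory are hypothetical:

```sql
-- Hypothetical CSV external table combining several format options.
CREATE EXTERNAL TABLE ext_csv_opts (
  id int,
  name varchar(20)
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1                -- skip the first (header) line
  NULL_IF = ('NULL', '\N')       -- treat these strings as NULL
  TRIM_SPACE = TRUE              -- strip leading/trailing spaces
)
PATTERN = 'users.csv';
```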
PROPERTIES = (TYPE = 'ODPS'...): specifies that the external data source is ODPS. The following fields are included:

ACCESSID: specifies the AccessKey ID of your Alibaba Cloud account for authentication.
ACCESSKEY: specifies the AccessKey Secret that corresponds to the AccessKey ID for authentication.
ENDPOINT: specifies the connection address of the ODPS service.
PROJECT_NAME: specifies the name of the target ODPS project.
SCHEMA_NAME: optional. Specifies the schema name in ODPS.
TABLE_NAME: specifies the name of the target table in ODPS.
QUOTA_NAME: optional. Specifies the quota to use.
COMPRESSION_CODE: optional. Specifies the compression format of the data source. The supported values are ZLIB, ZSTD, LZ4, and ODPS_LZ4. If this parameter is not set, compression is disabled.
API_MODE: specifies the API mode for accessing ODPS. Valid values:

tunnel_api (default value):
Does not require special network configuration: suitable for all deployment scenarios. OceanBase Database and MaxCompute do not need to be in the same Virtual Private Cloud (VPC).
Does not require additional MaxCompute permissions: only the AccessKey ID and AccessKey Secret are needed for authentication. MaxCompute Storage API permissions are not required.
Is suitable when:
- OceanBase Database and MaxCompute are not deployed in the same VPC.
- The MaxCompute Storage API is not enabled.
- Data transfer does not have strict latency requirements.

storage_api:
Requires OceanBase Database and MaxCompute to be deployed in the same VPC for low-latency, high-throughput data transfer.
Requires the Storage API permission to be enabled in MaxCompute, and the AccessKey to have the necessary permissions.
Is suitable when:
- OceanBase Database and MaxCompute are in the same VPC.
- The MaxCompute Storage API is enabled.
- The data volume is large or real-time requirements are high.

SPLIT: when storage_api is used, specifies whether tasks are split by byte or by row for assignment to different threads. If the byte size of rows in the table varies greatly, set SPLIT to byte to split by byte; otherwise, set it to row to split by row.
PATTERN: specifies a regular expression used to filter the files in the directory specified by LOCATION. The external table can access only the files that match the pattern and skips the files that do not. If this parameter is not specified, the external table can access all files in the directory specified by LOCATION.
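For example, assuming the LOCATION directory contains daily CSV files named by date, a pattern such as the following (table name, directory, and naming scheme are hypothetical) restricts the external table to the files for January 2024:

```sql
-- Hypothetical table that reads only files such as 2024-01-15.csv.
CREATE EXTERNAL TABLE ext_jan (
  c1 int,
  c2 varchar(50)
)
LOCATION = '/home/admin/oceanbase/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = '2024-01-.*\.csv';
```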
Assume that a data.csv file exists in the /home/admin/oceanbase/ directory on your local machine, and the file contains the following data.
1,"lin",98
2,"hei",90
3,"ali",95
CSV format external table example
On the OBServer node, connect to the MySQL-compatible tenant of the cluster as the tenant administrator through the local Unix socket. Connection example:

obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********

For more information about how to connect to OceanBase Database through a local Unix socket, see secure_file_priv.
Configure the path /home/admin/oceanbase/ that the database can access.

SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";

After the command is executed, you must start a new session for the modification to take effect.
Reconnect to the database and create the external table ext_t3.

CREATE EXTERNAL TABLE ext_t3 (
  id int,
  name char(10),
  score int
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can use the SHOW CREATE TABLE statement to view the table definition, just like that for a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ext_t3 | CREATE EXTERNAL TABLE `ext_t3` (
`id` int(11) GENERATED ALWAYS AS (metadata$filecol1),
`name` char(10) GENERATED ALWAYS AS (metadata$filecol2),
`score` int(11) GENERATED ALWAYS AS (metadata$filecol3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When you query an external table, the system reads the external file through the driver of the external table and parses the file according to its format. Then it converts the parsed data into internal data types of OceanBase Database and returns the data rows. The following example queries the external table ext_t3 that was just created.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| id | name | score |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
You can also combine an external table with a regular table for query operations. Assume that a regular table named info exists in the current database, and the table contains the following data:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| li | female | 8 |
+------+--------+------+
3 rows in set
Example of combining the external table ext_t3 with the regular table info for query operations:
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| name | sex | age |
+------+--------+------+
| lin | male | 8 |
| li | female | 8 |
+------+--------+------+
2 rows in set
For more information about queries, see Read data.
Create an ODPS Java SDK external table
To create an ODPS Java SDK external table, you need to use OceanBase Database in an environment with the Java SDK installed. For detailed information on setting up the Java SDK environment, see Deploy the Java SDK environment for OceanBase Database.
An example of creating an ODPS Java SDK external table is as follows:
CREATE EXTERNAL TABLE lineitem(
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber BIGINT,
l_quantity DECIMAL(15,2),
l_extendedprice DECIMAL(15,2),
l_discount DECIMAL(15,2),
l_tax DECIMAL(15,2),
l_returnflag CHAR(1),
l_linestatus CHAR(1),
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct CHAR(25),
l_shipmode CHAR(10),
l_comment VARCHAR(44))
PROPERTIES = (
TYPE = 'ODPS'
ACCESSID = '***********'
ACCESSKEY = '***********'
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
PROJECT_NAME = 'bigdata_public_dataset',
SCHEMA_NAME = 'tpch_10g',
TABLE_NAME = 'lineitem',
QUOTA_NAME = '',
COMPRESSION_CODE = '',
API_MODE = "storage_api",
SPLIT = "byte"
);
Considerations for using external tables
External tables support only query operations; DML operations are not supported.
When you query an external table, if the external file accessed by the table has been deleted, the system does not return an error but returns empty rows.
The files accessed by an external table are managed by the external storage system. If the external storage is unavailable, an error is returned when you query the external table.
What to do next
When you create an external table, the system saves the file list that matches the PATTERN in the path specified in the LOCATION to the system table of OceanBase Database. During a scan, the system accesses external files based on this list. If other files are added to the external directory, you must perform an operation to update the external table to add the new files to the file list of the external table. For more information, see External file management.
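For example, after new files are added to the LOCATION directory, the file list can be refreshed with a statement of roughly the following form; see External file management for the exact syntax supported by your version:

```sql
-- Re-collect the file list for ext_t3 after new files are added
-- to the LOCATION directory (syntax per External file management):
ALTER EXTERNAL TABLE ext_t3 REFRESH;
```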
After you create an external table, you can drop it. The statement for dropping an external table is the same as that for dropping a regular table. You can use the DROP TABLE statement to drop an external table. For more information, see Drop a table.
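For example, the external table created earlier can be dropped as follows:

```sql
DROP TABLE ext_t3;
```

Dropping an external table removes only the table definition and its collected file list from the database; the external data files themselves are not deleted.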