You can execute the CREATE EXTERNAL TABLE statement to create an external table. When you create an external table, you must specify the path and format of the data files so that the system can read data from the external files.
Privilege requirements
To create an external table, the current user must have the CREATE TABLE privilege. For information about how to view the privileges of the current user, see View user privileges. If you do not have the CREATE TABLE privilege, contact the administrator to grant you the privilege. For information about how to grant privileges to a user, see Directly grant privileges.
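If the administrator grants the privilege directly, the statement might look like the following sketch. The user name test_user is hypothetical:

```sql
-- Executed by a user with administrative privileges.
-- "test_user" is a hypothetical user name.
GRANT CREATE TABLE TO test_user;
```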
Create an external table
The following statement is used to create an external table:
CREATE EXTERNAL TABLE table_name
( col_name col_type [AS (metadata$filecol{N})]
[ , col_name col_type [AS (metadata$filecol{N})] ]
[ , ... ] )
LOCATION = '<string>'
FORMAT = (
TYPE = 'CSV'
LINE_DELIMITER = '<string>' | <expr>
FIELD_DELIMITER = '<string>' | <expr>
ESCAPE = '<character>' | <expr>
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>
ENCODING = 'charset'
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...)
SKIP_HEADER = <int>
SKIP_BLANK_LINES = { TRUE | FALSE }
TRIM_SPACE = { TRUE | FALSE }
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
IGNORE_LAST_EMPTY_COLUMN = {TRUE | FALSE}
COMPRESSION = {GZIP | ZSTD | DEFLATE | NONE | SNAPPY_BLOCK}
PARALLEL_PARSE_ON_SINGLE_FILE = {TRUE | FALSE}
PARALLEL_PARSE_FILE_SIZE_THRESHOLD = <int>
MAX_ROW_LENGTH = <int>
)
[ PATTERN = '<regex_pattern>' ]
The related parameters are described below:
col_name col_type [AS (metadata$filecol{N})]: used to define a column. Here, AS (metadata$filecol{N}) specifies that the column is mapped manually. The external table supports the same types of columns as regular tables. For more information about the data types supported by OceanBase Database in Oracle mode, see Built-in data types.
By default, the data columns in an external file are automatically mapped to the columns defined in the external table in sequence, meaning that the first column of the external table is automatically mapped to the first column of the external file.
For example, in the following statement, the C1 column of the ext_t1 table is automatically mapped to the first column of the external file, and the C2 column of the ext_t1 table is automatically mapped to the second column of the external file.

CREATE EXTERNAL TABLE ext_t1 ( C1 int, C2 int )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

If the order of columns in the external file is inconsistent with the order of columns defined in the external table, you can specify which column in the external file corresponds to which column in the external table by using a pseudo column of the metadata$filecol{N} type. Note that file columns are numbered starting from 1.

For example, in the following statement, the C1 column of the ext_t2 table corresponds to the second column of the external file, and the C2 column of the ext_t2 table corresponds to the fourth column of the external file.

CREATE EXTERNAL TABLE ext_t2 ( C1 int AS (metadata$filecol2), C2 int AS (metadata$filecol4) )
LOCATION = 'oss://$ACCESS_ID:$ACCESS_KEY@$HOST/tpch_1g_data/lineitem/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);

Notice
If you manually define column mapping, the automatic mapping feature is disabled. In this case, all columns must be manually mapped.
LOCATION = '<string>': used to specify the path where external files are stored. Typically, data files of an external table are stored in a dedicated directory, which may contain subdirectories. When you create an external table, all files in this directory are automatically collected by the external table. The following formats are supported:
Local path format: LOCATION = '[file://] local_file_path'

local_file_path: can be a relative path or an absolute path. If you specify a relative path, the current directory must be the installation directory of OceanBase Database.

Notice
The local_file_path parameter specifies a directory, not a file. If you want to specify one specific file, specify the parent directory of the file in LOCATION, and then set the PATTERN attribute to match the file.

For the local path format, if you configure the secure_file_priv system variable to specify the paths that OceanBase Database can access, local_file_path must be a subpath of the directory specified by the secure_file_priv variable.

The tenant-level secure_file_priv system variable specifies the file paths that OceanBase Database can access during import and export operations. For more information about secure_file_priv, see secure_file_priv.
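Putting the pieces together, a local-path external table might be created as in the following sketch. The directory /home/admin/data/ and the secure_file_priv setting are assumptions for illustration:

```sql
-- Assumes secure_file_priv has been set to '/home/admin/' (hypothetical),
-- so the LOCATION below is a subpath of it.
CREATE EXTERNAL TABLE ext_local (
  C1 int,
  C2 int
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```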
Remote path format:

Notice
When you specify an object storage path, parameters are separated by the & symbol. Only uppercase and lowercase English letters, digits, the characters /-_$+=, and wildcards are supported as parameter values. If you enter other characters, the configuration may fail.

If the file is stored on Alibaba Cloud OSS or AWS S3, the format is LOCATION = '{oss|s3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'. In this format, $ACCESS_ID, $ACCESS_KEY, and $HOST are the access parameters that are required to access Alibaba Cloud OSS, AWS S3, or an object storage service that is compatible with the S3 protocol. s3_region specifies the region. These sensitive access parameters are stored in the system tables of the database in an encrypted manner.

If the file is stored on HDFS, you can specify the path in one of the following formats:
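For example, an S3 path might be specified as in the following sketch. The region and bucket path are hypothetical, and the $-prefixed placeholders must be replaced with real values:

```sql
CREATE EXTERNAL TABLE ext_s3 (
  C1 int,
  C2 int
)
LOCATION = 's3://$ACCESS_ID:$ACCESS_KEY@$HOST:us-east-1/my_bucket/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
);
```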
If you specify the address of a single NameNode (NN) to access the HDFS cluster, you can specify the path in the LOCATION = 'hdfs://localhost:port/PATH' format. Here, localhost specifies the address of the HDFS cluster, port specifies the port number of the HDFS cluster, and PATH specifies the file path in the HDFS cluster.

If you enable Kerberos authentication, you can specify the path in the LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx' format. Here:

principal: specifies the username for login authentication.
keytab: specifies the path of the user authentication key file.
krb5conf: specifies the path of the Kerberos configuration file.
configs: specifies additional HDFS configurations. The default value is empty. However, in a Kerberos environment, this parameter usually has a value, and you need to set it as required. Example: dfs.data.transfer.protection=authentication,privacy, which sets the data transfer protection level to authentication and privacy.
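A Kerberos-enabled HDFS path might look like the following sketch. The NameNode address, principal, and key file paths are all hypothetical:

```sql
CREATE EXTERNAL TABLE ext_hdfs (
  C1 int,
  C2 int
)
LOCATION = 'hdfs://192.168.0.1:8020/user/data?principal=hdfs_user&keytab=/etc/security/user.keytab&krb5conf=/etc/krb5.conf&configs=dfs.data.transfer.protection=authentication'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```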
If you enable Hadoop high availability (HA), you can specify the path in the LOCATION = 'hdfs://nameserviceID/PATH' format. Here, nameserviceID specifies the ID of the logical naming service of Hadoop HA, and PATH specifies the file path.

Note
Ensure that the configuration of the OBServer node contains the nameservice definition and failover strategy of the HA cluster.

For Kerberos authentication, the format is as follows:

LOCATION = 'hdfs://nameserviceID/PATH?principal=xxx&keytab=xxx&krb5conf=xxx&configs=dfs.data.transfer.protection=${string}#dfs.nameservices=${nameservice id}#dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}#dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}#dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}#dfs.ha.automatic-failover.enabled.${nameservice id}=true#dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'

where:

principal: specifies the user for login authentication. Set it to the principal of a non-primary NameNode.
keytab and krb5conf: set them the same as for a single-NameNode deployment.
configs: specifies additional HDFS configuration items. You can set only HA-related and security-related configuration items:
dfs.data.transfer.protection=${string}: specifies the dfs.data.transfer.protection parameter of the cluster.
dfs.nameservices=${nameservice id}: specifies the nameservice alias of the current HA cluster.
dfs.ha.namenodes.${nameservice id}=${namenode1},${namenode2}: specifies the list of NameNode IDs in the HA cluster.
dfs.namenode.rpc-address.${nameservice id}.${namenode1}=${namenode 1 address}: specifies the address of the namenode1 NameNode to facilitate client routing.
dfs.namenode.rpc-address.${nameservice id}.${namenode2}=${namenode 2 address}: specifies the address of the namenode2 NameNode to facilitate client routing.
dfs.ha.automatic-failover.enabled.${nameservice id}=true: specifies whether the HA cluster automatically obtains an available NameNode to respond to requests.
dfs.client.failover.proxy.provider.${nameservice id}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider: specifies the logic class for primary/standby switching in the HA cluster. You can also customize and upload the failover logic required by the HA cluster.
Notice
HA-related configuration items are bound to the namespace. For example, in sample three, mycluster is the namespace of the HA cluster. Make sure that the related configuration items are set correctly.
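Without Kerberos, an HA path only needs the nameservice ID. In the following sketch, mycluster is a hypothetical nameservice ID that must already be defined in the OBServer node's HDFS configuration:

```sql
CREATE EXTERNAL TABLE ext_hdfs_ha (
  C1 int,
  C2 int
)
LOCATION = 'hdfs://mycluster/user/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
);
```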
FORMAT = ( TYPE = 'CSV'...): specifies the external file format as CSV. The parameters are as follows:

TYPE: specifies the type of the external file.
LINE_DELIMITER: specifies the line delimiter of the file.
FIELD_DELIMITER: specifies the column delimiter of the file.
ESCAPE: specifies the escape character of the file.
FIELD_OPTIONALLY_ENCLOSED_BY: specifies the symbol used to enclose field values in the file. For example, FIELD_OPTIONALLY_ENCLOSED_BY = '"' indicates that values are enclosed in double quotation marks. If not specified, the default value is empty.

Notice
If the external data file contains NULL values (actual NULL values, not the string "NULL"), you must explicitly configure the FIELD_OPTIONALLY_ENCLOSED_BY parameter, and the value of this parameter cannot be empty.

ENCODING: specifies the character set encoding of the file. For information about all character sets supported in Oracle mode, see Character sets. If not specified, the default value is UTF8MB4.
NULL_IF: specifies the strings to be treated as NULL. If not specified, the default value is empty.
SKIP_HEADER: specifies the number of rows to skip from the file header. If not specified, the file header is not skipped.
SKIP_BLANK_LINES: specifies whether to skip blank lines. If not specified, the default value is FALSE.
TRIM_SPACE: specifies whether to remove leading and trailing spaces from fields in the file. If not specified, the default value is FALSE.
EMPTY_FIELD_AS_NULL: specifies whether to treat empty strings as NULL. If not specified, the default value is FALSE.
IGNORE_LAST_EMPTY_COLUMN: specifies whether to ignore the last field of a line when the line ends with an empty field (that is, when the line delimiter is directly preceded by a column delimiter). The default value is TRUE, indicating that the last empty field is ignored.
COMPRESSION: optional. Specifies the compression format of the file. Valid values:
GZIP/DEFLATE: specifies a GZIP-compressed file.
ZSTD: specifies a ZSTD-compressed file.
NONE: specifies an uncompressed file. This is the default value.
SNAPPY_BLOCK: specifies a Hadoop-Snappy-compressed file.
PARALLEL_PARSE_ON_SINGLE_FILE: specifies whether to enable parallel parsing of a single CSV file. If not specified, the default value is TRUE.
PARALLEL_PARSE_FILE_SIZE_THRESHOLD: specifies the file size threshold, in bytes, above which a single file is parsed in parallel. The default value is 256 MB.
MAX_ROW_LENGTH: specifies the maximum length, in bytes, of a single row of data. The default value is 2 MB. This parameter is used in the boundary determination stage of parallel parsing of a single CSV file and does not affect the accuracy of data reading.
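Several of the CSV options above can be combined in one FORMAT clause, as in the following sketch. The table name, directory, and file layout are hypothetical:

```sql
-- A CSV file with one header row, quoted fields, and GZIP compression.
CREATE EXTERNAL TABLE ext_csv (
  ID   int,
  NAME VARCHAR2(30)
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('NULL')
  COMPRESSION = GZIP
);
```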
FORMAT = ( TYPE = 'PARQUET'...): specifies the external file format as PARQUET.

Note
Starting from OceanBase Database V4.6.0, the following features are supported for external tables in the Parquet/ORC file formats:
- Reading complex data types, including Array, Map, and JSON. The data types of the external table columns must be compatible with those supported by OceanBase Database. For more information, see Overview of built-in data types.
- Reading Bloom filters to filter data based on query predicates, thereby reducing unnecessary I/O and accelerating queries.
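A Parquet external table is declared in the same way, with only the FORMAT type changed. The path below is hypothetical, and the column list must be compatible with the schema of the Parquet files:

```sql
CREATE EXTERNAL TABLE ext_parquet (
  ID   int,
  NAME VARCHAR2(30)
)
LOCATION = '/home/admin/parquet_data/'
FORMAT = (
  TYPE = 'PARQUET'
);
```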
PATTERN: specifies the regular expression pattern used to filter files in the LOCATION directory. For each file in the LOCATION directory, if the file matches the pattern, the external table can access it; otherwise, the file is skipped. If this parameter is not specified, all files in the LOCATION directory are accessible.
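For example, the following sketch limits a hypothetical external table to the files ending in .csv under the LOCATION directory:

```sql
CREATE EXTERNAL TABLE ext_pattern (
  C1 int,
  C2 int
)
LOCATION = '/home/admin/data/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
)
PATTERN = '.*\.csv';
```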
Assume that the /home/admin/oceanbase/ directory on the local machine contains a data.csv file with the following content.
1,"lin",98
2,"hei",90
3,"ali",95
Connect to the Oracle tenant of the cluster as the tenant administrator on the OBServer node using the local Unix socket.
Here is an example:
obclient -S /home/admin/oceanbase/run/sql.sock -uroot@sys -p********

For more information about how to connect to OceanBase Database using the local Unix socket, see secure_file_priv.
Configure the database to be able to access the /home/admin/oceanbase/ directory:

SET GLOBAL secure_file_priv = "/home/admin/oceanbase/";

After executing the command, you need to restart the session for the change to take effect.
Reconnect to the database and create an external table named ext_t3:

CREATE EXTERNAL TABLE ext_t3 (
  ID NUMBER(32),
  NAME VARCHAR2(30),
  SCORE NUMBER(32)
)
LOCATION = '/home/admin/oceanbase/'
FORMAT = (
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
PATTERN = 'data.csv';
After the external table is created, you can use the SHOW CREATE TABLE statement to view the table definition, just like with a regular table.
SHOW CREATE TABLE ext_t3;
The query result is as follows:
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TABLE | CREATE TABLE |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXT_T3 | CREATE EXTERNAL TABLE "EXT_T3" (
"ID" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL1),
"NAME" VARCHAR2(30) GENERATED ALWAYS AS (METADATA$FILECOL2),
"SCORE" NUMBER(32) GENERATED ALWAYS AS (METADATA$FILECOL3)
)
LOCATION='file:///home/admin/oceanbase/'
PATTERN='data.csv'
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
FIELD_OPTIONALLY_ENCLOSED_BY = '"',
ENCODING = 'utf8mb4'
)COMPRESS FOR ARCHIVE REPLICA_NUM = 1 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
You can also access the external table like a regular table. When you query the external table, the system reads the external file directly through the external table driver, parses it according to the file format, converts the data into internal data types of OceanBase Database, and returns the data rows. Here is an example of querying the newly created external table ext_t3.
SELECT * FROM ext_t3;
The query result is as follows:
+----+------+-------+
| ID | NAME | SCORE |
+----+------+-------+
| 1 | lin | 98 |
| 2 | hei | 90 |
| 3 | ali | 95 |
+----+------+-------+
3 rows in set
In addition, you can perform combined queries with external tables and regular tables. Assume that the current database contains a regular table named info with the following data:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| hei | male | 9 |
| li | female | 8 |
+------+--------+------+
3 rows in set
Here is an example of combining the external table ext_t3 with the regular table info.
SELECT info.* FROM info, ext_t3 WHERE info.name = ext_t3.name AND ext_t3.score > 90;
The query result is as follows:
+------+--------+------+
| NAME | SEX | AGE |
+------+--------+------+
| lin | male | 8 |
| li | female | 8 |
+------+--------+------+
2 rows in set
For more information about query operations, see Read data.
Considerations
You can query external tables only. You cannot perform DML operations on external tables.
If the external files accessed by an external table have been deleted, a query does not return an error; it returns an empty result instead.
If the external storage system is unavailable, an error is returned when you query an external table.
What to do next
When you create an external table, the system saves the list of files that match the PATTERN in the specified path in the LOCATION clause to the system table of OceanBase Database. When you scan the external table, the system accesses the external files based on the list. If other files are added to the external directory, you must update the external table files to add the new files to the file list of the external table. For more information, see Manage external files.
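The exact refresh syntax is described in Manage external files; a typical statement is sketched below for the ext_t3 table and should be verified against that topic:

```sql
-- Rescan the LOCATION directory and update the external table's file list.
ALTER EXTERNAL TABLE ext_t3 REFRESH;
```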
After you create an external table, you can delete it when it is no longer needed. The statement for deleting an external table is the same as that for deleting a regular table: execute the DROP TABLE statement. For more information, see Delete a table.
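For example, the following statement deletes the ext_t3 external table created above:

```sql
DROP TABLE ext_t3;
```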
