URL external tables in OceanBase Database|V4.3.5|OceanBase Database| docs|Distributed Database

URL external tables in OceanBase Database

Last Updated：2026-05-11 02:24:09 Updated

OceanBase Database allows you to directly read data from CSV, Parquet, ORC files, or an ODPS table by using a SELECT statement, without the need to manually create an external table. This feature is particularly useful in the following typical AP scenarios:

Temporary data exploration
Rapid data loading

URL external tables in OceanBase Database are a metadata-free external data access feature. You can directly specify the data file path and format using a SELECT or LOAD DATA statement. The system dynamically parses and returns the results, without the need to create an external table in advance.

Supported data sources and formats

Data source type	Access method	Supported formats	Description
Local / object storage file	`FILES()`	CSV / Parquet / ORC	Supports HDFS, OSS, S3, and object storage compatible with the S3 protocol, as well as local files
MaxCompute (ODPS)	`SOURCE()`	ODPS table	Directly access ODPS Tunnel without export

Notice

The query result of a CSV file is of the VARCHAR type by default. You can use the CAST function to change the data type.

The schema of a Parquet or ORC file is automatically inferred. The schema contains nested structures.

There is no limit on the size of a file to be read at a time.

ODPS does not support the FILES() function. You must use the SOURCE() function.

Query external data

General syntax

Method 1: Use the Location URL form (simple)

SELECT * FROM 'outfiles/'
(
    FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
    PATTERN = 'regex'
);

Method 2: Use the Table Function form (flexible and recommended)

SELECT * FROM FILES(
    LOCATION = 'outfiles/',
    FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
    PATTERN = 'regex'
);

'outfiles/': specifies the path where external files are stored. Subdirectories can be recursively scanned.
FORMAT: specifies the format and parsing options of external files.
PATTERN: specifies a regular expression pattern to filter files in the LOCATION directory. If not specified, all files are read by default.

Examples of queries for each format

CSV file

General syntax

The URL external table supports two equivalent syntax forms for reading CSV files. You must specify TYPE = 'CSV':

Location URL form (simple)

SELECT * FROM 'outfiles/'
(
    FORMAT (
        TYPE = 'CSV',
        FIELD_DELIMITER = ','
    ),
    PATTERN = 'data$'
);

Table function form (flexible, recommended)

SELECT * FROM FILES(
    LOCATION = 'outfiles/',
    FORMAT (
        TYPE = 'CSV',
        FIELD_DELIMITER = ','
    ),
    PATTERN = 'data$'
);

Note

outfiles/ specifies the path where external table files are stored. Subdirectories can be recursively scanned.

FORMAT specifies the CSV file format and related parsing options.

PATTERN specifies a regular expression pattern to filter files in the LOCATION directory. If not specified, all files are read.

Important: When using the URL external table to read CSV files, all columns are of the VARCHAR type by default. You can convert them to other types using CAST later.

The LOCATION parameter can be a string path (e.g., 'oss://bucket/path/', 'outfiles/') or the name of a created Location object (without quotes).

Example 1: Create a persistent external table and reference a Location object

This example is suitable for scenarios where the external table needs to be used long-term. The table metadata is stored in the database.

-- Define a Location object named my_hdfs_loc
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';

-- Reference the Location object when creating the external table
CREATE EXTERNAL TABLE ex_t1 (
    c1 INT,
    c2 INT,
    c3 INT
)
LOCATION = @my_hdfs_loc
FORMAT (
    TYPE = 'csv',
    FIELD_DELIMITER = ',',
    LINE_DELIMITER = '\n'
);

Example 2: Reference a Location object in a temporary query using FILES()

This example is suitable for one-time analysis of external files without the need to create a persistent table.

-- Assume the Location object my_hdfs_loc already exists
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';

-- Reference the Location object directly in the FILES table function
SELECT * FROM FILES(
    LOCATION = @my_hdfs_loc,      -- Use the @ prefix to reference a defined Location
    FORMAT (TYPE = 'PARQUET'),
    PATTERN = '.*\\.parquet$'     -- Match all .parquet files
);

Example 3: Specify the path directly without using a Location object

This example is suitable for simple or temporary scenarios. If you do not need to reuse the path, you can directly write the URL or path in the query. Two equivalent syntax forms are supported.

-- Method 1: Use the path as the table name (simple syntax)
SELECT * FROM '/data/'
(
    FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
    PATTERN = '^datafiles.*\\.csv$'  -- Note: PATTERN is a regular expression
);

-- Method 2: Use the FILES table function (explicit, recommended for complex configurations)
SELECT * FROM FILES(
    LOCATION = '/data/',
    FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
    PATTERN = '^datafiles.*\\.csv$'
);

Support for header parsing

When accessing a CSV file through the FILES() function, if the table schema (column names and count) is not explicitly defined, the system automatically infers the schema:

Column count: The first line of the CSV file in the specified directory is sampled.
Column names: Default to 'c1', 'c2', etc.

Since many CSV files have the first line as the header (column names), and the subsequent lines as data, OceanBase provides the PARSE_HEADER configuration parameter to control whether the first line is parsed as the header.

If the first line of the CSV file contains column names: Set PARSE_HEADER = TRUE.
If the first line of the CSV file contains actual data: Keep PARSE_HEADER = FALSE (default).

-- Basic query, CSV file has no header (first line is data), use default column names c1, c2...
SELECT * FROM FILES(
    LOCATION = 'oss://my-bucket/logs/',
    FORMAT (
        TYPE = 'CSV',
        FIELD_DELIMITER = ',',
        SKIP_BLANK_LINES = TRUE
    ),
    PATTERN = 'user_log_202504.*\\.csv$'
);

-- CSV file has a header (e.g., name, age, city), enable PARSE_HEADER
SELECT * FROM FILES(
    LOCATION = '/data/sales.csv',
    FORMAT (
        TYPE = 'CSV',
        FIELD_DELIMITER = ',',
        PARSE_HEADER = TRUE  -- Parse the first line as the header
    )
);

Limitations: PARSE_HEADER and SKIP_HEADER are mutually exclusive and cannot be used together.

Parquet files (schema automatically inferred)

-- Location URL format
SELECT * FROM 'outfiles/'
(
    FORMAT (TYPE = 'PARQUET'),
    PATTERN = 'data$'
);

-- Table function format
SELECT * FROM FILES(
    LOCATION = 'outfiles/',
    FORMAT (TYPE = 'PARQUET'),
    PATTERN = 'data$'
);

'outfiles/' specifies the path where the external table files are stored. Subdirectories are recursively scanned.
FORMAT specifies the file format as PARQUET.
PATTERN is used for file filtering (regular expression matching).

-- Example: Read Parquet partitioned files from S3
SELECT * FROM FILES(
    LOCATION = 's3://analytics-bucket/events/',
    FORMAT (TYPE = 'PARQUET'),
    PATTERN = 'part-.*\\.parquet$'
);

-- Read all Parquet files starting with "datafiles" from the '/data/' path
SELECT * FROM '/data/' (
    FORMAT = (TYPE = 'PARQUET'),
    PATTERN = 'datafiles$'
);

SELECT * FROM FILES(
    LOCATION = '/data/',
    FORMAT = (TYPE = 'PARQUET'),
    PATTERN = 'datafiles$'
);

ORC files

-- URL location format
SELECT * FROM 'outfiles/'
(
    FORMAT (TYPE = 'ORC'),
    PATTERN = 'data$'
);

-- Table function format
SELECT * FROM FILES(
    LOCATION = 'outfiles/',
    FORMAT (TYPE = 'ORC'),
    PATTERN = 'data$'
);

'outfiles/' specifies the path, which supports subdirectories.
FORMAT specifies the format as ORC.
PATTERN is used for file filtering.

-- Example: Read ORC files from HDFS
SELECT * FROM 'hdfs://mycluster/data/orc/'
(
    FORMAT (TYPE = 'ORC'),
    PATTERN = 'clicks_202504.*'
);

MaxCompute tables (directly connected)

ODPS data sources are not in file format and do not support FILES(). Only SOURCE() is supported:

SELECT * FROM SOURCE(
    TYPE = 'ODPS',
    ACCESSID = 'LTAI5tXXXXXX',
    ACCESSKEY = 'xxxxxxxxxxxxxx',
    ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
    TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
    PROJECT_NAME = 'sales_analytics',
    TABLE_NAME = 'user_behavior'
);

Import external data (LOAD DATA)

You can efficiently import external data into an internal table of OceanBase Database. The operation supports parallel import and direct write.

Syntax

LOAD DATA
    [/*+ INSERT HINT */]
    [REPLACE | IGNORE]
    FROM {
        <url_table_function_expr> |
        ( SELECT expression_list FROM <url_table_function_expr> )
    }
    INTO TABLE table_name
    [PARTITION (partition_name1, [partition_name2 ...])]
    [(column_name_var [, column_name_var] ...)]

In the preceding syntax, <url_table_function_expr> is:

FILES (
    LOCATION = '<string>',
    {
        FORMAT = (
            TYPE = 'CSV',
            LINE_DELIMITER = '<string>' | <expr>,
            FIELD_DELIMITER = '<string>' | <expr>,
            ESCAPE = '<character>' | <expr>,
            FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
            ENCODING = 'charset',
            NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
            SKIP_HEADER = <int>,
            SKIP_BLANK_LINES = { TRUE | FALSE },
            TRIM_SPACE = { TRUE | FALSE },
            EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
        )
        | FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
    },
    [PATTERN = '<regex_pattern>']
)
| SOURCE (
    TYPE = 'ODPS',
    ACCESSID = '<string>',
    ACCESSKEY = '<string>',
    ENDPOINT = '<string>',
    TUNNEL_ENDPOINT = '<string>',
    PROJECT_NAME = '<string>',
    SCHEMA_NAME = '<string>',
    TABLE_NAME = '<string>',
    QUOTA_NAME = '<string>',
    COMPRESSION_CODE = '<string>'
)

LOCATION specifies the file path. Subdirectories can be recursively scanned.
For a local path, the format is file://path/ or /path/. The path must be a subdirectory of the secure_file_priv directory.
For a remote path, the format is oss://bucket/path/. Sensitive information such as AccessKey is encrypted and stored in the system table by OceanBase Database.

Note

Sensitive information such as AccessKey is encrypted and stored in the system table by OceanBase Database. You can omit the sensitive information in the SQL statement by reusing the preconfigured LOCATION object.

Examples

Import CSV files (all columns or specified columns)

-- Import all columns from data1.csv.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
    LOCATION = 'data/csv',
    FORMAT = (
        TYPE = 'CSV',
        FIELD_DELIMITER = ',',
        SKIP_BLANK_LINES = TRUE
    ),
    PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;

-- Import only the c1 and c2 columns to the col1 and col2 columns of the target table.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM (
    SELECT c1, c2 FROM FILES(
        LOCATION = 'data/csv',
        FORMAT = (
            TYPE = 'CSV',
            FIELD_DELIMITER = ',',
            SKIP_BLANK_LINES = TRUE
        ),
        PATTERN = 'data1.csv'
    )
)
INTO TABLE csv_ex_t1 (col1, col2);

Import Parquet files

LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
    LOCATION = 'data/parquet',
    FORMAT = (TYPE = 'PARQUET'),
    PATTERN = 'data1.parquet'
)
INTO TABLE parquet_ex_t1;

Import ORC files

LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
    LOCATION = 'data/orc',
    FORMAT = (TYPE = 'ORC'),
    PATTERN = 'data1.orc'
)
INTO TABLE orc_ex_t1;

Import from ODPS

LOAD DATA
FROM SOURCE (
    TYPE = 'ODPS',
    ACCESSID = '$ODPS_ACCESSID',
    ACCESSKEY = '$ODPS_ACCESSKEY',
    ENDPOINT = '$ODPS_ENDPOINT',
    PROJECT_NAME = 'example_project',
    SCHEMA_NAME = '',
    TABLE_NAME = 'example_table',
    QUOTA_NAME = '',
    COMPRESSION_CODE = ''
)
INTO TABLE odps_ex_t1;

Parameters

LOCATION path format

Type	Format	Description
Local file	`file://data/` or `/data/`	The `secure_file_priv` parameter must be configured, and the path must be a subdirectory of the `secure_file_priv` directory.
OSS	`oss://bucket/path/`	An AccessKey must be configured on the OBServer node.
S3	`s3://bucket/path/`	Supports IAM Role or AccessKey.
HDFS	`hdfs://namenode:8020/path/`	Java SDK and Kerberos must be deployed (if Kerberos is enabled).

CSV format options (common)

Parameter	Default value	Description
`FIELD_DELIMITER`	`,`	The field delimiter.
`PARSE_HEADER`	`FALSE`	Specifies whether to use the first line as the column names.
`SKIP_BLANK_LINES`	`FALSE`	Specifies whether to skip blank lines.
`EMPTY_FIELD_AS_NULL`	`FALSE`	Specifies whether to convert empty fields to NULL values.
`ENCODING`	`utf8`	The file encoding.

Performance hints (for LOAD DATA)

/*+ DIRECT(TRUE) PARALLEL(8) */

DIRECT(TRUE): Bypasses the transaction layer and directly writes data to the storage engine (high speed, but not rollable).
PARALLEL(N): Specifies the number of parallel file reads (N ≤ number of files).

OceanBase

Customer Stories

Documentation

URL external tables in OceanBase Database

Supported data sources and formats

Notice

Query external data

General syntax

Method 1: Use the Location URL form (simple)

Method 2: Use the Table Function form (flexible and recommended)

Examples of queries for each format

CSV file

Note

Parquet files (schema automatically inferred)

ORC files

MaxCompute tables (directly connected)

Import external data (LOAD DATA)

Syntax

Note

Examples

Import CSV files (all columns or specified columns)

Import Parquet files

Import ORC files

Import from ODPS

Parameters

LOCATION path format

CSV format options (common)

Performance hints (for LOAD DATA)

References