Starting from OceanBase Database V4.3.5 BP1, you can directly read data from CSV, Parquet, ORC files, or an ODPS table by using the SELECT statement without manually creating an external table. This feature is suitable for the following typical AP scenarios:
- Data exploration
- Rapid data loading
URL external tables are a registration-free external data access feature of OceanBase Database: you specify the data file path and format directly in a SELECT or LOAD DATA statement, and the system parses the data dynamically and returns the result. You do not need to create an external table in advance.
Supported data sources and formats
| Data source type | Access method | Supported formats | Description |
|---|---|---|---|
| Local / object storage file | FILES() | CSV / Parquet / ORC | Supports local files, HDFS, OSS, S3, and object storage systems that are compatible with the S3 protocol |
| MaxCompute (ODPS) | SOURCE() | ODPS table | Connects directly to the ODPS Tunnel; no export required |
Notice
- CSV query results are of the VARCHAR type by default (you can use the CAST function to convert the data type later).
- Parquet and ORC schemas are inferred automatically, including nested structures.
- There is no limit on the size of a single file that can be read.
- ODPS does not support the FILES() function; you must use the SOURCE() function.
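Because CSV columns come back as VARCHAR, a CAST in the projection is the usual way to obtain typed results. A minimal sketch, assuming a CSV directory outfiles/ whose inferred columns are c1 and c2 (both hypothetical):

```sql
-- Hypothetical path and columns; CSV fields are read as VARCHAR,
-- so cast them explicitly before numeric operations.
SELECT CAST(c1 AS SIGNED)        AS id,
       CAST(c2 AS DECIMAL(10,2)) AS amount
FROM FILES(
  LOCATION = 'outfiles/',
  FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',')
);
```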
Query external data
General syntax
Method 1: Use the location URL format (simple)
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
Method 2: Use the table function format (flexible and recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
- 'outfiles/': specifies the path where the external files are stored. Subdirectories are recursively scanned.
- FORMAT: specifies the format and parsing options for the external files.
- PATTERN: specifies a regular expression pattern to filter files in the LOCATION directory. If not specified, all files are read by default.
Query examples in different formats
CSV files
General syntax
The URL external table supports two equivalent syntax forms for reading CSV files, with TYPE = 'CSV' specified:
Location URL form (concise)
SELECT * FROM 'outfiles/'
(
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Table function form (flexible, recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Note
- outfiles/ specifies the path where the external table files are stored; subdirectories are recursively scanned.
- FORMAT specifies the external file format as CSV and related parsing options.
- PATTERN specifies the regular expression pattern for filtering files in the LOCATION directory; if not specified, all files are read.
- Important: When you use a URL external table to read CSV files, all columns default to the VARCHAR type (use CAST to convert them to other types later).
The LOCATION parameter can be specified as a string path (e.g., 'oss://bucket/path/', 'outfiles/') or the name of an existing Location object (without quotes).
Example 1: Create a persistent external table and reference a Location object
This example is suitable for scenarios where the external table needs to be used long-term. The table metadata is stored in the database.
-- First, define a Location object named my_hdfs_loc
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Reference the Location object when creating the external table
CREATE EXTERNAL TABLE ex_t1 (
c1 INT,
c2 INT,
c3 INT
)
LOCATION = @my_hdfs_loc
FORMAT (
TYPE = 'csv',
FIELD_DELIMITER = ',',
LINE_DELIMITER = '\n'
);
Example 2: Reference a Location object in a temporary query using FILES()
This example is suitable for one-time analysis of external files without the need to create a persistent table.
-- Assume the Location object my_hdfs_loc already exists
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Directly reference the Location object in the FILES table function
SELECT * FROM FILES(
LOCATION = @my_hdfs_loc, -- Use the @ prefix to reference a defined Location
FORMAT (TYPE = 'PARQUET'),
PATTERN = '.*\\.parquet$' -- Match all .parquet files
);
Example 3: Specify the path directly without using a Location object
This example is suitable for simple or temporary scenarios. If the path does not need to be reused, you can directly write the URL or path in the query, supporting two equivalent syntax forms.
-- Method 1: Use the path as the table name (concise syntax)
SELECT * FROM '/data/'
(
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$' -- Note: PATTERN is a regular expression
);
-- Method 2: Use the FILES table function (explicit, recommended for complex configurations)
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$'
);
Support for header parsing
When accessing a CSV file using the FILES() function, if the table structure (column names and count) is not explicitly defined, the system automatically infers the table structure:
- Column count: inferred by sampling the first line of the CSV files in the specified directory.
- Column names: default to 'c1', 'c2', and so on.
Many CSV files have the first line as the header (column names), with the subsequent lines containing the actual data. OceanBase Database provides the PARSE_HEADER configuration parameter to control whether the first line is parsed as the header.
- If the first line of the CSV file contains column names, set PARSE_HEADER = TRUE.
- If the first line of the CSV file contains actual data, keep PARSE_HEADER = FALSE (the default).
-- Basic query, CSV file has no header (first line is data), using default column names c1, c2...
SELECT * FROM FILES(
LOCATION = 'oss://my-bucket/logs/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'user_log_202504.*\\.csv$'
);
-- CSV file has header (e.g., name, age, city), enable PARSE_HEADER
SELECT * FROM FILES(
LOCATION = '/data/sales.csv',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
PARSE_HEADER = TRUE -- Parse the first line as the header
)
);
Limitations: PARSE_HEADER and SKIP_HEADER are mutually exclusive and cannot be used together.
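With PARSE_HEADER = TRUE, the inferred column names come from the header line, so they can be referenced directly in projections and filters. A sketch assuming the header of sales.csv is name,age,city as in the example above (the filter value is hypothetical):

```sql
-- Assumes the header line is "name,age,city"; with PARSE_HEADER = TRUE
-- the columns can be projected and filtered by those names.
SELECT name, city
FROM FILES(
  LOCATION = '/data/sales.csv',
  FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', PARSE_HEADER = TRUE)
)
WHERE city = 'Hangzhou';
```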
Parquet files (schema inferred automatically)
-- Location URL format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
- 'outfiles/' specifies the path where the external table files are stored. Subdirectories are recursively scanned.
- FORMAT specifies the file format as PARQUET.
- PATTERN is used for file filtering (regular expression matching).
-- Example: Read Parquet partitioned files from S3
SELECT * FROM FILES(
LOCATION = 's3://analytics-bucket/events/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'part-.*\\.parquet$'
);
-- Read all Parquet files starting with "datafiles" in the '/data/' path
SELECT * FROM '/data/' (
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
ORC files
-- URL location format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
- 'outfiles/' specifies the path; subdirectories are supported.
- FORMAT specifies the format as ORC.
- PATTERN is used for file filtering.
-- Example: Read ORC files from HDFS
SELECT * FROM 'hdfs://mycluster/data/orc/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'clicks_202504.*'
);
MaxCompute (ODPS) tables (direct connection)
ODPS data sources are not file-based, so the FILES() function is not supported; use SOURCE() instead:
SELECT * FROM SOURCE(
TYPE = 'ODPS',
ACCESSID = 'LTAI5tXXXXXX',
ACCESSKEY = 'xxxxxxxxxxxxxx',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
PROJECT_NAME = 'sales_analytics',
TABLE_NAME = 'user_behavior'
);
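Because SOURCE() behaves like any other table expression, ordinary SQL clauses can be applied on top of the ODPS data. A sketch with the placeholder credentials from above and a hypothetical column behavior_type:

```sql
-- Placeholder credentials; aggregates ODPS rows without exporting them first.
-- behavior_type is a hypothetical column of the user_behavior table.
SELECT behavior_type, COUNT(*) AS cnt
FROM SOURCE(
  TYPE = 'ODPS',
  ACCESSID = 'LTAI5tXXXXXX',
  ACCESSKEY = 'xxxxxxxxxxxxxx',
  ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
  TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
  PROJECT_NAME = 'sales_analytics',
  TABLE_NAME = 'user_behavior'
)
GROUP BY behavior_type;
```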
Import external data (LOAD DATA)
Import external data into an internal table of OceanBase Database efficiently. Parallel import and direct write are supported.
Syntax
LOAD DATA
[/*+ INSERT HINT */]
[REPLACE | IGNORE]
FROM {
<url_table_function_expr> |
( SELECT expression_list FROM <url_table_function_expr> )
}
INTO TABLE table_name
[PARTITION (partition_name1, [partition_name2 ...])]
[(column_name_var [, column_name_var] ...)]
where <url_table_function_expr> is:
FILES (
LOCATION = '<string>',
{
FORMAT = (
TYPE = 'CSV',
LINE_DELIMITER = '<string>' | <expr>,
FIELD_DELIMITER = '<string>' | <expr>,
ESCAPE = '<character>' | <expr>,
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
ENCODING = 'charset',
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
SKIP_HEADER = <int>,
SKIP_BLANK_LINES = { TRUE | FALSE },
TRIM_SPACE = { TRUE | FALSE },
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
)
| FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
},
[PATTERN = '<regex_pattern>']
)
| SOURCE (
TYPE = 'ODPS',
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
TUNNEL_ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>'
)
LOCATION specifies the file path, which supports recursive scanning of subdirectories.
- For a local path, the format is file://path/ or /path/. The path must be a subdirectory of the secure_file_priv directory.
- For a remote path, the format is oss://bucket/path/ or s3://bucket/path/.
Note
Sensitive credentials, such as AccessKey, are encrypted and stored in system tables by OceanBase Database. You can omit them in the SQL statement by reusing the preconfigured LOCATION object.
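The @location reference shown earlier for SELECT can keep credentials out of a LOAD DATA statement as well. A sketch assuming a Location object my_oss_loc has already been created with the OSS endpoint and AccessKey, and that the @location form is accepted here:

```sql
-- Hypothetical Location object my_oss_loc; the credentials live in the
-- Location object, not in the statement text.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
  LOCATION = @my_oss_loc,
  FORMAT = (TYPE = 'CSV', FIELD_DELIMITER = ','),
  PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;
```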
Examples
Import CSV files (all columns or specified columns)
-- Import all columns from data1.csv
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;
-- Import only the c1 and c2 columns to the col1 and col2 columns of the target table
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM (
SELECT c1, c2 FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
)
INTO TABLE csv_ex_t1 (col1, col2);
Import Parquet files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/parquet',
FORMAT = (TYPE = 'PARQUET'),
PATTERN = 'data1.parquet'
)
INTO TABLE parquet_ex_t1;
Import ORC files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/orc',
FORMAT = (TYPE = 'ORC'),
PATTERN = 'data1.orc'
)
INTO TABLE orc_ex_t1;
Import from ODPS
LOAD DATA
FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = '$ODPS_ACCESSID',
ACCESSKEY = '$ODPS_ACCESSKEY',
ENDPOINT = '$ODPS_ENDPOINT',
PROJECT_NAME = 'example_project',
SCHEMA_NAME = '',
TABLE_NAME = 'example_table',
QUOTA_NAME = '',
COMPRESSION_CODE = ''
)
INTO TABLE odps_ex_t1;
Parameter description
LOCATION path format
| Type | Format | Description |
|---|---|---|
| Local file | file://data/ or /data/ | The secure_file_priv parameter must be configured, and the path must be a subdirectory of the secure_file_priv directory. |
| OSS | oss://bucket/path/ | An AccessKey must be configured for the OBServer node. |
| S3 | s3://bucket/path/ | Supports IAM Role or AccessKey. |
| HDFS | hdfs://namenode:8020/path/ | The Java SDK and Kerberos (if enabled) must be deployed. |
CSV format options (common)
| Parameter | Default value | Description |
|---|---|---|
| FIELD_DELIMITER | , | Field delimiter. |
| PARSE_HEADER | FALSE | Specifies whether to treat the first row as the column names. |
| SKIP_BLANK_LINES | FALSE | Specifies whether to skip blank lines. |
| EMPTY_FIELD_AS_NULL | FALSE | Specifies whether to convert empty fields to NULL. |
| ENCODING | utf8 | File encoding. |
Performance hints (for LOAD DATA)
/*+ DIRECT(TRUE) PARALLEL(8) */
- DIRECT(TRUE): bypasses the transaction layer and writes data directly to the storage engine (fast but not reversible).
- PARALLEL(N): specifies the number of files to read in parallel. N must be less than or equal to the number of files.
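These hints combine with the SELECT subquery form of LOAD DATA, which allows type conversion during a direct-path load. A sketch with hypothetical table and column names, assuming the source directory contains at least 8 CSV files so that PARALLEL(8) is valid:

```sql
-- Hypothetical target table fact_sales; casts the VARCHAR CSV fields
-- while loading with direct write and a parallelism of 8.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(8) */
FROM (
  SELECT CAST(c1 AS SIGNED), CAST(c2 AS DECIMAL(10,2))
  FROM FILES(
    LOCATION = 'data/csv',
    FORMAT = (TYPE = 'CSV', FIELD_DELIMITER = ','),
    PATTERN = '.*\\.csv$'
  )
)
INTO TABLE fact_sales (id, amount);
```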