Starting from OceanBase Database V4.3.5 BP1, you can directly read data from CSV, Parquet, ORC files, or an ODPS table by using the SELECT statement without manually creating an external table. This feature is suitable for the following typical AP scenarios:
- Data exploration
- Rapid data loading
URL external tables are a registration-free external data access feature of OceanBase Database: you specify the data file path and format directly in a SELECT or LOAD DATA statement, and the system parses the data dynamically and returns the result. You do not need to create an external table in advance.
Supported data sources and formats
| Data source type | Access method | Supported formats | Description |
|---|---|---|---|
| Local / object storage file | FILES() | CSV / Parquet / ORC | Supports local files, HDFS, OSS, S3, and object storage systems that are compatible with the S3 protocol |
| MaxCompute (ODPS) | SOURCE() | ODPS table | Connects directly to the ODPS Tunnel; no export required |
Notice
- CSV query results are of the VARCHAR type by default (you can use the CAST function to convert the data type later).
- Parquet and ORC schemas are inferred automatically, including nested structures.
- There is no limit on the size of a single file that can be read.
- ODPS does not support the FILES() function; you must use the SOURCE() function.
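Because CSV columns come back as VARCHAR, a CAST in the projection is the usual way to obtain typed results. A minimal sketch, assuming a CSV directory outfiles/ whose inferred columns are c1 and c2 (both hypothetical):

```sql
-- Hypothetical path and columns; CSV fields are read as VARCHAR,
-- so cast them explicitly before numeric operations.
SELECT CAST(c1 AS SIGNED)        AS id,
       CAST(c2 AS DECIMAL(10,2)) AS amount
FROM FILES(
  LOCATION = 'outfiles/',
  FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',')
);
```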
Query external data
General syntax
Method 1: Use the location URL format (simple)
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
Method 2: Use the table function format (flexible and recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
- 'outfiles/': specifies the path where the external files are stored. Subdirectories are recursively scanned.
- FORMAT: specifies the format and parsing options for the external files.
- PATTERN: specifies a regular expression pattern to filter files in the LOCATION directory. If not specified, all files are read by default.
Query examples in different formats
CSV files
General syntax
The URL external table supports two equivalent syntax forms for reading CSV files, with TYPE = 'CSV' specified:
Location URL form (concise)
SELECT * FROM 'outfiles/'
(
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Table function form (flexible, recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Note
- outfiles/ specifies the path where the external table files are stored; subdirectories are recursively scanned.
- FORMAT specifies the external file format as CSV and related parsing options.
- PATTERN specifies the regular expression pattern for filtering files in the LOCATION directory; if not specified, all files are read.
- Important: When you use a URL external table to read CSV files, all columns default to the VARCHAR type (use CAST to convert them to other types later).
The LOCATION parameter can be specified as a string path (e.g., 'oss://bucket/path/', 'outfiles/') or the name of an existing Location object (without quotes).
Example 1: Create a persistent external table and reference a Location object
This example is suitable for scenarios where the external table needs to be used long-term. The table metadata is stored in the database.
-- First, define a Location object named my_hdfs_loc
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Reference the Location object when creating the external table
CREATE EXTERNAL TABLE ex_t1 (
c1 INT,
c2 INT,
c3 INT
)
LOCATION = @my_hdfs_loc
FORMAT (
TYPE = 'csv',
FIELD_DELIMITER = ',',
LINE_DELIMITER = '\n'
);
Example 2: Reference a Location object in a temporary query using FILES()
This example is suitable for one-time analysis of external files without the need to create a persistent table.
-- Assume the Location object my_hdfs_loc already exists
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Directly reference the Location object in the FILES table function
SELECT * FROM FILES(
LOCATION = @my_hdfs_loc, -- Use the @ prefix to reference a defined Location
FORMAT (TYPE = 'PARQUET'),
PATTERN = '.*\\.parquet$' -- Match all .parquet files
);
Example 3: Specify the path directly without using a Location object
This example is suitable for simple or temporary scenarios. If the path does not need to be reused, you can directly write the URL or path in the query, supporting two equivalent syntax forms.
-- Method 1: Use the path as the table name (concise syntax)
SELECT * FROM '/data/'
(
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$' -- Note: PATTERN is a regular expression
);
-- Method 2: Use the FILES table function (explicit, recommended for complex configurations)
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$'
);
Support for header parsing
When accessing a CSV file using the FILES() function, if the table structure (column names and count) is not explicitly defined, the system automatically infers the table structure:
- Column count: inferred by sampling the first line of the CSV files in the specified directory.
- Column names: default to 'c1', 'c2', and so on.
Many CSV files have the first line as the header (column names), with the subsequent lines containing the actual data. OceanBase Database provides the PARSE_HEADER configuration parameter to control whether the first line is parsed as the header.
- If the first line of the CSV file contains column names, set PARSE_HEADER = TRUE.
- If the first line of the CSV file contains actual data, keep PARSE_HEADER = FALSE (the default).
-- Basic query, CSV file has no header (first line is data), using default column names c1, c2...
SELECT * FROM FILES(
LOCATION = 'oss://my-bucket/logs/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'user_log_202504.*\\.csv$'
);
-- CSV file has header (e.g., name, age, city), enable PARSE_HEADER
SELECT * FROM FILES(
LOCATION = '/data/sales.csv',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
PARSE_HEADER = TRUE -- Parse the first line as the header
)
);
Limitations: PARSE_HEADER and SKIP_HEADER are mutually exclusive and cannot be used together.
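With PARSE_HEADER = TRUE, the inferred column names come from the header line, so they can be referenced directly in projections and filters. A sketch assuming the header of sales.csv is name,age,city as in the example above (the filter value is hypothetical):

```sql
-- Assumes the header line is "name,age,city"; with PARSE_HEADER = TRUE
-- the columns can be projected and filtered by those names.
SELECT name, city
FROM FILES(
  LOCATION = '/data/sales.csv',
  FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', PARSE_HEADER = TRUE)
)
WHERE city = 'Hangzhou';
```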
Parquet files (schema inferred automatically)
-- Location URL format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
- 'outfiles/' specifies the path where the external table files are stored. Subdirectories are recursively scanned.
- FORMAT specifies the file format as PARQUET.
- PATTERN is used for file filtering (regular expression matching).
-- Example: Read Parquet partitioned files from S3
SELECT * FROM FILES(
LOCATION = 's3://analytics-bucket/events/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'part-.*\\.parquet$'
);
-- Read all Parquet files starting with "datafiles" in the '/data/' path
SELECT * FROM '/data/' (
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
ORC files
-- URL location format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
- 'outfiles/' specifies the path; subdirectories are supported.
- FORMAT specifies the format as ORC.
- PATTERN is used for file filtering.
-- Example: Read ORC files from HDFS
SELECT * FROM 'hdfs://mycluster/data/orc/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'clicks_202504.*'
);
MaxCompute (ODPS) tables (direct connection)
ODPS data sources are not file-based, so the FILES() function is not supported; use SOURCE() instead:
SELECT * FROM SOURCE(
TYPE = 'ODPS',
ACCESSID = 'LTAI5tXXXXXX',
ACCESSKEY = 'xxxxxxxxxxxxxx',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
PROJECT_NAME = 'sales_analytics',
TABLE_NAME = 'user_behavior'
);
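Because SOURCE() behaves like any other table expression, ordinary SQL clauses can be applied on top of the ODPS data. A sketch with the placeholder credentials from above and a hypothetical column behavior_type:

```sql
-- Placeholder credentials; aggregates ODPS rows without exporting them first.
-- behavior_type is a hypothetical column of the user_behavior table.
SELECT behavior_type, COUNT(*) AS cnt
FROM SOURCE(
  TYPE = 'ODPS',
  ACCESSID = 'LTAI5tXXXXXX',
  ACCESSKEY = 'xxxxxxxxxxxxxx',
  ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
  TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
  PROJECT_NAME = 'sales_analytics',
  TABLE_NAME = 'user_behavior'
)
GROUP BY behavior_type;
```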
Import external data (LOAD DATA)
Import external data into an internal table of OceanBase Database efficiently. Parallel import and direct write are supported.
Syntax
LOAD DATA
[/*+ INSERT HINT */]
[REPLACE | IGNORE]
FROM {
<url_table_function_expr> |
( SELECT expression_list FROM <url_table_function_expr> )
}
INTO TABLE table_name
[PARTITION (partition_name1, [partition_name2 ...])]
[(column_name_var [, column_name_var] ...)]
where <url_table_function_expr> is:
FILES (
LOCATION = '<string>',
{
FORMAT = (
TYPE = 'CSV',
LINE_DELIMITER = '<string>' | <expr>,
FIELD_DELIMITER = '<string>' | <expr>,
ESCAPE = '<character>' | <expr>,
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
ENCODING = 'charset',
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
SKIP_HEADER = <int>,
SKIP_BLANK_LINES = { TRUE | FALSE },
TRIM_SPACE = { TRUE | FALSE },
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
)
| FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
},
[PATTERN = '<regex_pattern>']
)
| SOURCE (
TYPE = 'ODPS',
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
TUNNEL_ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>'
)
LOCATION specifies the file path, which supports recursive scanning of subdirectories.
- For a local path, the format is file://path/ or /path/. The path must be a subdirectory of the secure_file_priv directory.
- For a remote path, the format is oss://bucket/path/ or s3://bucket/path/.
Note
Sensitive credentials, such as AccessKey, are encrypted and stored in system tables by OceanBase Database. You can omit them in the SQL statement by reusing the preconfigured LOCATION object.
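The @location reference shown earlier for SELECT can keep credentials out of a LOAD DATA statement as well. A sketch assuming a Location object my_oss_loc has already been created with the OSS endpoint and AccessKey, and that the @location form is accepted here:

```sql
-- Hypothetical Location object my_oss_loc; the credentials live in the
-- Location object, not in the statement text.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
  LOCATION = @my_oss_loc,
  FORMAT = (TYPE = 'CSV', FIELD_DELIMITER = ','),
  PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;
```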
Examples
Import CSV files (all columns or specified columns)
-- Import all columns from data1.csv
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;
-- Import only the c1 and c2 columns to the col1 and col2 columns of the target table
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM (
SELECT c1, c2 FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
)
INTO TABLE csv_ex_t1 (col1, col2);
Import Parquet files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/parquet',
FORMAT = (TYPE = 'PARQUET'),
PATTERN = 'data1.parquet'
)
INTO TABLE parquet_ex_t1;
Import ORC files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/orc',
FORMAT = (TYPE = 'ORC'),
PATTERN = 'data1.orc'
)
INTO TABLE orc_ex_t1;
Import from ODPS
LOAD DATA
FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = '$ODPS_ACCESSID',
ACCESSKEY = '$ODPS_ACCESSKEY',
ENDPOINT = '$ODPS_ENDPOINT',
PROJECT_NAME = 'example_project',
SCHEMA_NAME = '',
TABLE_NAME = 'example_table',
QUOTA_NAME = '',
COMPRESSION_CODE = ''
)
INTO TABLE odps_ex_t1;
Parameter description
LOCATION path format
| Type | Format | Description |
|---|---|---|
| Local file | file://data/ or /data/ | The secure_file_priv parameter must be configured, and the path must be a subdirectory of the secure_file_priv directory. |
| OSS | oss://bucket/path/ | An AccessKey must be configured for the OBServer node. |
| S3 | s3://bucket/path/ | Supports IAM Role or AccessKey. |
| HDFS | hdfs://namenode:8020/path/ | The Java SDK and Kerberos (if enabled) must be deployed. |
CSV format options (common)
| Parameter | Default value | Description |
|---|---|---|
| FIELD_DELIMITER | , | Field delimiter. |
| PARSE_HEADER | FALSE | Specifies whether to treat the first row as the column names. |
| SKIP_BLANK_LINES | FALSE | Specifies whether to skip blank lines. |
| EMPTY_FIELD_AS_NULL | FALSE | Specifies whether to convert empty fields to NULL. |
| ENCODING | utf8 | File encoding. |
Performance hints (for LOAD DATA)
/*+ DIRECT(TRUE) PARALLEL(8) */
- DIRECT(TRUE): bypasses the transaction layer and writes data directly to the storage engine (fast but not reversible).
- PARALLEL(N): specifies the number of files to read in parallel. N must be less than or equal to the number of files.
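These hints combine with the SELECT subquery form of LOAD DATA, which allows type conversion during a direct-path load. A sketch with hypothetical table and column names, assuming the source directory contains at least 8 CSV files so that PARALLEL(8) is valid:

```sql
-- Hypothetical target table fact_sales; casts the VARCHAR CSV fields
-- while loading with direct write and a parallelism of 8.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(8) */
FROM (
  SELECT CAST(c1 AS SIGNED), CAST(c2 AS DECIMAL(10,2))
  FROM FILES(
    LOCATION = 'data/csv',
    FORMAT = (TYPE = 'CSV', FIELD_DELIMITER = ','),
    PATTERN = '.*\\.csv$'
  )
)
INTO TABLE fact_sales (id, amount);
```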