OceanBase Database allows you to directly read data from CSV, Parquet, ORC files, or an ODPS table by using a SELECT statement, without the need to manually create an external table. This feature is particularly useful in the following typical AP scenarios:
- Temporary data exploration
- Rapid data loading
URL external tables in OceanBase Database are a metadata-free external data access feature. You can directly specify the data file path and format using a SELECT or LOAD DATA statement. The system dynamically parses and returns the results, without the need to create an external table in advance.
Supported data sources and formats
Data source type |
Access method |
Supported formats |
Description |
|---|---|---|---|
| Local / object storage file | FILES() |
CSV / Parquet / ORC | Supports HDFS, OSS, S3, and object storage compatible with the S3 protocol, as well as local files |
| MaxCompute (ODPS) | SOURCE() |
ODPS table | Directly access ODPS Tunnel without export |
Notice
- The query result of a CSV file is of the
VARCHARtype by default. You can use theCASTfunction to change the data type.
- The schema of a Parquet or ORC file is automatically inferred. The schema contains nested structures.
- There is no limit on the size of a file to be read at a time.
- ODPS does not support the
FILES()function. You must use theSOURCE()function.
Query external data
General syntax
Method 1: Use the Location URL form (simple)
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
Method 2: Use the Table Function form (flexible and recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'format_type', FIELD_DELIMITER = ','),
PATTERN = 'regex'
);
'outfiles/': specifies the path where external files are stored. Subdirectories can be recursively scanned.FORMAT: specifies the format and parsing options of external files.PATTERN: specifies a regular expression pattern to filter files in theLOCATIONdirectory. If not specified, all files are read by default.
Examples of queries for each format
CSV file
General syntax
The URL external table supports two equivalent syntax forms for reading CSV files. You must specify TYPE = 'CSV':
Location URL form (simple)
SELECT * FROM 'outfiles/'
(
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Table function form (flexible, recommended)
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ','
),
PATTERN = 'data$'
);
Note
outfiles/specifies the path where external table files are stored. Subdirectories can be recursively scanned.
FORMATspecifies the CSV file format and related parsing options.
PATTERNspecifies a regular expression pattern to filter files in theLOCATIONdirectory. If not specified, all files are read.
- Important: When using the URL external table to read CSV files, all columns are of the VARCHAR type by default. You can convert them to other types using CAST later.
The LOCATION parameter can be a string path (e.g., 'oss://bucket/path/', 'outfiles/') or the name of a created Location object (without quotes).
Example 1: Create a persistent external table and reference a Location object
This example is suitable for scenarios where the external table needs to be used long-term. The table metadata is stored in the database.
-- Define a Location object named my_hdfs_loc
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Reference the Location object when creating the external table
CREATE EXTERNAL TABLE ex_t1 (
c1 INT,
c2 INT,
c3 INT
)
LOCATION = @my_hdfs_loc
FORMAT (
TYPE = 'csv',
FIELD_DELIMITER = ',',
LINE_DELIMITER = '\n'
);
Example 2: Reference a Location object in a temporary query using FILES()
This example is suitable for one-time analysis of external files without the need to create a persistent table.
-- Assume the Location object my_hdfs_loc already exists
CREATE LOCATION my_hdfs_loc URL = 'hdfs://namenode:8020/data/';
-- Reference the Location object directly in the FILES table function
SELECT * FROM FILES(
LOCATION = @my_hdfs_loc, -- Use the @ prefix to reference a defined Location
FORMAT (TYPE = 'PARQUET'),
PATTERN = '.*\\.parquet$' -- Match all .parquet files
);
Example 3: Specify the path directly without using a Location object
This example is suitable for simple or temporary scenarios. If you do not need to reuse the path, you can directly write the URL or path in the query. Two equivalent syntax forms are supported.
-- Method 1: Use the path as the table name (simple syntax)
SELECT * FROM '/data/'
(
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$' -- Note: PATTERN is a regular expression
);
-- Method 2: Use the FILES table function (explicit, recommended for complex configurations)
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT (TYPE = 'CSV', FIELD_DELIMITER = ',', SKIP_BLANK_LINES = TRUE),
PATTERN = '^datafiles.*\\.csv$'
);
Support for header parsing
When accessing a CSV file through the FILES() function, if the table schema (column names and count) is not explicitly defined, the system automatically infers the schema:
- Column count: The first line of the CSV file in the specified directory is sampled.
- Column names: Default to
'c1','c2', etc.
Since many CSV files have the first line as the header (column names), and the subsequent lines as data, OceanBase provides the PARSE_HEADER configuration parameter to control whether the first line is parsed as the header.
- If the first line of the CSV file contains column names: Set
PARSE_HEADER = TRUE. - If the first line of the CSV file contains actual data: Keep
PARSE_HEADER = FALSE(default).
-- Basic query, CSV file has no header (first line is data), use default column names c1, c2...
SELECT * FROM FILES(
LOCATION = 'oss://my-bucket/logs/',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'user_log_202504.*\\.csv$'
);
-- CSV file has a header (e.g., name, age, city), enable PARSE_HEADER
SELECT * FROM FILES(
LOCATION = '/data/sales.csv',
FORMAT (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
PARSE_HEADER = TRUE -- Parse the first line as the header
)
);
Limitations: PARSE_HEADER and SKIP_HEADER are mutually exclusive and cannot be used together.
Parquet files (schema automatically inferred)
-- Location URL format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'data$'
);
'outfiles/'specifies the path where the external table files are stored. Subdirectories are recursively scanned.FORMATspecifies the file format asPARQUET.PATTERNis used for file filtering (regular expression matching).
-- Example: Read Parquet partitioned files from S3
SELECT * FROM FILES(
LOCATION = 's3://analytics-bucket/events/',
FORMAT (TYPE = 'PARQUET'),
PATTERN = 'part-.*\\.parquet$'
);
-- Read all Parquet files starting with "datafiles" from the '/data/' path
SELECT * FROM '/data/' (
FORMAT = (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
SELECT * FROM FILES(
LOCATION = '/data/',
FORMAT = (TYPE = 'PARQUET'),
PATTERN = 'datafiles$'
);
ORC files
-- URL location format
SELECT * FROM 'outfiles/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
-- Table function format
SELECT * FROM FILES(
LOCATION = 'outfiles/',
FORMAT (TYPE = 'ORC'),
PATTERN = 'data$'
);
'outfiles/'specifies the path, which supports subdirectories.FORMATspecifies the format asORC.PATTERNis used for file filtering.
-- Example: Read ORC files from HDFS
SELECT * FROM 'hdfs://mycluster/data/orc/'
(
FORMAT (TYPE = 'ORC'),
PATTERN = 'clicks_202504.*'
);
MaxCompute tables (directly connected)
ODPS data sources are not in file format and do not support FILES(). Only SOURCE() is supported:
SELECT * FROM SOURCE(
TYPE = 'ODPS',
ACCESSID = 'LTAI5tXXXXXX',
ACCESSKEY = 'xxxxxxxxxxxxxx',
ENDPOINT = 'http://service.cn-hangzhou.maxcompute.aliyun.com/api',
TUNNEL_ENDPOINT = 'http://dt.cn-hangzhou.maxcompute.aliyun.com',
PROJECT_NAME = 'sales_analytics',
TABLE_NAME = 'user_behavior'
);
Import external data (LOAD DATA)
You can efficiently import external data into an internal table of OceanBase Database. The operation supports parallel import and direct write.
Syntax
LOAD DATA
[/*+ INSERT HINT */]
[REPLACE | IGNORE]
FROM {
<url_table_function_expr> |
( SELECT expression_list FROM <url_table_function_expr> )
}
INTO TABLE table_name
[PARTITION (partition_name1, [partition_name2 ...])]
[(column_name_var [, column_name_var] ...)]
In the preceding syntax, <url_table_function_expr> is:
FILES (
LOCATION = '<string>',
{
FORMAT = (
TYPE = 'CSV',
LINE_DELIMITER = '<string>' | <expr>,
FIELD_DELIMITER = '<string>' | <expr>,
ESCAPE = '<character>' | <expr>,
FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
ENCODING = 'charset',
NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
SKIP_HEADER = <int>,
SKIP_BLANK_LINES = { TRUE | FALSE },
TRIM_SPACE = { TRUE | FALSE },
EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
)
| FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
},
[PATTERN = '<regex_pattern>']
)
| SOURCE (
TYPE = 'ODPS',
ACCESSID = '<string>',
ACCESSKEY = '<string>',
ENDPOINT = '<string>',
TUNNEL_ENDPOINT = '<string>',
PROJECT_NAME = '<string>',
SCHEMA_NAME = '<string>',
TABLE_NAME = '<string>',
QUOTA_NAME = '<string>',
COMPRESSION_CODE = '<string>'
)
LOCATIONspecifies the file path. Subdirectories can be recursively scanned.- For a local path, the format is
file://path/or/path/. The path must be a subdirectory of thesecure_file_privdirectory. - For a remote path, the format is
oss://bucket/path/. Sensitive information such as AccessKey is encrypted and stored in the system table by OceanBase Database.
Note
Sensitive information such as AccessKey is encrypted and stored in the system table by OceanBase Database. You can omit the sensitive information in the SQL statement by reusing the preconfigured LOCATION object.
Examples
Import CSV files (all columns or specified columns)
-- Import all columns from data1.csv.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
INTO TABLE csv_ex_t1;
-- Import only the c1 and c2 columns to the col1 and col2 columns of the target table.
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM (
SELECT c1, c2 FROM FILES(
LOCATION = 'data/csv',
FORMAT = (
TYPE = 'CSV',
FIELD_DELIMITER = ',',
SKIP_BLANK_LINES = TRUE
),
PATTERN = 'data1.csv'
)
)
INTO TABLE csv_ex_t1 (col1, col2);
Import Parquet files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/parquet',
FORMAT = (TYPE = 'PARQUET'),
PATTERN = 'data1.parquet'
)
INTO TABLE parquet_ex_t1;
Import ORC files
LOAD DATA /*+ DIRECT(TRUE) PARALLEL(2) */
FROM FILES(
LOCATION = 'data/orc',
FORMAT = (TYPE = 'ORC'),
PATTERN = 'data1.orc'
)
INTO TABLE orc_ex_t1;
Import from ODPS
LOAD DATA
FROM SOURCE (
TYPE = 'ODPS',
ACCESSID = '$ODPS_ACCESSID',
ACCESSKEY = '$ODPS_ACCESSKEY',
ENDPOINT = '$ODPS_ENDPOINT',
PROJECT_NAME = 'example_project',
SCHEMA_NAME = '',
TABLE_NAME = 'example_table',
QUOTA_NAME = '',
COMPRESSION_CODE = ''
)
INTO TABLE odps_ex_t1;
Parameters
LOCATION path format
Type |
Format |
Description |
|---|---|---|
| Local file | file://data/ or /data/ |
The secure_file_priv parameter must be configured, and the path must be a subdirectory of the secure_file_priv directory. |
| OSS | oss://bucket/path/ |
An AccessKey must be configured on the OBServer node. |
| S3 | s3://bucket/path/ |
Supports IAM Role or AccessKey. |
| HDFS | hdfs://namenode:8020/path/ |
Java SDK and Kerberos must be deployed (if Kerberos is enabled). |
CSV format options (common)
Parameter |
Default value |
Description |
|---|---|---|
FIELD_DELIMITER |
, |
The field delimiter. |
PARSE_HEADER |
FALSE |
Specifies whether to use the first line as the column names. |
SKIP_BLANK_LINES |
FALSE |
Specifies whether to skip blank lines. |
EMPTY_FIELD_AS_NULL |
FALSE |
Specifies whether to convert empty fields to NULL values. |
ENCODING |
utf8 |
The file encoding. |
Performance hints (for LOAD DATA)
/*+ DIRECT(TRUE) PARALLEL(8) */
DIRECT(TRUE): Bypasses the transaction layer and directly writes data to the storage engine (high speed, but not rollable).PARALLEL(N): Specifies the number of parallel file reads (N ≤ number of files).
