Purpose
OceanBase Database supports the following input files for the `LOAD DATA` statement:

- Server-side files (on OBServer nodes): These files are located on the OBServer nodes of OceanBase Database. You can use the `LOAD DATA INFILE` or `LOAD DATA FROM URL` statement to load data from these server-side files into database tables.
- Client-side files (local files): These files are located on the local file system of the client. You can use the `LOAD DATA LOCAL INFILE` or `LOAD DATA FROM URL` statement to load data from these client-side files into database tables.

  Note

  When you execute the `LOAD DATA LOCAL INFILE` statement in OceanBase Database, the system automatically adds the `IGNORE` option.

- OSS files: These files are located on the OSS file system. You can use the `LOAD DATA REMOTE_OSS INFILE` statement to load data from these OSS files into database tables.
The `LOAD DATA` statement can currently import text files in CSV format. The import process consists of the following steps:

1. File parsing: OceanBase Database reads the data from the specified file and parses it in parallel or serially based on the specified parallelism.
2. Data distribution: Because OceanBase Database is a distributed database, data in different partitions may be distributed across different OBServer nodes. The `LOAD DATA` statement processes the parsed data and determines which OBServer node each row should be sent to.
3. Data insertion: After the target OBServer node receives the data, it executes the `INSERT` statement to insert the data into the corresponding partition.
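The three stages above can be sketched as follows. This is an illustrative model only, not OceanBase internals; the hash-based routing on the first column and the batch shapes are assumptions made for the sketch.

```python
# Illustrative model of the three LOAD DATA stages (parse, distribute, insert).
# Not OceanBase internals: the hash routing and batch layout are assumptions.
import csv
import io

def load_data(csv_text, num_partitions, batch_size=100):
    # 1. File parsing: read the input and split it into rows.
    rows = list(csv.reader(io.StringIO(csv_text)))
    # 2. Data distribution: route each row to the node owning its partition
    #    (here: a simple hash on the first column).
    buckets = {p: [] for p in range(num_partitions)}
    for row in rows:
        buckets[hash(row[0]) % num_partitions].append(row)
    # 3. Data insertion: each target node would INSERT its rows in batches
    #    of batch_size (compare the load_batch_size hint).
    return {p: [b[i:i + batch_size] for i in range(0, len(b), batch_size)]
            for p, b in buckets.items()}

batches = load_data("1,a\n2,b\n3,c\n", num_partitions=2, batch_size=2)
total_rows = sum(len(batch) for plan in batches.values() for batch in plan)
```

Every input row ends up in exactly one partition's batch plan, which is why the statement can parse and insert in parallel without coordination between partitions.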
Considerations
- You cannot use the `LOAD DATA` statement on a table with triggers.
- To import data from an external file, you must have the `FILE` privilege. You can grant it with `GRANT FILE ON *.* TO $user_name;`, where `$user_name` is the user who needs to execute the `LOAD DATA` statement. The following settings are also required:
  - When loading a server-side file, set the system variable `secure_file_priv` to specify the path for importing or exporting files.
  - When loading a local client-side file, add the `--local-infile[=1]` option when starting the MySQL client or OBClient to enable data loading from the local file system.
- When using partitioned table import, the target table cannot be a replicated table, and it cannot contain auto-increment columns, identity columns, or global indexes.
Syntax
```sql
-- Load data from a file
LOAD DATA
    [/*+ PARALLEL(N) [load_batch_size(M)] [APPEND | direct(bool, int, [load_mode])] | NO_DIRECT */]
    [REMOTE_OSS | LOCAL] INFILE 'file_name'
    [REPLACE | IGNORE]
    INTO TABLE table_name [PARTITION(PARTITION_OPTION)]
    [COMPRESSION [=] {AUTO|NONE|GZIP|DEFLATE|ZSTD}]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(column_name_var
        [, column_name_var] ...)]

load_mode:
    'full'
  | 'inc_replace'

PARTITION_OPTION:
    partition_option_list
  | subpartition_option_list
```
```sql
-- Load data from a URL
LOAD DATA
    [/*+ PARALLEL(N) [load_batch_size(M)] [APPEND | direct(bool, int, [load_mode])] | NO_DIRECT */]
    [REPLACE | IGNORE]
    FROM { url_table_function_expr |
         ( SELECT expression_list FROM url_table_function_expr ) }
    INTO TABLE table_name
    [PARTITION(PARTITION_OPTION)]
    [(column_name_var [, column_name_var] ...)]
    [LOG ERRORS [REJECT LIMIT {integer | UNLIMITED}]]

load_mode:
    'full'
  | 'inc_replace'

url_table_function_expr:
    FILES (
        LOCATION = '<string>',
        {
          FORMAT = (
              TYPE = 'CSV',
              LINE_DELIMITER = '<string>' | <expr>,
              FIELD_DELIMITER = '<string>' | <expr>,
              PARSE_HEADER = { TRUE | FALSE },
              ESCAPE = '<character>' | <expr>,
              FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
              ENCODING = 'charset',
              NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
              SKIP_HEADER = <int>,
              SKIP_BLANK_LINES = { TRUE | FALSE },
              TRIM_SPACE = { TRUE | FALSE },
              EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
          )
        | FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
        },
        [PATTERN = '<regex_pattern>']
    )
  | SOURCE (
        TYPE = 'ODPS',
        ACCESSID = '<string>',
        ACCESSKEY = '<string>',
        ENDPOINT = '<string>',
        TUNNEL_ENDPOINT = '<string>',
        PROJECT_NAME = '<string>',
        SCHEMA_NAME = '<string>',
        TABLE_NAME = '<string>',
        QUOTA_NAME = '<string>',
        COMPRESSION_CODE = '<string>'
    )

PARTITION_OPTION:
    partition_option_list
  | subpartition_option_list
```
Parameters
| Parameter | Description |
|---|---|
| parallel(N) | The degree of parallelism for data loading. Default value: 4. |
| load_batch_size(M) | The size of each batch to insert. The default value of M is 100. We recommend a value in the range [100, 1000]. |
| APPEND \| direct() \| NO_DIRECT | A hint for enabling direct load. **Notice**: Do not upgrade OceanBase Database while a direct load task is in progress. Otherwise, the direct load task may fail. |
| REMOTE_OSS \| LOCAL | Optional. |
| file_name | The path and name of the input file. **Note**: When you import a file from an object storage service, make sure that the required conditions are met. |
| table_name | The name of the table to which the data is imported. You can import data into a partitioned or non-partitioned table. |
| PARTITION_OPTION | The name of the partition to which the data is imported. **Note**: Partitioning is supported only for direct load. If you do not specify the direct load hint or the direct load configuration, the partition option does not take effect when you execute the LOAD DATA statement. |
| COMPRESSION | Specifies the compression format of the file. |
| FIELDS \| COLUMNS | Specifies the format of the fields. |
| LINES STARTING BY | Specifies the starting delimiter of a line. |
| LINES TERMINATED BY | Specifies the ending delimiter of a line. |
| IGNORE number { LINES \| ROWS } | Specifies the number of lines or rows to ignore. LINES indicates the number of lines at the beginning of the file. ROWS indicates the number of rows at the beginning of the file, as determined by the field delimiter. By default, the system maps each field in the input file to a column in the table. If the input file does not contain all the columns, the missing columns are filled with default values. **Note**: When you import data from multiple files, the behavior is the same as when you import data from a single file. |
| column_name_var | Optional. Specifies the name of an imported column. |
| LOG ERRORS [REJECT LIMIT {integer \| UNLIMITED}] | Optional. Specifies whether to enable error diagnostics when importing data from a URL. For more information, see log_errors. **Note**: For importing data from a URL, the LOAD DATA statement supports error diagnostics starting from V4.3.5 BP2. |
FILES
The FILES keyword consists of the LOCATION, FORMAT, and PATTERN clauses.
The `LOCATION` clause specifies the path where the external table files are stored. Typically, the data files of the external table are stored in a separate directory, which can contain subdirectories. When the table is created, the external table automatically collects all files in this directory.

- For a local location, the format is `LOCATION = '[file://] local_file_path'`, where `local_file_path` can be a relative or absolute path. If a relative path is specified, the current directory must be the installation directory of OceanBase Database. The `secure_file_priv` parameter specifies the file paths that the OBServer nodes have permission to access, and `local_file_path` must be a subpath of the `secure_file_priv` path.
- For a remote location, the format is as follows:
  - `LOCATION = '{oss|S3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'`, where `$ACCESS_ID`, `$ACCESS_KEY`, and `$HOST` are the access information required to access OSS, COS, or S3, and `s3_region` is the region selected when using S3. This sensitive access information is stored in encrypted form in the system tables of the database.
  - `LOCATION = 'hdfs://${hdfs_namenode_address}:${port}/PATH'`, where `port` is the port number of HDFS, and `PATH` is the directory path in HDFS.
  - For Kerberos authentication: `LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'`, where:
    - `principal`: specifies the user for login authentication.
    - `keytab`: specifies the path of the key file for user authentication.
    - `krb5conf`: specifies the path of the Kerberos environment description file.
    - `configs`: specifies additional HDFS configuration items. It is empty by default. However, in a Kerberos environment, this item usually needs to be configured, for example, `dfs.data.transfer.protection=authentication,privacy`, which sets the data transmission protection level to `authentication` and `privacy`.
Notice

When using an object storage path, the parameters of the object storage path are separated by the `&` symbol. Make sure that the parameter values you enter contain only uppercase and lowercase letters, digits, the characters `\/-_$+=`, and wildcards. If you enter other characters, the settings may fail.
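The Kerberos-style HDFS location shown above is an ordinary URL whose authentication parameters follow query-string rules. As a quick sanity check on how the pieces decompose (the host, port, principal, and file paths below are placeholders, not values from any real deployment):

```python
# Decompose a Kerberos-style HDFS location into host, path, and parameters.
# All concrete values here are placeholders for illustration.
from urllib.parse import urlsplit, parse_qs

location = ('hdfs://localhost:9000/user?principal=hdfs@EXAMPLE.COM'
            '&keytab=/etc/security/hdfs.keytab&krb5conf=/etc/krb5.conf')
parts = urlsplit(location)
# parse_qs splits on '&', matching the separator rule described above.
params = {k: v[0] for k, v in parse_qs(parts.query).items()}
```

Each parameter value stays within the allowed character set (`letters, digits, \/-_$+=`), so the query string round-trips without escaping.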
The `FORMAT` clause specifies properties related to the file reading format and supports the CSV, PARQUET, and ORC file formats.

When `TYPE = 'CSV'`, the following fields are supported:

- `LINE_DELIMITER`: specifies the line delimiter of the CSV file. The default value is `LINE_DELIMITER='\n'`.
- `FIELD_DELIMITER`: optional. Specifies the field delimiter of the CSV file. The default value is `FIELD_DELIMITER='\t'`.
- `PARSE_HEADER`: optional. Specifies whether the first line of the CSV file contains the column names. The default value is `FALSE`, which indicates that the first line is not treated as column names.
- `ESCAPE`: specifies the escape character of the CSV file. It must be one byte. The default value is `ESCAPE='\'`.
- `FIELD_OPTIONALLY_ENCLOSED_BY`: optional. Specifies the symbol that encloses field values in the CSV file. The default value is empty. Using this option indicates that only fields of certain types (such as CHAR, VARCHAR, TEXT, and JSON) are enclosed.
- `ENCODING`: specifies the character set encoding of the file. If not specified, the default value is `UTF8MB4`.
- `NULL_IF`: specifies the strings that are treated as `NULL`. The default value is empty.
- `SKIP_HEADER`: skips the file header, specifying the number of lines to skip.
- `SKIP_BLANK_LINES`: specifies whether to skip blank lines. The default value is `FALSE`, which indicates that blank lines are not skipped.
- `TRIM_SPACE`: specifies whether to remove leading and trailing spaces from fields in the file. The default value is `FALSE`, which indicates that they are not removed.
- `EMPTY_FIELD_AS_NULL`: specifies whether to treat empty strings as `NULL`. The default value is `FALSE`, which indicates that empty strings are not treated as `NULL`.

When `TYPE = 'PARQUET'` or `TYPE = 'ORC'`, no additional fields are supported.
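As a rough mental model of how several of the CSV options compose, the sketch below uses Python's `csv` module as a stand-in parser. The option names mirror the clause; the behavior is an approximation, not OceanBase's actual CSV parser.

```python
# Approximate model of a few CSV FORMAT options; Python's csv module stands
# in for OceanBase's internal parser.
import csv
import io

def parse_csv(text, field_delimiter=',', trim_space=False,
              null_if=(), empty_field_as_null=False, skip_blank_lines=False):
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter=field_delimiter):
        if skip_blank_lines and not row:
            continue                               # SKIP_BLANK_LINES
        out = []
        for field in row:
            if trim_space:
                field = field.strip()              # TRIM_SPACE
            if field in null_if:
                field = None                       # NULL_IF
            elif empty_field_as_null and field == '':
                field = None                       # EMPTY_FIELD_AS_NULL
            out.append(field)
        rows.append(out)
    return rows

sample = " a ,NULL,\n\n2,b,c\n"
parsed = parse_csv(sample, trim_space=True, null_if=('NULL',),
                   empty_field_as_null=True, skip_blank_lines=True)
```

Note the ordering in the sketch: trimming happens before the `NULL_IF` comparison, so `" NULL "` with `TRIM_SPACE = TRUE` would also become `NULL`. Whether OceanBase applies the same ordering is not stated here, so treat that detail as an assumption.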
The `PATTERN` clause specifies a regular expression pattern used to filter files in the `LOCATION` directory. For each file path in the `LOCATION` directory, the external table accesses the file if the path matches the pattern, and skips it otherwise. If this parameter is not specified, the external table accesses all files in the `LOCATION` directory by default.
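The filtering described above behaves like matching each candidate path against a regular expression. A minimal sketch (the file names are hypothetical, and whether the match is anchored to the full path is an assumption of this sketch):

```python
# Sketch of PATTERN filtering: every file path under LOCATION is tested
# against the regex; only matching files are read. Paths are hypothetical.
import re

paths = ['data/csv/test1.csv', 'data/csv/test2.csv', 'data/csv/notes.txt']
pattern = re.compile(r'test\d+\.csv')
selected = [p for p in paths if pattern.search(p)]
```

Here `notes.txt` is skipped, which is the same effect as `PATTERN = 'test\d+\.csv'` in the `FILES` clause.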
SOURCE
The SOURCE keyword does not contain other clauses. In this case, `TYPE = 'ODPS'` and the following fields are supported:

- `ACCESSID`: specifies the ID of the ODPS user.
- `ACCESSKEY`: specifies the password of the ODPS user.
- `ENDPOINT`: specifies the connection address of the ODPS service.
- `TUNNEL_ENDPOINT`: specifies the connection address of the Tunnel data transmission service.
- `PROJECT_NAME`: specifies the project where the table to be queried is located.
- `SCHEMA_NAME`: optional. Specifies the schema of the table to be queried.
- `TABLE_NAME`: specifies the name of the table to be queried.
- `QUOTA_NAME`: optional. Specifies whether to use the specified quota.
- `COMPRESSION_CODE`: optional. Specifies the compression format of the data source. Supported compression formats include `ZLIB`, `ZSTD`, `LZ4`, and `ODPS_LZ4`. If not specified, compression is not enabled.
log_errors
- `LOG ERRORS`: indicates that error rows can be skipped during import.
- `REJECT LIMIT`: optional. Specifies the maximum number of error rows allowed:
  - The default value is 0, which indicates that no error rows are allowed. If an error occurs, the operation fails.
  - `integer`: specifies the maximum number of error rows allowed on a single machine. For example, 10 indicates that at most 10 error rows are allowed on a single machine.
  - `UNLIMITED`: allows an unlimited number of error rows.
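The accounting rules can be summarized with a small model. This is an illustrative sketch of the semantics, not OceanBase's implementation; `parse` stands in for whatever makes a row fail.

```python
# Illustrative model of LOG ERRORS / REJECT LIMIT accounting.
def load_rows(rows, parse, log_errors=False, reject_limit=0):
    loaded, errors = [], 0
    for row in rows:
        try:
            loaded.append(parse(row))
        except ValueError:
            if not log_errors:
                raise          # default: fail immediately on the first bad row
            errors += 1        # LOG ERRORS: skip the bad row and count it
            if reject_limit != 'UNLIMITED' and errors > reject_limit:
                raise RuntimeError('reject limit reached')
    return loaded

# With LOG ERRORS and an unlimited limit, the bad row is skipped.
ok = load_rows(['1', 'oops', '2'], int, log_errors=True,
               reject_limit='UNLIMITED')
```

With `reject_limit=0` (the default when only `LOG ERRORS` is given), the first bad row already exceeds the limit, matching the "reject limit reached" behavior described below.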
Notice
- If the `LOG ERRORS` clause is not specified, the default behavior is a normal import: an error is reported immediately upon the first error.
- If the `LOG ERRORS` clause is specified but the `REJECT LIMIT` clause is not, it is equivalent to specifying a limit of 0. In this case, the operation fails upon the first error, but the error is recorded, and the error code reflects the diagnostic error "reject limit reached".

Rules for using wildcards during multi-file direct load
To facilitate multi-file direct load, the wildcard feature is introduced for server-side and OSS file imports, but not for client-side file imports.
Server-side wildcard usage
Matching rules:

- Matching a file:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/xxx/test.*.csv' replace into table t1 fields terminated by '|';
  ```

- Matching a directory:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/aaa*bb/test.1.csv' replace into table t1 fields terminated by '|';
  ```

- Matching both a directory and a file:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/aaa*bb/test.*.csv' replace into table t1 fields terminated by '|';
  ```
Considerations:

- At least one matching file must exist; otherwise, error code 4027 is returned.
- For the input `load data /*+ parallel(20) direct(true, 0) */ infile '/xxx/test.1*.csv,/xxx/test.6*.csv' replace into table t1 fields terminated by '|';`, the string `/xxx/test.1*.csv,/xxx/test.6*.csv` is treated as a single match expression. If no match is found, error code 4027 is returned.
- Only POSIX-compatible GLOB functions are supported. For example, `test.6*(6|0).csv` and `test.6*({0.csv,6.csv}|.csv)` can be found with the `ls` command but cannot be matched by the GLOB function, resulting in error code 4027.
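The distinction between shell expansion and POSIX globbing can be seen with Python's `fnmatch`, which follows POSIX-style glob rules: `*` and `?` work, but shell extensions such as `test.6*(6|0).csv` are matched literally and therefore match nothing. The file names below are hypothetical.

```python
# POSIX-style glob matching vs. extended shell patterns; names are made up.
from fnmatch import fnmatch

names = ['test.1.csv', 'test.60.csv', 'test.66.csv']
# Plain glob: '*' matches any run of characters, so all three files match.
plain = [n for n in names if fnmatch(n, 'test.*.csv')]
# Extended shell pattern from the example: '(6|0)' is taken literally by a
# POSIX glob, so no file matches -- analogous to error 4027 above.
extended = [n for n in names if fnmatch(n, 'test.6*(6|0).csv')]
```

A shell's `ls test.6*(6|0).csv` may still list files, because the shell expands extended glob syntax before the glob function ever sees it.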
Wildcard usage in Cloud Object Storage Service (OSS)

Matching rules:

- Matching a file:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ remote_oss infile 'oss://xxx/test.*.csv?host=xxx&access_id=xxx&access_key=xxx' replace into table t1 fields terminated by '|';
  ```

Considerations:

- Directory matching is not supported. For example, `load data /*+ parallel(20) direct(true, 0) */ remote_oss infile 'oss://aa*bb/test.*.csv?host=xxx&access_id=xxx&access_key=xxx' replace into table t1 fields terminated by '|';` returns `OB_NOT_SUPPORTED`.
- Only the `*` and `?` wildcards are supported in filenames. Other wildcards are allowed but cannot match any results.
Examples
Import data from a file on the server (OBServer node)
Example 1: Import data from a file on the server.
1. Set the global secure path.

   Notice

   For security reasons, the system variable `secure_file_priv` can be modified only when you connect to the database through a local Unix socket. For more information, see secure_file_priv.

   ```sql
   obclient> SET GLOBAL secure_file_priv = "/";
   ```

2. Log out.

   Note

   Because `secure_file_priv` is a `GLOBAL` variable, you need to execute `\q` to log out for the change to take effect.

   ```sql
   obclient> \q
   ```

   The return result is as follows:

   ```shell
   Bye
   ```

3. After reconnecting to the database, use the `LOAD DATA` statement to import data.

   - Normal import:

     ```sql
     obclient> LOAD DATA INFILE '/home/admin/test.csv' INTO TABLE t1;
     ```

   - Enable direct load by using the `APPEND` hint:

     ```sql
     LOAD DATA /*+ PARALLEL(4) APPEND */ INFILE '/home/admin/test.csv' INTO TABLE t1;
     ```
Example 2: Enable direct load by using the `APPEND` hint.

```sql
LOAD DATA /*+ PARALLEL(4) APPEND */ INFILE '/home/admin/a.csv' INTO TABLE t;
```

Example 3: Import a CSV file.

- Import all columns from the `test1.csv` file.

  ```sql
  load data /*+ direct(true,0) parallel(2)*/ from files( location = "data/csv", format = ( type = 'csv', field_delimiter = ',', parse_header = true, skip_blank_lines = true ), pattern = 'test1.csv') into table t1;
  ```

- Read the `c1` and `c2` columns from the `test1.csv` file in the `data/csv` directory and import them into the `col1` and `col2` columns of the `t1` table.

  ```sql
  load data /*+ direct(true,0) parallel(2)*/ from ( select c1, c2 from files( location = 'data/csv', format = ( type = 'csv', field_delimiter = ',', parse_header = true, skip_blank_lines = true ), pattern = 'test1.csv')) into table t1 (col1, col2);
  ```

Example 4: Import a PARQUET file.

```sql
load data /*+ direct(true,0) parallel(2)*/ from files( location = "data/parquet", format = ( type = 'PARQUET'), pattern = 'test1.parquet') into table t1;
```

Example 5: Import an ORC file.

```sql
load data /*+ direct(true,0) parallel(2)*/ from files( location = "data/orc", format = ( type = 'ORC'), pattern = 'test1.orc') into table t1;
```

Example 6: Import from an ODPS source.

```sql
load data /*+ direct(true,0) parallel(2)*/ from source ( type = 'ODPS', accessid = '$ODPS_ACCESSID', accesskey = '******', endpoint= '$ODPS_ENDPOINT', project_name = 'example_project', schema_name = '', table_name = 'example_table', quota_name = '', compression_code = '') into table t1;
```

Import data from a file on the client (local machine)
Example 1: Import data from a local file to an OceanBase Database table.

1. Open a terminal or command prompt window and enter the following command to start the client.

   ```shell
   obclient --local-infile -hxxx.xxx.xxx.xxx -P2881 -usys@oracle001 -p******
   ```

   The return result is as follows:

   ```shell
   Welcome to the OceanBase. Commands end with ; or \g.
   Your OceanBase connection id is 3221548006
   Server version: OceanBase 4.2.2.0 (r100000032024010510-75c47d4be18a399e13c5309de1a81da5caf4e7c0) (Built Jan 5 2024 10:17:55)

   Copyright (c) 2000, 2018, OceanBase and/or its affiliates. All rights reserved.

   Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

   obclient [SYS]>
   ```

   Notice

   To use the `LOAD DATA LOCAL INFILE` feature, use OBClient V2.2.4 or later.

2. In the client, execute the `LOAD DATA LOCAL INFILE` statement to load the local data file.

   ```sql
   obclient [SYS]> LOAD DATA LOCAL INFILE '/home/admin/test_data/tbl1.csv' INTO TABLE tbl1 FIELDS TERMINATED BY ',';
   ```

   The return result is as follows:

   ```shell
   Query OK, 3 rows affected
   Records: 3  Deleted: 0  Skipped: 0  Warnings: 0
   ```
Example 2: Directly import a compressed file by setting COMPRESSION.

```sql
LOAD DATA LOCAL INFILE '/your/file/lineitem.tbl.gz' INTO TABLE lineitem COMPRESSION GZIP FIELDS TERMINATED BY '|';
```

Example 3: Specify a partition for direct load by using PARTITION.

- Specify a partition for direct load.

  ```sql
  load data /*+ direct(true,0) parallel(2) load_batch_size(100) */ infile "$FILE_PATH" into table t1 partition(p0, p1) fields terminated by '|' enclosed by '' lines starting by '' terminated by '\n';
  ```

- Specify a subpartition for direct load.

  ```sql
  load data /*+ direct(true,0) parallel(2) load_batch_size(100) */ infile "$FILE_PATH" into table t1 partition(p0sp0, p1sp1) fields terminated by '|' enclosed by '' lines starting by '' terminated by '\n';
  ```

Import data from an OSS file
Example 1: Enable direct load by using the `direct(bool, int)` hint. The file to be directly loaded is on OSS.

```sql
LOAD DATA /*+ direct(true,1024) parallel(16) */ REMOTE_OSS INFILE 'oss://antsys-oceanbasebackup/backup_rd/xiaotao.ht/lineitem2.tbl?host=***.oss-cdn.***&access_id=***&access_key=***' INTO TABLE tbl1 FIELDS TERMINATED BY ',';
```

Import data from a server-side file as a URL external table
Notice
The IP addresses in the sample commands have been desensitized. You must replace them with the actual IP addresses of your machines.
The following example shows how to import data from a file on the local server into a table in an Oracle-mode tenant of OceanBase Database as a URL external table. The steps are as follows:
1. Prepare an external file.

   Run the following command to create a file named `column_conv.csv` in the `/home/admin/test_csv` directory on the server where the OBServer node is located.

   ```shell
   [admin@xxx /home/admin/test_csv]# vi column_conv.csv
   ```

   The content of the file is as follows:

   ```shell
   1,short,short
   2,long_text,long_text
   3,long_text,long_text
   ```

2. Set the import file path.

   Notice

   For security reasons, the `secure_file_priv` system variable can be modified only over a local Unix socket connection. For more information, see secure_file_priv.

   1. Run the following command to log in to the server where the OBServer node is located.

      ```shell
      ssh admin@10.10.10.1
      ```

   2. Run the following command to connect to the `oracle001` tenant by using a local Unix socket connection.

      ```shell
      obclient -S /home/admin/oceanbase/run/sql.sock -usys@oracle001 -p******
      ```

   3. Run the following SQL statement to set the import path to `/home/admin/test_csv`.

      ```sql
      SET GLOBAL secure_file_priv = "/home/admin/test_csv";
      ```

3. Reconnect to the `oracle001` tenant.

   Here is an example:

   ```shell
   obclient -h10.10.10.1 -P2881 -usys@oracle001 -p****** -A
   ```

4. Create a table named `test_tbl1`.

   ```sql
   CREATE TABLE test_tbl1(col1 VARCHAR2(5), col2 VARCHAR2(5), col3 VARCHAR2(5));
   ```

5. Use the `LOAD DATA` statement to import data from a URL external table into the `test_tbl1` table and specify error diagnostics.

   ```sql
   LOAD DATA FROM FILES(
       LOCATION = '/home/admin/test_csv',
       FORMAT = (
           TYPE = 'csv',
           FIELD_DELIMITER = ','),
       PATTERN = 'column_conv.csv')
   INTO TABLE test_tbl1
   LOG ERRORS REJECT LIMIT UNLIMITED;
   ```

   The return result is as follows:

   ```shell
   Query OK, 1 row affected, 2 warnings
   Records: 1  Deleted: 0  Skipped: 0  Warnings: 2
   ```

6. View the data in the `test_tbl1` table.

   ```sql
   SELECT * FROM test_tbl1;
   ```

   The return result is as follows:

   ```shell
   +------+-------+-------+
   | COL1 | COL2  | COL3  |
   +------+-------+-------+
   | 1    | short | short |
   +------+-------+-------+
   1 row in set
   ```
References
- For more information about how to use the `LOAD DATA` statement, see Import data by using the LOAD DATA statement.
- For more information about how to use the `LOAD DATA` statement for direct load, see Import data by using the LOAD DATA statement for direct load.