Purpose
The `LOAD DATA` statement of OceanBase Database supports the following input files:

- Server-side (OBServer node) files: files located on the OBServer nodes of OceanBase Database. You can use the `LOAD DATA INFILE` statement or the `LOAD DATA FROM URL` statement to load data from server-side files into database tables.
- Client-side (local) files: files located in the local file system of the client. You can use the `LOAD DATA LOCAL INFILE` statement or the `LOAD DATA FROM URL` statement to load data from client-side local files into database tables.

  **Note**: When executing the `LOAD DATA LOCAL INFILE` command, OceanBase Database automatically adds the `IGNORE` option.

- OSS files: files stored in an OSS file system. You can use the `LOAD DATA REMOTE_OSS INFILE` statement to load data from OSS files into database tables.
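As a quick orientation, the three input sources map to three statement forms. The following sketch uses hypothetical table names, file paths, and OSS parameters:

```sql
-- Server-side file on an OBServer node (path is a placeholder):
LOAD DATA INFILE '/home/admin/data.csv' INTO TABLE t1;

-- Client-side local file (requires starting the client with --local-infile):
LOAD DATA LOCAL INFILE '/home/admin/data.csv' INTO TABLE t1;

-- File in OSS (bucket, host, and credentials are placeholders):
LOAD DATA REMOTE_OSS INFILE 'oss://bucket/data.csv?host=xxx&access_id=xxx&access_key=xxx'
INTO TABLE t1;
```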
`LOAD DATA` currently supports importing text files in the CSV format. The import process consists of the following steps:

1. File parsing: OceanBase Database reads the data from the file specified by the user and determines whether to parse the input file in parallel or serially based on the specified degree of parallelism.
2. Data distribution: Because OceanBase Database is a distributed database, data for different partitions may be distributed across different OBServer nodes. `LOAD DATA` computes the destination of each parsed row and determines the OBServer node to which the data should be sent.
3. Data insertion: After the target OBServer node receives the data, it performs the `INSERT` operation locally to insert the data into the corresponding partition.
Considerations
- Tables with triggers cannot use the `LOAD DATA` statement.
- To import data from external files, you must have the `FILE` privilege and complete the following configurations:
  - When loading server-side files, set the system variable `secure_file_priv` in advance to configure the accessible path for importing or exporting files.
  - When loading client-side local files, add the `--local-infile[=1]` option when starting the MySQL client or OBClient to enable loading data from the local file system.
- When using direct load with specified partitions, the target table cannot be a replicated table, and it must not contain auto-increment columns, identity columns, or a global index.
- You can execute the `GRANT FILE ON *.* TO $user_name;` statement to grant the `FILE` privilege to the user specified by `$user_name`, who executes the `LOAD DATA` statement.
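The privilege and path configuration described above can be sketched as follows. The user name and directory are hypothetical:

```sql
-- As an administrator, grant the FILE privilege to the importing user
-- (user name is a placeholder):
GRANT FILE ON *.* TO load_user;

-- Allow server-side imports from a specific directory
-- (path is a placeholder; changing this variable requires a local socket connection):
SET GLOBAL secure_file_priv = "/home/admin/load_dir";
```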
Syntax
```sql
-- Load data from a regular file.
LOAD DATA
    [/*+ PARALLEL(N) [load_batch_size(M)] [APPEND | direct(bool, int, [load_mode])] | NO_DIRECT */]
    [REMOTE_OSS | LOCAL] INFILE 'file_name'
    [REPLACE | IGNORE]
    INTO TABLE table_name [PARTITION(PARTITION_OPTION)]
    [COMPRESSION [=] {AUTO | NONE | GZIP | DEFLATE | ZSTD}]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(column_name_var
        [, column_name_var] ...)]

load_mode:
    'full'
  | 'inc_replace'

PARTITION_OPTION:
    partition_option_list
  | subpartition_option_list
```

```sql
-- Load data from a URL.
LOAD DATA
    [/*+ PARALLEL(N) [load_batch_size(M)] [APPEND | direct(bool, int, [load_mode])] | NO_DIRECT */]
    [REPLACE | IGNORE]
    FROM { url_table_function_expr |
           ( SELECT expression_list FROM url_table_function_expr ) }
    INTO TABLE table_name
    [PARTITION(PARTITION_OPTION)]
    [(column_name_var [, column_name_var] ...)]
    [LOG ERRORS [REJECT LIMIT {integer | UNLIMITED}]]

load_mode:
    'full'
  | 'inc_replace'

url_table_function_expr:
    FILES (
        LOCATION = '<string>',
        {
            FORMAT = (
                TYPE = 'CSV',
                LINE_DELIMITER = '<string>' | <expr>,
                FIELD_DELIMITER = '<string>' | <expr>,
                PARSE_HEADER = { TRUE | FALSE },
                ESCAPE = '<character>' | <expr>,
                FIELD_OPTIONALLY_ENCLOSED_BY = '<character>' | <expr>,
                ENCODING = 'charset',
                NULL_IF = ('<string>' | <expr>, '<string>' | <expr> ...),
                SKIP_HEADER = <int>,
                SKIP_BLANK_LINES = { TRUE | FALSE },
                TRIM_SPACE = { TRUE | FALSE },
                EMPTY_FIELD_AS_NULL = { TRUE | FALSE }
            )
          | FORMAT = ( TYPE = 'PARQUET' | 'ORC' )
        },
        [PATTERN = '<regex_pattern>']
    )
  | SOURCE (
        TYPE = 'ODPS',
        ACCESSID = '<string>',
        ACCESSKEY = '<string>',
        ENDPOINT = '<string>',
        TUNNEL_ENDPOINT = '<string>',
        PROJECT_NAME = '<string>',
        SCHEMA_NAME = '<string>',
        TABLE_NAME = '<string>',
        QUOTA_NAME = '<string>',
        COMPRESSION_CODE = '<string>'
    )

PARTITION_OPTION:
    partition_option_list
  | subpartition_option_list
```
Parameters
| Parameter | Description |
|---|---|
| parallel(N) | The degree of parallelism for loading data. Default value: 4. |
| load_batch_size(M) | The batch size for each insertion. Default value of M: 100. Recommended range: [100, 1000]. |
| APPEND \| direct() \| NO_DIRECT | Hints that enable or disable direct load. **Notice**: During a direct load task, we recommend that you do not upgrade OceanBase Database, because the upgrade may cause the direct load task to fail. For more information about using the LOAD DATA statement for direct load, see Import data or files by using the LOAD DATA statement. |
| REMOTE_OSS \| LOCAL | Optional. Specifies whether to read data from an object storage service or from the local file system of the client. |
| file_name | The path and name of the input file. The format of the file_name parameter depends on whether REMOTE_OSS or LOCAL is specified. **Note**: When you import a file from OSS, make sure that the requirements for OSS imports are met. |
| table_name | The name of the table into which the data is imported. Both partitioned and non-partitioned tables are supported. |
| PARTITION_OPTION | The name of the partition for direct load. You can specify multiple partitions by separating them with commas (,). **Note**: Only direct load supports importing data into specified partitions. Regular LOAD DATA does not support specifying partitions; if you do not specify direct load hints or parameters, specifying partitions in the LOAD DATA statement has no effect. |
| COMPRESSION | The compression format of the file. Valid values: AUTO, NONE, GZIP, DEFLATE, and ZSTD. |
| FIELDS \| COLUMNS | The format of fields, such as the field delimiter, enclosing character, and escape character. |
| LINES STARTING BY | The delimiter that indicates the start of a line. |
| LINES TERMINATED BY | The delimiter that indicates the end of a line. |
| IGNORE number {LINES \| ROWS} | The number of initial lines or rows to ignore. LINES indicates the first few lines of the file, and ROWS indicates the first few rows as separated by the field delimiter. By default, the system matches each input file column with a table column in order. If the input file does not contain all the columns of the table, the system fills in the missing columns. **Note**: The behavior for multi-file import is the same as for single-file import. |
| column_name_var | The name of the column to which the imported data belongs. |
| LOG ERRORS [REJECT LIMIT {integer \| UNLIMITED}] | Optional. Enables error diagnostics when importing URL external tables. For more information, see the log_errors section below. **Note**: The LOAD DATA syntax for importing URL external tables supports error diagnostics starting from OceanBase Database V4.3.5 BP2. |
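As a sketch of how several of these parameters combine in one statement (the table name, column names, and file path below are hypothetical):

```sql
LOAD DATA /*+ PARALLEL(4) load_batch_size(500) */
INFILE '/home/admin/orders.csv'
REPLACE INTO TABLE orders
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(order_id, customer_name, amount);
```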
FILES
The FILES keyword consists of the clauses LOCATION, FORMAT, and PATTERN.
The `LOCATION` clause specifies the path where the external table files are stored. Typically, the data files for the external table are stored in a single directory. The directory can include subdirectories, and when the table is created, the external table automatically collects all files in that directory.

- The local `LOCATION` format is `LOCATION = '[file://] local_file_path'`, where `local_file_path` can be either a relative path or an absolute path. If a relative path is specified, the current directory must be the installation directory of OceanBase Database. The `secure_file_priv` parameter configures the file paths that OBServer nodes have permission to access; `local_file_path` must be a subpath of the `secure_file_priv` path.
- The remote `LOCATION` formats are as follows:
  - `LOCATION = '{oss|cos|S3}://$ACCESS_ID:$ACCESS_KEY@$HOST:s3_region/remote_file_path'`, where `$ACCESS_ID`, `$ACCESS_KEY`, and `$HOST` are the access credentials required to connect to OSS, COS, or S3, and `s3_region` specifies the region information for S3. These sensitive access credentials are stored in the database's system tables in an encrypted format.
  - `LOCATION = 'hdfs://${hdfs_namenode_address}:${port}/PATH'`, where `port` is the port number of HDFS and `PATH` specifies the directory path in HDFS.
  - With Kerberos authentication: `LOCATION = 'hdfs://localhost:port/user?principal=xxx&keytab=xxx&krb5conf=xxx&configs=xxx'`, where:
    - `principal`: the authenticated user.
    - `keytab`: the key file for user authentication.
    - `krb5conf`: the Kerberos environment description file used by the user.
    - `configs`: additional HDFS configuration options. By default, this is empty. However, in a Kerberos environment, this configuration typically needs to be set. For example, `dfs.data.transfer.protection=authentication,privacy` specifies the data transfer protection levels `authentication` and `privacy`.

**Note**: When you use an object storage path, the parameters in the path are separated by the `&` symbol. Make sure that the parameter values contain only uppercase and lowercase English letters, digits, the characters `\/-_$+=`, and wildcards. Entering any other characters may cause the configuration to fail.
The `FORMAT` clause specifies attributes related to the file format. Valid values of `TYPE`: CSV, PARQUET, and ORC.

When `TYPE = 'CSV'`, the following fields are supported:

- `LINE_DELIMITER`: the line delimiter in the CSV file. Default value: `LINE_DELIMITER='\n'`.
- `FIELD_DELIMITER`: optional. The column delimiter in the CSV file. Default value: `FIELD_DELIMITER='\t'`.
- `PARSE_HEADER`: optional. Specifies whether the first line of the CSV file contains the column names. Default value: `FALSE`, meaning the first line is not treated as column names.
- `ESCAPE`: the escape character in the CSV file. It must be a single byte. Default value: `ESCAPE='\'`.
- `FIELD_OPTIONALLY_ENCLOSED_BY`: optional. The symbol used to enclose field values in the CSV file. The default value is empty. When this option is used, only certain types of fields (such as CHAR, VARCHAR, TEXT, and JSON) are enclosed.
- `ENCODING`: the character set encoding of the file. If not specified, the default value is `UTF8MB4`.
- `NULL_IF`: the strings that are treated as `NULL`. The default value is empty.
- `SKIP_HEADER`: skips the file header and specifies the number of lines to skip.
- `SKIP_BLANK_LINES`: specifies whether to skip blank lines. Default value: `FALSE`, meaning blank lines are not skipped.
- `TRIM_SPACE`: specifies whether to remove leading and trailing spaces from fields in the file. Default value: `FALSE`, meaning leading and trailing spaces are not removed.
- `EMPTY_FIELD_AS_NULL`: specifies whether to treat empty strings as `NULL`. Default value: `FALSE`, meaning empty strings are not treated as `NULL`.

When `TYPE = 'PARQUET'` or `TYPE = 'ORC'`, there are no additional fields.
The `PATTERN` clause specifies a regular-expression pattern that filters the files in the directory specified by the `LOCATION` clause. For each file in that directory, the external table accesses the file only if its path matches the pattern; otherwise, the file is skipped. If this parameter is not specified, the external table accesses all files in the directory specified by the `LOCATION` clause by default.
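The clauses above can be combined as in the following sketch, which pairs `LOCATION`, several CSV `FORMAT` fields, and `PATTERN` (the directory, table name, and pattern are hypothetical):

```sql
LOAD DATA /*+ PARALLEL(2) */
FROM FILES(
    LOCATION = '/home/admin/csv_dir',
    FORMAT = (
        TYPE = 'CSV',
        FIELD_DELIMITER = ',',
        PARSE_HEADER = TRUE,
        SKIP_BLANK_LINES = TRUE,
        NULL_IF = ('NULL')
    ),
    PATTERN = '.*[.]csv'
)
INTO TABLE t1;
```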
SOURCE
The `SOURCE` clause takes no other clauses and contains the following fields when `TYPE = 'ODPS'`:

- `ACCESSID`: the ID of the ODPS user.
- `ACCESSKEY`: the password of the ODPS user.
- `ENDPOINT`: the connection address of the ODPS service.
- `TUNNEL_ENDPOINT`: the connection address of the Tunnel data transmission service.
- `PROJECT_NAME`: the project in which the table to be queried is located.
- `SCHEMA_NAME`: optional. The schema in which the table to be queried is located.
- `TABLE_NAME`: the name of the table to be queried.
- `QUOTA_NAME`: optional. Specifies whether to use the specified quota.
- `COMPRESSION_CODE`: optional. The compression format of the data source. Valid values: `ZLIB`, `ZSTD`, `LZ4`, and `ODPS_LZ4`. The default value is an empty string, which means compression is not enabled.
log_errors
- `LOG ERRORS`: specifies that error rows are skipped during the import.
- `REJECT LIMIT`: optional. Sets the maximum number of error rows allowed:
  - Default value 0: no error rows are allowed, and the operation fails upon the first error.
  - `integer`: the maximum number of error rows allowed per machine. For example, 10 means at most 10 error rows per machine.
  - `UNLIMITED`: allows an unlimited number of error rows.
**Notice**

- If the `LOG ERRORS` clause is not specified, the standard import behavior applies: the operation fails immediately upon the first error.
- If the `LOG ERRORS` clause is specified but the `REJECT LIMIT` clause is not, it is equivalent to setting the limit to 0 for diagnostics. In this case, the operation still fails upon the first error, but the error is logged, and the error code indicates a diagnostic-related error, such as "reject limit reached".
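A minimal sketch of the three `REJECT LIMIT` variants follows. The table name and directory are hypothetical:

```sql
-- LOG ERRORS without REJECT LIMIT: fail on the first error, but log it:
LOAD DATA FROM FILES(LOCATION = '/home/admin/csv_dir', FORMAT = (TYPE = 'CSV'))
INTO TABLE t1 LOG ERRORS;

-- Tolerate at most 10 error rows per machine:
LOAD DATA FROM FILES(LOCATION = '/home/admin/csv_dir', FORMAT = (TYPE = 'CSV'))
INTO TABLE t1 LOG ERRORS REJECT LIMIT 10;

-- Skip any number of error rows:
LOAD DATA FROM FILES(LOCATION = '/home/admin/csv_dir', FORMAT = (TYPE = 'CSV'))
INTO TABLE t1 LOG ERRORS REJECT LIMIT UNLIMITED;
```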
Wildcard rules for direct load of multiple files
To facilitate the import of multiple files, the wildcard feature is introduced for server-side and OSS file imports, but not for client-side file imports.
Use of wildcards on the server
Matching rules:

- Match file names:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/xxx/test.*.csv' replace into table t1 fields terminated by '|';
  ```

- Match directories:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/aaa*bb/test.1.csv' replace into table t1 fields terminated by '|';
  ```

- Match both directories and file names:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/aaa*bb/test.*.csv' replace into table t1 fields terminated by '|';
  ```
Considerations:

- At least one matching file must exist. Otherwise, error code 4027 is returned.
- In the statement

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ infile '/xxx/test.1*.csv,/xxx/test.6*.csv' replace into table t1 fields terminated by '|';
  ```

  the string `/xxx/test.1*.csv,/xxx/test.6*.csv` is matched as a whole. If no file matches, error code 4027 is returned.
- Only wildcards supported by the POSIX glob function can be used. Patterns such as `test.6*(6|0).csv` and `test.6*({0.csv,6.csv}|.csv)` can be entered, but they will not match any result, and error code 4027 is returned when you use them.
Use of wildcards in object storage service (OSS)
Matching rules:

- Match file names:

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ remote_oss infile 'oss://xxx/test.*.csv?host=xxx&access_id=xxx&access_key=xxx' replace into table t1 fields terminated by '|';
  ```

Considerations:

- Directory matching is not supported. For example, the statement

  ```sql
  load data /*+ parallel(20) direct(true, 0) */ remote_oss infile 'oss://aa*bb/test.*.csv?host=xxx&access_id=xxx&access_key=xxx' replace into table t1 fields terminated by '|';
  ```

  returns `OB_NOT_SUPPORTED`.
- Only `*` and `?` are supported as wildcard characters in file names. Other wildcard characters can be entered but will not match any result.
Examples
Import data from a server
Example 1: Import data from a server.

1. Set the global secure path.

   **Notice**: For security reasons, when you set the system variable `secure_file_priv`, you can connect to the database only through a local socket to execute the SQL statement that modifies the global variable. For more information, see secure_file_priv.

   ```sql
   obclient> SET GLOBAL secure_file_priv = "/";
   ```

2. Log out.

   **Note**: Because `secure_file_priv` is a `GLOBAL` variable, you need to execute `\q` to make the setting take effect.

   ```sql
   obclient> \q
   ```

   The return result is as follows:

   ```shell
   Bye
   ```

3. Reconnect to the database and use the `LOAD DATA` statement to import data.

   - Perform a regular import:

     ```sql
     obclient> LOAD DATA INFILE '/home/admin/test.csv' INTO TABLE t1;
     ```

   - Use the `APPEND` hint to enable direct load:

     ```sql
     LOAD DATA /*+ PARALLEL(4) APPEND */ INFILE '/home/admin/test.csv' INTO TABLE t1;
     ```
Example 2: Use the `APPEND` hint to enable direct load.

```sql
LOAD DATA /*+ PARALLEL(4) APPEND */
INFILE '/home/admin/a.csv'
INTO TABLE t;
```
Example 3: Import a CSV file.

- Import all columns from the `test1.csv` file:

  ```sql
  load data /*+ direct(true,0) parallel(2)*/
  from files(
      location = 'data/csv',
      format = (
          type = 'csv',
          field_delimiter = ',',
          parse_header = true,
          skip_blank_lines = true
      ),
      pattern = 'test1.csv')
  into table t1;
  ```

- Read the `c1` and `c2` columns from the `test1.csv` file in the `data/csv` directory and import them into the `col1` and `col2` columns of the `t1` table:

  ```sql
  load data /*+ direct(true,0) parallel(2)*/
  from (
      select c1, c2 from files(
          location = 'data/csv',
          format = (
              type = 'csv',
              field_delimiter = ',',
              parse_header = true,
              skip_blank_lines = true
          ),
          pattern = 'test1.csv'))
  into table t1 (col1, col2);
  ```
Example 4: Import a PARQUET file.

```sql
load data /*+ direct(true,0) parallel(2)*/
from files(
    location = 'data/parquet',
    format = ( type = 'PARQUET'),
    pattern = 'test1.parquet')
into table t1;
```
Example 5: Import an ORC file.

```sql
load data /*+ direct(true,0) parallel(2)*/
from files(
    location = 'data/orc',
    format = ( type = 'ORC'),
    pattern = 'test1.orc')
into table t1;
```
Example 6: Import data from ODPS.

```sql
load data /*+ direct(true,0) parallel(2)*/
from source (
    type = 'ODPS',
    accessid = '$ODPS_ACCESSID',
    accesskey = '******',
    endpoint= '$ODPS_ENDPOINT',
    project_name = 'example_project',
    schema_name = '',
    table_name = 'example_table',
    quota_name = '',
    compression_code = '')
into table t1;
```
Import data from a local file
Example 1: Import data from a local file to a table in OceanBase Database.

1. Open a terminal or command prompt window and run the following command to start the client:

   ```shell
   obclient --local-infile -hxxx.xxx.xxx.xxx -P2881 -usys@oracle001 -p******
   ```

   The return result is as follows:

   ```shell
   Welcome to the OceanBase. Commands end with ; or \g.
   Your OceanBase connection id is 3221548006
   Server version: OceanBase 4.2.2.0 (r100000032024010510-75c47d4be18a399e13c5309de1a81da5caf4e7c0) (Built Jan 5 2024 10:17:55)

   Copyright (c) 2000, 2018, OceanBase and/or its affiliates. All rights reserved.

   Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

   obclient [SYS]>
   ```

   **Notice**: To use the `LOAD DATA LOCAL INFILE` feature, you must use OBClient V2.2.4 or later.

2. In the client, execute the `LOAD DATA LOCAL INFILE` statement to load data from the local file:

   ```sql
   obclient [SYS]> LOAD DATA LOCAL INFILE '/home/admin/test_data/tbl1.csv' INTO TABLE tbl1 FIELDS TERMINATED BY ',';
   ```

   The return result is as follows:

   ```shell
   Query OK, 3 rows affected
   Records: 3  Deleted: 0  Skipped: 0  Warnings: 0
   ```
Example 2: Import a compressed file directly by setting `COMPRESSION`.

```sql
LOAD DATA LOCAL INFILE '/your/file/lineitem.tbl.gz'
INTO TABLE lineitem
COMPRESSION GZIP
FIELDS TERMINATED BY '|';
```
Example 3: Specify a partition for direct load.

- Specify partitions by using the partitioning key:

  ```sql
  load data /*+ direct(true,0) parallel(2) load_batch_size(100) */
  infile "$FILE_PATH"
  into table t1 partition(p0, p1)
  fields terminated by '|' enclosed by '' lines starting by '' terminated by '\n';
  ```

- Specify subpartitions by using the subpartitioning key:

  ```sql
  load data /*+ direct(true,0) parallel(2) load_batch_size(100) */
  infile "$FILE_PATH"
  into table t1 partition(p0sp0, p1sp1)
  fields terminated by '|' enclosed by '' lines starting by '' terminated by '\n';
  ```
Import data from an OSS file
Example 1: Use the `direct(bool, int)` hint to enable direct load, where the file to be loaded is stored in OSS.

```sql
LOAD DATA /*+ direct(true,1024) parallel(16) */ REMOTE_OSS INFILE 'oss://antsys-oceanbasebackup/backup_rd/xiaotao.ht/lineitem2.tbl?host=***.oss-cdn.***&access_id=***&access_key=***' INTO TABLE tbl1 FIELDS TERMINATED BY ',';
```
Import data from server-side files as URL external tables
Notice
The commands involving IP addresses in the examples have been anonymized. Replace the IP addresses with the actual ones when you verify the commands.
The following steps show how to import data in the Oracle-compatible mode of OceanBase Database from external files located on the server:

1. Prepare the external file.

   Run the following command to create the file `column_conv.csv` in the `/home/admin/test_csv` directory on the machine that hosts the OBServer node:

   ```shell
   [admin@xxx /home/admin/external_csv]# vi column_conv.csv
   ```

   The content of the file is as follows:

   ```shell
   1,short,short
   2,long_text,long_text
   3,long_text,long_text
   ```

2. Set the file path for the import.

   **Notice**: For security reasons, when you set the system variable `secure_file_priv`, you can modify this global variable only through a local Unix socket connection to the database. For more information, see secure_file_priv.

   1. Log in to the machine where the OBServer node is located:

      ```shell
      ssh admin@10.10.10.1
      ```

   2. Connect to the tenant `oracle001` through a local Unix socket connection:

      ```shell
      obclient -S /home/admin/oceanbase/run/sql.sock -usys@oracle001 -p******
      ```

   3. Execute the following SQL command to set the import path to `/home/admin/test_csv`:

      ```sql
      SET GLOBAL secure_file_priv = "/home/admin/test_csv";
      ```

3. Reconnect to the tenant `oracle001`. Example:

   ```shell
   obclient -h10.10.10.1 -P2881 -usys@oracle001 -p****** -A
   ```

4. Create the table `test_tbl1`:

   ```sql
   CREATE TABLE test_tbl1(col1 VARCHAR2(5), col2 VARCHAR2(5), col3 VARCHAR2(5));
   ```

5. Use the `LOAD DATA` statement with the URL external table syntax to import data into the table `test_tbl1`, and specify error diagnostics:

   ```sql
   LOAD DATA FROM FILES(
       LOCATION = '/home/admin/test_csv',
       FORMAT = (
           TYPE = 'csv',
           FIELD_DELIMITER = ','),
       PATTERN = 'column_conv.csv')
   INTO TABLE test_tbl1
   LOG ERRORS REJECT LIMIT UNLIMITED;
   ```

   The return result is as follows:

   ```shell
   Query OK, 1 row affected, 2 warnings
   Records: 1  Deleted: 0  Skipped: 0  Warnings: 2
   ```

6. View the data in `test_tbl1`:

   ```sql
   SELECT * FROM test_tbl1;
   ```

   The return result is as follows:

   ```shell
   +------+-------+-------+
   | COL1 | COL2  | COL3  |
   +------+-------+-------+
   | 1    | short | short |
   +------+-------+-------+
   1 row in set
   ```
References
- For more information about the `LOAD DATA` statement, see Import data by using the LOAD DATA statement.
- For more information about direct load by using the `LOAD DATA` statement, see Import data by using the LOAD DATA statement.