Description
This statement is used to load data from external files to a database table.
Currently, the LOAD DATA statement can import text files in the CSV format. The import process comprises the following steps:
- Parse the file: OceanBase Database reads data from the file based on the filename you specified and parses the data in parallel or serially based on the specified degree of parallelism.
- Distribute the data: As a distributed database, OceanBase Database distributes the parsed data to OBServer nodes. `LOAD DATA` calculates the OBServer node to which each data record should be sent.
- Insert the data: After the destination OBServer node receives the data, it executes an `INSERT` statement to insert the data into the corresponding partition.
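The three steps above can be sketched in code. The following Python simulation is purely illustrative and is not OceanBase code; the hash-based routing rule, the node count, and the batch size are assumptions made for the example.

```python
from itertools import islice

def route_and_batch(records, key_index, num_nodes, batch_size):
    """Simulate LOAD DATA distribution: group parsed records by the
    node assumed to own their partition, then split each group into
    batches for insertion. The hash routing rule is hypothetical."""
    groups = {}
    for record in records:
        node = hash(record[key_index]) % num_nodes  # assumed routing rule
        groups.setdefault(node, []).append(record)
    # Split each node's records into insert batches.
    batched = {}
    for node, recs in groups.items():
        it = iter(recs)
        batched[node] = [batch for batch in iter(lambda: list(islice(it, batch_size)), [])]
    return batched

rows = [("1", "a"), ("2", "b"), ("3", "c"), ("4", "d"), ("5", "e")]
plan = route_and_batch(rows, key_index=0, num_nodes=2, batch_size=2)
```

Each batch in `plan` stands in for one independent subtask transaction, which is why the overall import is not atomic.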
Limitations and considerations
Limitations
- Tables with triggers (triggered on DML operations) are not supported.
- Local data files (the `LOCAL` keyword) are not supported.
Considerations
To increase the data import rate, OceanBase Database uses parallel design for the LOAD DATA operation. During this process, the data to be imported is divided into multiple subtasks for parallel execution. Each subtask is processed as an independent transaction. The subtasks are executed in random order. Therefore, note the following considerations:
- The atomicity of the entire data import cannot be guaranteed.
- For tables without primary keys, the order of data written to the table can be different from the order of data in the file.
Privilege requirements
To execute the LOAD DATA statement, you must have the FILE privilege and the INSERT privilege on the destination table. For more information about OceanBase Database privileges, see Privilege types in MySQL mode.
Syntax
LOAD DATA [hint_options]
[REMOTE_OSS]
INFILE 'file_name'
[IGNORE | REPLACE]
INTO TABLE table_name
[PARTITION (partition_name [, partition_name] ...)]
[CHARACTER SET charset_name_or_default]
[field_opt]
[line_opt]
[{IGNORE | GENERATED} number {LINES | ROWS}]
[(column_name_var [, column_name_var] ...)]
[SET load_set_list]
[load_data_extended_option_list];
hint_options:
[/*+ PARALLEL(N) [load_batch_size(M)] [APPEND | direct(bool, int)] */]
charset_name_or_default:
`string_charset_name`
| BINARY
| DEFAULT
field_opt:
{COLUMNS | FIELDS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
line_opt:
LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
load_set_list:
load_set_element [, load_set_element ...]
load_set_element:
column_definition_ref = {expr | DEFAULT}
load_data_extended_option_list:
load_data_extended_option [load_data_extended_option ...]
load_data_extended_option:
LOGFILE [=] string_value
| REJECT LIMIT [=] int_num
| BADFILE [=] string_value
Parameters
| Parameter | Description |
|---|---|
| hint_options | Optional. Specifies hint options. For more information, see hint_options. |
| REMOTE_OSS | Optional. Specifies to read data files from the Alibaba Cloud OSS file system. If you specify this option, the data files are read from the Alibaba Cloud OSS file system. If you do not specify this option, the data files are read from the file system of the server where the OBServer node resides. |
| file_name | Specifies the path and name of the input file. The file_name parameter is in the following format:
Note: When you import a file from OSS, make sure that the following conditions are met:
|
| IGNORE | REPLACE | Optional. Specifies the behavior when the source data conflicts with a unique key of the destination table. LOAD DATA determines whether data is duplicated based on the primary key of the table. If the table does not have a primary key, the REPLACE and IGNORE options behave identically. If this option is omitted, LOAD DATA logs the conflicting data records to the log file when duplicate data is encountered. The options are described as follows:
Note: When using the |
| table_name | Specifies the name of the destination table, namely, the table to which the data is to be imported. Partitioned and non-partitioned tables are supported. |
| PARTITION (partition_name [, partition_name ...]) | Optional. If the destination table is a partitioned table, you can use the PARTITION clause to explicitly specify the partitions to which the data is to be imported. Multiple partitions are separated with commas (,).
Note: The |
| CHARACTER SET charset_name_or_default | Optional. Specifies the character set of the import file. For more information, see charset_name_or_default. |
| field_opt | Optional. Specifies the field options to define how the field delimiter, text qualifier, and field escape character are interpreted. For more information, see field_opt. |
| line_opt | Optional. Specifies how to process the content of rows, including the start and end characters of rows. For more information, see line_opt. |
| {IGNORE | GENERATED} number {LINES | ROWS} | Optional. Specifies the options to process the content of data files. The LINES and ROWS options are synonyms and both indicate rows in the file. The options are described as follows:
|
| column_name_var | Optional. Specifies the column names in the destination table, namely, the columns to which the data from the data file is to be imported. If you do not specify this option, the data in the fields in the input file is imported to the destination table by sequentially matching columns. If the input file does not contain all the columns of the destination table, the missing columns are filled with default values based on the following rules:
|
| SET load_set_list | Optional. Allows you to set or modify the values of fields before they are imported to the destination table, namely, to convert or calculate data when it is being imported. load_set_list is a comma-separated list. Each item load_set_element in the list specifies how to set or convert a specific field (column). For more information, see load_set_element. |
| load_data_extended_option_list | Optional. Provides further control over the data loading process. Each load_data_extended_option defines a specific aspect of the loading behavior. For more information, see load_data_extended_option. |
hint_options
- `PARALLEL(N)`: specifies the degree of parallelism for loading data. The default value of `N` is `4`.
- `load_batch_size(M)`: specifies the batch size for inserting data. The default value of `M` is `100`. We recommend that you set the value of `M` in the range of [100, 1000].
- `APPEND | direct(bool, int)`: enables the direct load feature by using a hint.
  - `APPEND`: enables the direct load feature, which allows space to be allocated and data to be written directly in data files. By default, the `APPEND` hint is equivalent to `direct(true, 0)`. In this case, online statistics collection (the `GATHER_OPTIMIZER_STATISTICS` hint) is supported. For more information, see Online statistics collection.
  - `direct(bool, int)`: enables the direct load feature. The parameters are described as follows:
    - `bool`: specifies whether the data to be written needs to be sorted. The value `true` indicates that the data needs to be sorted, and the value `false` indicates that it does not.
    - `int`: the maximum number of error rows tolerated.
For more information about how to use the `LOAD DATA` statement to enable the direct load feature, see Bypass data import.
charset_name_or_default
- `string_charset_name`: the name of the character set, which is usually a string value, such as `gbk` or `latin1`.
- `BINARY`: specifies the `binary` character set, which means no conversion, that is, no conversion from or to the `binary` character set.
- `DEFAULT`: specifies to use the default character set of the database to interpret character data in files. The default character set of OceanBase Database is `utf8mb4`.
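Why the character set matters becomes clear when the same bytes are decoded under different encodings. A minimal Python illustration (the byte value is chosen for the example):

```python
# The GBK encoding of the character "中" is the two bytes 0xD6 0xD0.
raw = b"\xd6\xd0"

as_gbk = raw.decode("gbk")        # interpreted under the intended character set
as_latin1 = raw.decode("latin1")  # same bytes, wrong character set

print(as_gbk)     # 中
print(as_latin1)  # ÖÐ
```

If the `CHARACTER SET` clause does not match the file's actual encoding, imported text is silently garbled in exactly this way.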
field_opt
- `{COLUMNS | FIELDS}`: the keywords `COLUMNS` and `FIELDS` are synonyms that indicate the columns in the data file.
- `TERMINATED BY 'string'`: specifies the delimiter between fields in the data file. When a row of the file is read, the specified terminator marks the end of one field and the beginning of the next.
- `[OPTIONALLY] ENCLOSED BY 'char'`: specifies the character that encloses (quotes) field values in the data file. If a field contains the delimiter (for example, a comma), the field is often enclosed in double quotation marks.
  - `OPTIONALLY`: an optional modifier indicating that the enclosing character is not mandatory: some fields may be enclosed while others are not. This does not affect the normal operation of the statement.
- `ESCAPED BY 'char'`: specifies the escape character for special characters in the data file. For example, if a field value contains a double quotation mark, the character must be escaped by preceding it with the escape character, as in `"Hello \"world\""`.
For example, you can use the FIELDS keyword to configure the field handling mode of the LOAD DATA statement, the TERMINATED BY keyword to set the delimiter for fields (comma , in this example), the OPTIONALLY ENCLOSED BY keyword to specify that field values can be enclosed with double quotation marks ("), and the ESCAPED BY keyword to specify the backslash (\) as the escape character to handle special characters in field values.
Notice
The following SQL statement is an example of the field_opt option of the LOAD DATA statement. It is not a complete LOAD DATA statement, so you cannot execute it.
...
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
...
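The field semantics in the fragment above can be reproduced with Python's `csv` module, whose `delimiter`, `quotechar`, and `escapechar` options correspond to `TERMINATED BY`, `ENCLOSED BY`, and `ESCAPED BY`. The sample line is invented for the illustration:

```python
import csv

# One line of a hypothetical CSV file: the second field is quoted and
# contains an escaped double quotation mark.
line = '1,"Hello \\"world\\"",2024-01-01'

# delimiter  ~ TERMINATED BY ','
# quotechar  ~ OPTIONALLY ENCLOSED BY '"'
# escapechar ~ ESCAPED BY '\'
reader = csv.reader([line], delimiter=",", quotechar='"', escapechar="\\")
fields = next(reader)
print(fields)  # ['1', 'Hello "world"', '2024-01-01']
```

Note how the enclosing quotes and the escape characters are consumed during parsing; only the field values themselves are imported.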
line_opt
- `LINES`: keyword indicating that the following options in the `LOAD DATA` statement apply to the rows in the file.
- `STARTING BY 'string'`: specifies the start delimiter of a row, which is the identifier of the start of each row.
- `TERMINATED BY 'string'`: specifies the end delimiter of a row, which is the identifier of the end of each row.
For example, you can set the start delimiter of each row to the specific string ***, namely, the data in each row starts with ***. You can set the end delimiter of a row to the line break character (\n).
Notice
The following SQL statement is an example of the line_opt option in the LOAD DATA statement. It is not a complete LOAD DATA statement, so you cannot execute it.
...
LINES
STARTING BY '***'
TERMINATED BY '\n'
...
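The row-framing logic above can be sketched in Python. This is an illustrative simulation assuming MySQL-style semantics, where content before the start prefix is discarded and rows that lack the prefix are skipped:

```python
def split_rows(data, starting_by="***", terminated_by="\n"):
    """Frame rows the way LINES STARTING BY / TERMINATED BY does:
    split on the row terminator, keep only rows that contain the
    start prefix, and discard anything before (and including) it."""
    rows = []
    for raw in data.split(terminated_by):
        pos = raw.find(starting_by)
        if pos == -1:
            continue  # rows without the prefix are skipped
        rows.append(raw[pos + len(starting_by):])
    return rows

data = "***1,a\njunk***2,b\nno prefix here\n***3,c\n"
print(split_rows(data))  # ['1,a', '2,b', '3,c']
```

Note that the second row keeps only the content after `***`, and the third row is dropped entirely because it has no prefix.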
load_set_element
`column_definition_ref = {expr | DEFAULT}`: specifies a single setting or conversion operation.

- `column_definition_ref`: the name of the destination table column. The data in this column is set or converted during the import process.
- `{expr | DEFAULT}`:
  - `expr`: an expression that defines how the values in the column are calculated or converted. The expression can operate on the original values in the data file. For example, you can use an expression to concatenate two fields or perform a mathematical operation on a numeric field.
  - `DEFAULT`: sets the values in the column to their default values, which are usually specified when the column is defined.
For example, set the value of the col_name1 column to xxxx for each imported data record, and set the value of the col_name2 column to its default value.
Notice
The following SQL statement is an example of the load_set_element option in the LOAD DATA statement. It is not a complete LOAD DATA statement, so you cannot execute it.
...
SET col_name1 = 'xxxx', col_name2 = DEFAULT;
...
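The effect of a `SET` list can be thought of as a per-row transformation applied before insertion. The following Python sketch is a hypothetical simulation; the column names, expressions, and default values are invented for the example:

```python
DEFAULT = object()  # sentinel standing in for the SQL DEFAULT keyword

def apply_set_list(row, set_list, defaults):
    """Apply a SET load_set_list to one imported row: each entry either
    evaluates an expression against the row or restores the column's
    declared default value."""
    out = dict(row)
    for column, value in set_list.items():
        out[column] = defaults[column] if value is DEFAULT else value(row)
    return out

row = {"col_name1": "raw", "col_name2": "raw", "qty": 3}
set_list = {
    "col_name1": lambda r: "xxxx",   # SET col_name1 = 'xxxx'
    "col_name2": DEFAULT,            # SET col_name2 = DEFAULT
    "qty": lambda r: r["qty"] * 2,   # an expression over the row's values
}
defaults = {"col_name2": "n/a"}
result = apply_set_list(row, set_list, defaults)
print(result)  # {'col_name1': 'xxxx', 'col_name2': 'n/a', 'qty': 6}
```

The lambdas play the role of `expr`: they may read any of the row's original values, just as a SQL expression in the `SET` clause can.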
load_data_extended_option
Notice
The load_data_extended_option option is supported only in syntax. It does not take effect during execution.
- `LOGFILE [=] string_value`: specifies a log file that records specific events or errors that occur during data loading. `string_value` is the path and file name of the destination log file. For example, `LOGFILE = 'load_data.log'` records the log information in the file named `load_data.log`.
- `REJECT LIMIT [=] int_num`: specifies the maximum number of errors allowed before the `LOAD DATA` operation is terminated. `int_num` is an integer that represents the number of errors. If more than this number of errors occur during data loading, the entire loading operation fails. For example, `REJECT LIMIT = 10` stops loading data after 10 errors occur.
- `BADFILE [=] string_value`: similar to `LOGFILE`, specifies a file that stores the data rows identified as invalid or that cannot be imported during data loading. `string_value` is the path and name of the file where these data rows are stored. For example, `BADFILE = 'bad_data.txt'` writes the bad data rows to the `bad_data.txt` file.
Examples
Example 1: Import data from a file on the server (OBServer node)
1. Set the global secure path.

   Notice: For security reasons, when you set the system variable `secure_file_priv`, you can connect to the database only through a local socket to execute the SQL statement that modifies the global variable. For more information, see secure_file_priv.

   SET GLOBAL secure_file_priv = "/";

2. Log out.

   Note: Because `secure_file_priv` is a `GLOBAL` variable, you must execute `\q` to exit for the change to take effect.

   obclient> \q

3. Reconnect to the database and use the `LOAD DATA` statement to import data.

   - Perform a regular import.

     LOAD DATA INFILE '/home/admin/test.csv' INTO TABLE t1 FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';

   - Use the `APPEND` hint to enable direct load.

     LOAD DATA /*+ PARALLEL(4) APPEND */ INFILE '/home/admin/test.csv' INTO TABLE t1 FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
Example 2: Import data from a file in Alibaba Cloud OSS
Use the direct(bool, int) hint to enable direct load. The file to be imported can be located in Alibaba Cloud OSS.
LOAD DATA /*+ direct(true,1024) parallel(16) */
REMOTE_OSS INFILE 'oss://antsys-oceanbasebackup/backup_rd/xiaotao.ht/lineitem2.tbl?host=***.oss-cdn.***&access_id=***&access_key=***'
INTO TABLE tbl1
FIELDS
TERMINATED BY ',';
References
- For more information about how to use the `LOAD DATA` statement, see Import data by using the LOAD DATA statement.
- For more information about how to use the `LOAD DATA` statement for direct load, see Import data by using the LOAD DATA statement.