Description
The statement is used to import data from an external source.
In OceanBase Database, the LOAD DATA statement can load input files from OceanBase Database and Alibaba Cloud OSS, where the files are stored on OBServer nodes.
Note
- The current version of OceanBase Database does not support using the
LOCAL INFILEsyntax to load data files. - Tables with triggers are not supported.
The LOAD DATA statement can currently import text files in CSV format. The entire import process consists of the following steps:
Parse the file: OceanBase Database reads data from the file based on the file name specified by the user and parses the data in the file in parallel or serially based on the specified degree of parallelism.
Distribute data: As OceanBase Database is a distributed database, data in partitions can be distributed across OBServer nodes. The
LOAD DATAstatement calculates the destination OBServer node for each data record parsed from the input file.Insert data: Once a data record reaches its destination OBServer node, the node executes an
INSERTstatement to insert the record into the corresponding partition.
To import data from external files, you must have the FILE privilege. You can use the GRANT FILE ON *.* TO $user_name; statement to grant the privilege to the user who needs to execute the LOAD DATA statement, where $user_name is the name of the user.
Notice
If the statement is executed with the REPLACE or IGNORE clause and the degree of parallelism is greater than 1, the last inserted record for a conflicting row may be different from that obtained by serial execution. To strictly ensure the insertion results for conflicting rows, do not specify the degree of parallelism in the statement or set the degree of parallelism to 1.
Syntax
LOAD DATA
[ [/*+ PARALLEL(N) load_batch_size(M) APPEND */]
| [/*+ PARALLEL(N) direct(bool, int) */] ]
[REMOTE_OSS] INFILE 'file_name'
[REPLACE | IGNORE]
INTO TABLE table_name
[{FIELDS | COLUMNS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
[IGNORE number {LINES | ROWS}]
[(column_name_var
[, column_name_var] ...)]
Parameter description
| Parameter | Description |
|---|---|
| parallel(N) | The degree of parallelism for loading data. The default value of N is 4. |
| load_batch_size(M) | The number of rows to be inserted at a time. The default value of M is 100. We recommend that you set this parameter to a value between 100 and 1000. |
| APPEND | Use this hint to enable direct load, which allows you to allocate space and write data directly in the data file. By default, the APPEND hint is equivalent to direct(true, 0). It also enables online collection of optimizer statistics (GATHER_OPTIMIZER_STATISTICS). |
| direct | Use this hint to enable direct load. The bool parameter in direct(bool, int) specifies whether to sort the given CSV file, where true indicates that sorting is required, and the int parameter specifies the maximum number of allowed error rows. |
| REMOTE_OSS | Specifies whether to read data from the OSS file system.
NoticeIf you specify this parameter, the |
| file_name | The path and name of the input file. The format of the file_name parameter is as follows:
NoteBefore you import a file from OSS, make sure that:
|
| REPLACE | IGNORE | Specifies the action to take when a unique key conflict occurs. REPLACE specifies to overwrite the conflicting row, and IGNORE specifies to ignore the conflicting row. LOAD DATA determines whether data is duplicated based on the primary key of the table. If the table does not have a primary key, the REPLACE and IGNORE options are the same. By default, when LOAD DATA encounters duplicate data, it records the error in a log file. |
| table_name | The name of the table to which the imported data is to be loaded. The table can be a partitioned or non-partitioned table. |
| FIELDS | COLUMNS | Specifies the format of fields.
|
| LINES STARTING BY | Specifies the start delimiter of a line. |
| LINES TERMINATED BY | Specifies the end delimiter of a line. |
| IGNORE number { LINES | ROWS } | Specifies to ignore the first several lines. LINES specifies to ignore several lines in the file, and ROWS specifies to ignore several rows separated by field delimiters. By default, the system matches the fields in the input file one by one with the columns in the table. If the input file does not contain all the columns, the missing columns are filled with default values that are specified in the following rules:
|
| column_name_var | The name of the column to be imported. |
Examples
Example 1: Import data from a file on the server (OBServer node)
Set the global secure path.
Notice
For security reasons, when you set the system variable
secure_file_priv, you can connect to the database only through a local socket to execute the SQL statement that modifies the global variable. For more information, see secure_file_priv.obclient> SET GLOBAL secure_file_priv = "";Log out.
Note
After you set the
secure_file_privvariable, you must execute the\qcommand to log out for the change to take effect.obclinet> \qThe return result is as follows:
ByeAfter you reconnect to the database, execute the
LOAD DATAstatement to import data.Perform a regular import.
obclient> LOAD DATA INFILE '/home/admin/test.csv' INTO TABLE t1;Use the
APPENDhint to enable direct load.LOAD DATA /*+ PARALLEL(4) APPEND */ INFILE '/home/admin/test.csv' INTO TABLE t1;
Example 2: Import data from an OSS file
Notice
When you use an object storage path, make sure that the values of the parameters in the object storage path are composed of uppercase and lowercase letters, digits, and the following special characters: \/-_$+=. If you enter characters other than the preceding ones, the setting may fail.
Use the direct(bool, int) hint to enable direct load for a file that is stored in OSS.
LOAD DATA /*+ direct(true,1024) parallel(16) */ REMOTE_OSS INFILE 'oss://antsys-oceanbasebackup/backup_rd/xiaotao.ht/lineitem2.tbl?host=***.oss-cdn.***&access_id=***&access_key=***' INTO TABLE tbl1 FIELDS TERMINATED BY ',';