Use the LOAD DATA statement to load CSV data files to OceanBase Database|V4.0.0| docs|Distributed Database

Use the LOAD DATA statement to load CSV data files to OceanBase Database

Last Updated：2023-07-21 09:11:01 Updated

The LOAD DATA statement in OceanBase Database in MySQL mode is equivalent to the LOAD DATA statement in MySQL. The t1.csv file imported to OceanBase Database is the file exported in the previous topic. The .csv file must be placed on the OBServer node. OceanBase Database cannot load local files. This feature is still under development.

Syntax of the LOAD DATA statement

LOAD DATA
  [/*+ parallel(N) load_batch_size(M)*/]
    INFILE 'file_name'
    [REPLACE | IGNORE]
    INTO TABLE tbl_name
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(col_name_var
        [, col_name_var] ...)]
    [SET col_name={expr | DEFAULT},
        [, col_name={expr | DEFAULT}] ...]

Principles of the LOAD DATA statement

You can use the LOAD DATA statement to import a CSV file in the following procedure:

Parse the file: OceanBase Database reads data from the file based on the file name you entered and determines whether to perform parallel or serial parsing of data from the input file based on the degree of parallelism (DOP) you specified.
Distribute the data: OceanBase Database is a distributed database. Data of each partition may be distributed across different OBServer nodes. The LOAD DATA statement is used to process the parsed data and determine to which OBServer node the data is to be sent.
Insert the data: After the destination OBServer node receives the data, it executes the INSERT statement to insert the data into the corresponding partitions.

To improve the performance of the LOAD DATA statement, you can specify the DOP for loading data. The LOAD DATA statement can work in multiple threads during file parsing, partition computation, and data distribution.

To avoid the impact of distributed transactions on performance, the LOAD DATA statement groups data by partition and distributes data to the OBServer node for multiple write operations. It opens an independent transaction for each write operation that writes data from one group. If an error occurs during the execution of the LOAD DATA statement, you must manually delete the loaded data. If a large data file is imported, it may take a long time to insert data into each node. In this case, adjust the value of the ob_query_timeout parameter as needed.

The LOAD DATA statement provides many options to fit your needs, including:

DOP

The /* + parallel(N)*/ option specifies the DOP for data loading. The default value of N is 4. Recommended value range: [0, Maximum number of CPU cores in the tenant].
```
load data /*+ parallel(4) */infile '/home/admin/a.csv' into table t
```
Batch size

/ *+ load_batch_size(M)* / specifies the batch size of each insertion. The default value of M is 1000. You can adjust the value of M based on the total length of the data rows to be imported. Recommended value range: [[100, 1000]].
Input file

The INFILE 'file_name' keyword specifies the directory and name of the input file. In the current OceanBase Database version, the LOAD DATA statement supports only loading a file located on the local OBServer. Therefore, before you import a file, you need to copy the file to the OBServer where the OBServer node runs and execute the LOAD DATA statement.
Index of the destination table

To improve the import efficiency, we recommend that you create a base table first and create an index for the table after data import is completed. You must create the global index after the data import is completed. Otherwise, the not support error may be reported.
Execution privileges

You need to perform the following steps to grant access to a file on the server:
1. Modify the path where the security file is located.
  
  Set the path of the security file to null to skip security check. Then exit the session and log on again for the setting to take effect.
```
Run set global secure_file_priv = "".
```
2. Grant the file privilege to a user.
  
  You can run the following command to grant the file privilege to a user in MySQL mode.
```
`grant file on *.* to USER_NAME;`
```
Duplicate data processing

REPLACE specifies to replace the original data in the table with data in the input file. IGNORE specifies to ignore duplicate data in the input file.

The LOAD DATA statement uses the primary key of a table to identify duplicate data. If the table does not have a primary key, the LOAD DATA statement cannot identify duplicate data, and the REPLACE and IGNORE options do not take effect.

If you do not specify this option, the LOAD DATA statement records the erroneous data into the log file in the case of duplicate data.
Destination table option

The INTO TABLE tbl_name keyword is used to specify the name of the destination table. The LOAD DATA statement supports partitioned and non-partitioned tables.
Field format

The FIELDS\|COLUMNS clause specifies the delimiter options for fields in the input file. Notes:
- TERMINATED BY specifies the end character of each field.
- Enclosed By specifies a special character that encloses each field.
- Escaped By specifies a wildcard character for the imported fields.
Line format

`` The line format specifies the delimiter for each line in the input file. Notes:
- Starting By specifies the start character of each line.
- Terminated By specifies the end character of each line.
- The IGNORE number {LINES \| ROWS} clause indicates to ignore a specified number of rows from the beginning of the input file.
```
load data /*+ parallel(4) */infile '/home/admin/a.csv' into table t fields terminated by ',' lines terminated by '\n';
```

Log file

If an error occurs during the import, according to the design of the LOAD DATA statement, the INSERT statement will be rolled back, and the LOAD DATA statement generates a log file named obloaddata.log in the \log subdirectory of the installation directory of the OBServer. The following shows a sample log file.

Tenant name: mysql
File name: /home/admin/a.csv
Into table: `test`.`t`
Parallel: 1
Batch size: 1000
SQL trace: YD7A20BA65670-0005AADAAA3CAB52
Start time: 2020-07-29 21:08:13.073741
Load query:
load data infile '/home/admin/test.csv' into table t fields terminated by ',' lines terminated by '\n'

Row ErrCode ErrMsg
1 1062 Duplicated primary key
2 1062 Duplicated primary key

The log contains the basic information about the task generated by the LOAD DATA statement, including the tenant name, input file name, destination table name, DOP, LOAD DATA statement used, and error details in lines.

Example

Import the t1.csv file exported from MySQL in the preceding topic to an OceanBase database.

$ mysql -h127.1 -u****@obmysql#obdemo -P2881 -p123456 -c -A test -Ns
MySQL [test]> select * from t1;
MySQL [test]> load data infile '/tmp/t1.csv' into table t1 fields terminated by ',' enclosed by '"' lines terminated by '\n' ;
MySQL [test]> select * from t1;
1        a b   a b
2        a,b   "a b"
3        a\nb a \nb
4        a\\b  "a\\b"
MySQL [test]>