Prepare data (V4.3.2)

This topic describes the data file requirements and considerations to observe when you import data to or export data from OceanBase Database. To successfully import your data to OceanBase Database, make sure that your data meets the requirements. To generate data files that meet the requirements by using any data import/export tools, take note of the limitations and considerations in this topic.

By default, OBDUMPER exports data to a file sized up to 1 GB. When the exported data exceeds 1 GB, OBDUMPER generates a new file and continues to write data to it.

OBLOADER supports logical splitting of a data file without generating temporary files. It can split a file sized more than 100 GB within a short time, parse all sub-files in parallel, and then concurrently import them to the database. Therefore, you do not need to start multiple OBLOADER processes for concurrent import.
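The idea of logical splitting can be illustrated with a minimal Python sketch. It is not OBLOADER's implementation: the function name `logical_chunks` and the chunk size are hypothetical, and the sketch assumes that no row contains an embedded line break. It only computes byte ranges that end on line boundaries, so that workers could read one range each from the same file without any temporary sub-files.

```python
import os
import tempfile

def logical_chunks(path, chunk_size):
    """Return (start, end) byte ranges that each end on a line boundary."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        while start < size:
            # Jump to the tentative end of the chunk, then extend the
            # chunk to the end of the line that the jump landed in.
            f.seek(min(start + chunk_size, size))
            f.readline()
            end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges

# Demo on a small temporary file with ten 5-byte rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as f:
        f.write(b"".join(b"row%d\n" % i for i in range(10)))
    ranges = logical_chunks(path, chunk_size=12)

print(ranges)
```

Because the ranges are contiguous and never cut a line, each one can be parsed independently and in parallel.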

File size limit

We recommend that you do not import or export a very large single file. If your data export tool cannot write its output as multiple smaller files, split the file before you import it.

Split a file in Linux or macOS

You can run the built-in split command to split a CSV file into multiple sub-files.

split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

A sample command is as follows:

split -l 100000 pagecounts-20210723.csv pages

This command splits the pagecounts-20210723.csv file by rows. Assume that the file is 8 GB in size and contains 10,000,000 rows. It is split into 100 (10,000,000/100,000 = 100) sub-files with each sub-file containing 100,000 rows. Each sub-file is about 80 MB in size, and its name consists of the prefix pages followed by an alphabetic suffix, such as pagesaa and pagesab.
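The arithmetic and the output names of the sample command can be checked with a short Python sketch. The helper `split_output_names` is hypothetical. It mirrors the default behavior of split, which appends a two-letter alphabetic suffix (aa, ab, ...) to the name given on the command line.

```python
import string

def split_output_names(total_rows, rows_per_file, prefix, suffix_length=2):
    """List the sub-file names that `split -l` would produce."""
    # Ceiling division: a final, partly filled sub-file still counts.
    count = (total_rows + rows_per_file - 1) // rows_per_file
    letters = string.ascii_lowercase
    if count > len(letters) ** suffix_length:
        raise ValueError("too many sub-files for this suffix length")
    names = []
    for i in range(count):
        suffix = ""
        n = i
        # Write the index as a fixed-width base-26 number using a-z.
        for _ in range(suffix_length):
            suffix = letters[n % 26] + suffix
            n //= 26
        names.append(prefix + suffix)
    return names

names = split_output_names(10_000_000, 100_000, "pages")
print(len(names), names[0], names[-1])
```

With the default suffix length of 2, at most 26 × 26 = 676 sub-files can be named, which is why the -a option exists for larger splits.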

Notice

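The split command splits a file by physical lines. If a field in the file contains a line break, a record can be cut in the middle and spread across two sub-files, and the sub-files then cannot be parsed correctly.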
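If the data may contain line breaks inside quoted fields, one option is to split by CSV records instead of physical lines. The following Python sketch shows the idea with the standard csv module. The function name `split_csv_by_records` and the output naming are hypothetical, and the sketch assumes an RFC 4180-style file in UTF-8. A header row, if present, is treated as an ordinary record.

```python
import csv
import os
import tempfile

def split_csv_by_records(src_path, out_dir, rows_per_file, prefix="part"):
    """Split a CSV file into sub-files that each hold whole records."""
    part_paths = []
    out = None
    writer = None
    with open(src_path, newline="", encoding="utf-8") as src:
        # csv.reader keeps a quoted field with line breaks in one record.
        for index, row in enumerate(csv.reader(src)):
            if index % rows_per_file == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{prefix}{len(part_paths):04d}.csv")
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                part_paths.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return part_paths

# Demo: the first record contains a line break inside a quoted field.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    with open(src, "w", newline="", encoding="utf-8") as f:
        f.write('1,"line one\nline two"\r\n2,plain\r\n3,last\r\n')
    parts = split_csv_by_records(src, tmp, rows_per_file=2)
    rows_per_part = []
    for p in parts:
        with open(p, newline="", encoding="utf-8") as f:
            rows_per_part.append(list(csv.reader(f)))

print(rows_per_part)
```

The sub-files are rewritten by the csv module, so quoting may be normalized. The records are preserved, but the output is not guaranteed to be byte-identical to the input. The same script also runs on Windows.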

Split a file in Windows

Windows provides no built-in splitting tool. You can split a large file by using a third-party splitting tool or script.

File formats

OBDUMPER supports CSV, SQL, and delimited text files. SQL files must contain only INSERT statements. Note that a file name extension does not determine the file format. The file format is the way the content of the file is organized. For example, a file named 123.csv normally stores data in the CSV format, and the .csv extension only hints at the file type. You can also store CSV data in a file named 123.txt. A software program can still parse the data file, but the file name no longer tells you what the file contains. To identify the data format, preview some of the data content. This section describes how to identify data formats.
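Previewing the content can be partly automated. The following Python sketch is a rough heuristic, not a feature of OBLOADER or OBDUMPER. The function name `guess_format` and the list of candidate separators are assumptions. It ignores quoting and multi-line statements, so treat its answer as a hint and confirm it by reading the data.

```python
import re

def guess_format(sample):
    """Guess the format of a text sample: insert, csv, or delimited text."""
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # Every non-empty line is an SQL comment or starts an INSERT statement.
    if all(re.match(r"\s*(--|INSERT\s+INTO\b)", ln, re.IGNORECASE) for ln in lines):
        return "insert"
    # Otherwise look for a separator that occurs equally often on every line.
    for sep, label in ((",", "csv"), ("|", "delimited text"), ("\t", "delimited text")):
        counts = {ln.count(sep) for ln in lines}
        if len(counts) == 1 and counts != {0}:
            return label
    return "unknown"

print(guess_format("Name,Age\nJohn Smith,32\n"))
print(guess_format("Name|Age\nJohn Smith|32\n"))
```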

CSV

The CSV format is the most commonly used data format. Take note of the basic format specifications defined in RFC 4180 and make sure that the following conditions are met:

Your CSV file does not contain special characters.

If the data in your CSV file contains delimiters, column separators, line separators, or NULL values, specify the escape character parameter in the export command. Otherwise, the generated CSV file cannot be correctly parsed.

Your CSV file does not contain carriage returns.
You understand the differences between the CSV format and the Excel format.

Here is a sample CSV file:

Name,Age,Occupation,City
John Smith,32,Engineer,New York
Emma Watson,28,Teacher,London
Michael Chen,45,Doctor,San Francisco
Sarah Johnson,39,Lawyer,Chicago

The CSV header is the first row of the CSV file and corresponds to column names in the table. You can choose not to insert the header. CSV records are table data stored in the CSV file.
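As an illustration of the RFC 4180 quoting rules, the following Python sketch writes and re-reads a small CSV file with the standard csv module. The sample values are made up. Whether a specific import tool accepts the same quoting depends on the options you pass to it, which this sketch does not cover.

```python
import csv
import io

rows = [
    ["Name", "Age", "Occupation", "City"],             # header row
    ["John Smith", "32", "Engineer", "New York"],
    ['Emma "Em" Watson', "28", "Teacher", "London, UK"],  # quote and comma in data
]

# Write with minimal quoting: only fields that contain the separator,
# the quote character, or a line break are enclosed in double quotes.
buf = io.StringIO(newline="")
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n").writerows(rows)
csv_text = buf.getvalue()

# Read the text back and split the header from the records.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
header, records = parsed[0], parsed[1:]

print(csv_text)
```

A double quotation mark inside a field is written as two double quotation marks, and the whole field is enclosed in double quotation marks.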

INSERT

To generate a correct INSERT statement, make sure that the following conditions are met:

You have escaped object names or column names that contain SQL keywords.

If your object name or column name is an SQL keyword, enclose it in the identifier quote character of the SQL syntax when you generate an INSERT statement. For example, use double quotation marks (") in Oracle syntax and backticks (`) in MySQL syntax.

You have escaped single quotation marks (') in the data.

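If your data contains single quotation marks ('), escape them when you generate an INSERT statement. For example, write two single quotation marks ('') in Oracle syntax, or two single quotation marks or a backslash followed by a single quotation mark (\') in MySQL syntax. By default, OBDUMPER escapes data during export to ensure that the syntax of the generated INSERT statement is correct.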

Here are some sample INSERT statements:

-- Examples in the Oracle compatible mode
INSERT INTO "employees" ("Name", "Age", "SELECT", "City") VALUES ('John Smith', 32, 'Engineer', 'New York');

INSERT INTO "employees" ("Name", "Age", "SELECT", "City") VALUES ('Emma O''Brien', 28, 'Teacher', 'London');

INSERT INTO "employees" ("Name", "Age", "SELECT", "City") VALUES ('Michael Chen', 45, 'Doctor', 'San Francisco''s Bay Area');

-- Examples in the MySQL compatible mode
INSERT INTO `employees` (`Name`, `Age`, `SELECT`, `City`) VALUES ('John Smith', 32, 'Engineer', 'New York');

INSERT INTO `employees` (`Name`, `Age`, `SELECT`, `City`) VALUES ('Emma O\'Brien', 28, 'Teacher', 'London');

INSERT INTO `employees` (`Name`, `Age`, `SELECT`, `City`) VALUES ('Michael Chen', 45, 'Doctor', 'San Francisco\'s Bay Area');
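The two rules above can be sketched in Python as follows. The helper names (`quote_identifier`, `quote_literal`, `build_insert`) are hypothetical and this is not how OBDUMPER generates statements. The sketch assumes that values are strings, numbers, or None, and that backslash escapes are enabled in MySQL mode, which is why backslashes are doubled there.

```python
def quote_identifier(name, mode):
    """Quote a table or column name: "..." for Oracle, `...` for MySQL."""
    q = '"' if mode == "oracle" else "`"
    return q + name.replace(q, q + q) + q

def quote_literal(value, mode):
    """Render a value as an SQL literal with single quotes escaped."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    if mode == "mysql":
        # Assumption: backslash is an escape character in MySQL mode.
        text = text.replace("\\", "\\\\")
    # Doubling the single quote is valid in both syntaxes.
    return "'" + text.replace("'", "''") + "'"

def build_insert(table, row, mode="mysql"):
    """Build one INSERT statement from a dict of column name to value."""
    cols = ", ".join(quote_identifier(c, mode) for c in row)
    vals = ", ".join(quote_literal(v, mode) for v in row.values())
    return f"INSERT INTO {quote_identifier(table, mode)} ({cols}) VALUES ({vals});"

row = {"Name": "Emma O'Brien", "Age": 28, "SELECT": "Teacher", "City": "London"}
print(build_insert("employees", row, mode="oracle"))
print(build_insert("employees", row, mode="mysql"))
```

String building like this is meant for generating data files only. When a program writes to a database directly, parameterized statements are the safer choice.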

Delimited text

The delimited text and CSV formats are easy to confuse. In the CSV format, the separator is one character, which is the comma (,) by default. However, in the delimited text format, the separator can be one or more characters. The delimited text format has only two basic symbols, column separators and line breaks. It has no delimiters, that is, no characters that enclose a field value. Therefore, data in the delimited text cannot contain characters that conflict with the basic symbols. Otherwise, it cannot be correctly parsed. For example, if the data contains a line break, it is parsed into two rows.

Here is a sample delimited text file:

Name|Age|Occupation|City
John Smith|32|Engineer|New York
Emma Watson|28|Teacher|London
Michael Chen|45|Doctor|San Francisco
Sarah Johnson|39|Lawyer|Chicago
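Because the format has no quoting, it is worth checking the data for conflicts before writing it. The following Python sketch shows one way to do that. The function names and the multi-character separator `|@|` are made up for illustration.

```python
def to_delimited_line(fields, column_sep="|", line_sep="\n"):
    """Join fields into one row, rejecting data that conflicts with the symbols."""
    for field in fields:
        if column_sep in field or line_sep in field or "\n" in field or "\r" in field:
            raise ValueError(f"field conflicts with a basic symbol: {field!r}")
    return column_sep.join(fields) + line_sep

def parse_delimited(text, column_sep="|", line_sep="\n"):
    """Split text into rows and fields. No quoting or escaping is applied."""
    return [line.split(column_sep) for line in text.split(line_sep) if line]

records = [
    ["John Smith", "32", "Engineer", "New York"],
    ["Emma Watson", "28", "Teacher", "London"],
]

# A separator may be longer than one character in this format.
text = "".join(to_delimited_line(r, column_sep="|@|") for r in records)
parsed = parse_delimited(text, column_sep="|@|")
print(text)
```

A longer, unusual separator lowers the chance of a conflict with the data, but only a check like the one above can rule it out.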

Processing of semi-structured data

Common semi-structured data formats are JSON and XML. JSON and XML are also composite data definition formats. For example, a JSON or XML data node can store data in any format such as CSV data or INSERT statements. Therefore, when you prepare data, you must properly delimit and escape the data. At present, a file containing data in the JSON or XML format cannot be correctly split. Therefore, when you use OBLOADER to import the file, you can modify the file splitting threshold to make the program skip automatic splitting for the file.
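As a small illustration of delimiting and escaping, the following Python sketch stores a JSON document in one column of a CSV file and reads it back. The column names and the sample document are made up. The sketch relies on `json.dumps`, which by default emits the document on a single line, and on the csv module, which quotes the field because it contains commas and double quotation marks.

```python
import csv
import io
import json

profile = {"skills": ["SQL", "Go"], "note": 'says "hi", often'}

buf = io.StringIO(newline="")
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["id", "profile"])
# The JSON text contains commas and double quotes, so the csv module
# encloses the field in quotes and doubles the embedded quotes.
writer.writerow([1, json.dumps(profile)])
csv_text = buf.getvalue()

# Parse the CSV first, then parse the JSON stored in the second column.
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
restored = json.loads(rows[1][1])
print(csv_text)
```

The order matters in both directions: the JSON is serialized before the CSV layer escapes it, and the CSV layer is parsed before the JSON is decoded.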

Processing of binary large objects

You must encode binary large objects such as data of the RAW, BINARY, or large object (LOB) type into hexadecimal strings for storage and parsing. By default, OBDUMPER and mysqldump encode binary large objects in the same way. In extreme cases, the LOB type can store hundreds of megabytes or even several gigabytes of data. This poses a great challenge to the performance and storage space of import and export tools. In your daily operations, you can process tables that contain binary large objects separately.
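Hexadecimal encoding itself is straightforward, as the following Python sketch shows. The exact literal form that a tool expects around the hexadecimal string is not covered here, and the helper `hex_chunks` is a hypothetical way to encode a large object piece by piece instead of loading it into memory at once.

```python
import io

payload = bytes([0x00, 0xFF, 0x10, 0x41])

# Each byte becomes two hexadecimal digits, so the text is twice as long.
encoded = payload.hex().upper()
decoded = bytes.fromhex(encoded)

def hex_chunks(stream, chunk_size=64 * 1024):
    """Yield the hexadecimal encoding of a binary stream chunk by chunk."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk.hex()

streamed = "".join(hex_chunks(io.BytesIO(payload), chunk_size=3))
print(encoded, decoded == payload)
```

The doubling in size is one reason why tables with very large objects put pressure on the storage space of the exported files.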

Processing of datetime types

In an Oracle database or the Oracle compatible mode of OceanBase Database, datetime types are complex and likely to cause errors in operations, such as precision, format, and time zone errors.

You must specify the datetime formats when you export data from the source database and import data to the destination database. The datetime format in the destination database must be the same as that in the source database. Otherwise, datetime types may not be imported correctly.
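The effect of a format mismatch can be shown with Python's datetime module. The format strings below use Python's format codes, not the format options of any database or import tool, and the sample timestamp is made up.

```python
from datetime import datetime

EXPORT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"   # keeps microseconds
SHORT_FORMAT = "%Y-%m-%d %H:%M:%S"       # drops the fractional part

value = datetime(2021, 7, 23, 14, 5, 9, 123456)

# Same format on both sides: the value survives the round trip.
exported = value.strftime(EXPORT_FORMAT)
imported = datetime.strptime(exported, EXPORT_FORMAT)

# Exporting with a coarser format silently loses precision.
truncated = datetime.strptime(value.strftime(SHORT_FORMAT), SHORT_FORMAT)

# Importing with a format that differs from the export format fails.
try:
    datetime.strptime(exported, SHORT_FORMAT)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(exported, imported == value, truncated == value)
```

The two failure modes differ: a coarser export format loses data without any error, whereas a mismatched import format is rejected outright.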

Datetime types vary based on databases. For example, the DATE type represents different datetime information in different databases such as MySQL, DB2, Oracle, and OceanBase Database. Read the database documentation to resolve compatibility issues. Otherwise, precision loss or write failures may occur.