Data Importing|V3.3.3|OceanBase TuGraph| docs|Distributed Database

This document describes the data import function of TuGraph. These include delimiters in CSV format, sample formats for jsonline, and two modes for importing online and offline.

After the installation is successful, you can use the lgraph_import batch import tool to import existing data into TuGraph. lgraph_import supports importing data from CSV files and JSON data sources.

sonline format, a row of a json array string

CSV format

[movies.csv]
id, name, year, rating
tt0188766,King of Comedy,1999,7.3
tt0286112,Shaolin Soccer,2001,7.3
tt4701660,The Mermaid,2016,6.3

jsonline format

["tt0188766","King of Comedy",1999,7.3]
["tt0286112","Shaolin Soccer",2001,7.3]
["tt4701660","The Mermaid",2016,6.3]

TuGraph supports two data importing modes：

offline：Read the data and import it into a data file for the specified server, it should only be done when the server is offline.
online：Read the data and send it to the running server, which then imports the data into its database.

CSV file format delimiter

CSV delimiters can be single-character or multi-character strings, which cannot contain 'r' or 'n'. Note that different shells handle input strings differently, so different escape handling may be required for different shell input parameters.

In addition, 'lgraph_import' also supports the following escape characters for entering special symbols:

Escape character	Instructions
\	The backslash`\\`
\a	ASCII code 0x07
\f	form-feed，ASCII code 0x0c
\t	Horizontal tab characters， ASCII code 0x09
\v	Vertical tab characters， ASCII code 0x0b
\xnn	Two hexadecimal digits representing one byte，example \x9A
\nnn	A three-digit octal number representing a single byte，example \001, \443，data range cannot exceed 255

Demo：

$ ./lgraph_import -c ./import.config --delimiter "\001\002"

The configuration file

`lgraph_import`tool configures the environment using the specified configuration file. The configuration file describes the paths to the input files, the vertices/edges they represent, and the vertex/edge format.

Configuration File Format

The configuration file consists of two parts: schema and files. The 'schema' part defines label, and the 'files' part describes the data files to be imported.

The keyword

schema (Array）
- label（required，string）
- type（required， VERTEX or EDGE）
- properties（Array，required for vertex，If there is no property for the edge, you can leave it unconfigured）
  - name（required, string）
  - type （required,，BOOL，INT8，INT16，INT32，INT64，DATE，DATETIME，FLOAT，DOUBLE，STRING，BLOB）
  - optional（optional，Indicates that the field can be configured or not configured）
  - index（optional，Whether the field needs to be indexed）
  - unique（Optional, whether the field indexed and is of type unique）
- primary (Specify a property that uniquely identifies a point in the primary key field)
- constraints (Edge-only configuration, optional, array form, start and end labels, unconfigured or empty means unrestricted)
files （Array）
- path（required，string，The value can be a file path or a directory path. If it is a directory, all files in the directory will be imported. Ensure that they have the same schema）
- header（Optional, numeric, header in the first few lines of the file, or 0）
- format（required，JSON or CSV）
- label（required，string）
- columns（Array）
  - SRC_ID (Special string，only on the edges,That means this column is the source data)
  - DST_ID (A special string, only on the edges, indicates that this column is the destination data)
  - SKIP (A special string to skip this column of data)
  - [property]
- SRC_ID (Edge-only configuration. The value is the start label)
- DST_ID (Edge-only configuration. The value is the destination point label)

Example Configuration File

{
  "schema": [
    {
      "label": "actor",
      "type": "VERTEX",
      "properties": [
        { "name": "aid", "type": "STRING" },
        { "name": "name", "type": "STRING" }
      ],
      "primary": "aid"
    },
    {
      "label": "movie",
      "type": "VERTEX",
      "properties": [
        { "name": "mid", "type": "STRING" },
        { "name": "name", "type": "STRING" },
        { "name": "year", "type": "INT16" },
        { "name": "rate", "type": "FLOAT", "optional": true }
      ],
      "primary": "mid"
    },
    {
      "label": "play_in",
      "type": "EDGE",
      "properties": [{ "name": "role", "type": "STRING", "optional": true }],
      "constraints": [["actor", "movie"]]
    }
  ],
  "files": [
    {
      "path": "actors.csv",
      "header": 2,
      "format": "CSV",
      "label": "actor",
      "columns": ["aid", "name"]
    },
    {
      "path": "movies.csv",
      "header": 2,
      "format": "CSV",
      "label": "movie",
      "columns": ["mid", "name", "year", "rate"]
    },
    {
      "path": "roles.csv",
      "header": 2,
      "format": "CSV",
      "label": "play_in",
      "SRC_ID": "actor",
      "DST_ID": "movie",
      "columns": ["SRC_ID", "role", "DST_ID"]
    }
  ]
}

For the above Example Configuration File, three labels defined: two point types' actor 'and' movie ', and one edge type 'role'. Each label describes: the name of the label, the type (dot or edge), which attribute fields are available, and the type of each field. For points, the primary field also defined; For edges, the constraints field also defined, which limits the starting and ending points of the edges.

It also describes three data files, two dot data files' actors.csv 'and' movies.csv ', and one edge data file 'roles.csv'. Each section describes the path of the file, the format of the data, the first few lines of the header, the label of the data, and the corresponding field of each column in each row of data in the file.

For the above configuration files, the import tool will first create labels' actor ', 'movie', and 'role' in TuGraph, and then perform data import of the three files.

Offline full import

The offline mode can be used only on offline servers. An offline import creates a new graph, so it's more suitable for your first data import on a newly installed TuGraph server.

To use the 'lgraph_import' tool in offline mode, you can specify the 'lgraph_import --online false' option. To see the command-line options available, use 'lgraph_import --online false --help' :

$ ./lgraph_import --online false -help
Available command line options:
    --log               Log file to use, empty means stderr. Default="".
    -v, --verbose       Verbose level to use, higher means more verbose.
                        Default=1.
    ...
    -h, --help          Print this help message. Default=0.

Command line arguments:

-c, --config_file config_file: Import the configuration file name in the following format:
--log log_dir: Log directory. The default is an empty string, in which case the log information output to the console.
--verbose 0/1/2: Log level. The higher the log level, the more detailed the output information is. The default value is 1.
-i, --continue_on_error true/false: If an error occurs, skip the error and continue. The default is false. If an error occurs, exit.
-d, --dir {diretory}:The database directory to which the import tool will write data. Default is'./db '.
--delimiter {delimiter}: Data file separator. This is used only when the data source is in CSV format. The default is', '。
-u, --username {user}: Username of the database. You need to be an administrator to perform offline import.
-p, --password {password}: Specifies the password of the database user
--overwrite true/false: Whether to overwrite data. When set to true, the data directory overwritten if it already exists. Default to 'false'.
-g, --graph {graph_name}: Specify the kind of graph to import.
-h, --help: The help information displayed.

Offline Import Example

In this example, we use the movie-actor data described above to demonstrate the use of the import tool. The data to be imported is divided into three files: 'movies.csv', 'actors.csv', and 'roles.csv'.

'movies.csv' contains information about movies, where each movie has an id (as a primary key for retrieval), and each movie also has attributes such as title, year, and rating. (Data from IMDb).

  [movies.csv]
  id, name, year, rating
  tt0188766,King of Comedy,1999,7.3
  tt0286112,Shaolin Soccer,2001,7.3
  tt4701660,The Mermaid,2016,6.3

The jsonline format is as follows: All fields can be strings, which will be converted to the corresponding type when imported

["tt0188766","King of Comedy",1999,7.3]
["tt0286112","Shaolin Soccer",2001,7.3]
["tt4701660","The Mermaid",2016,6.3]

["tt0188766","King of Comedy","1999","7.3"]
["tt0286112","Shaolin Soccer","2001","7.3"]
["tt4701660","The Mermaid","2016","6.3"]

actors.csvIt contains information about the actors. Each actor also has an id, as well as properties such as name.

  [actors.csv]
  id, name
  nm015950,Stephen Chow
  nm0628806,Man-Tat Ng
  nm0156444,Cecilia Cheung
  nm2514879,Yuqi Zhang

The corresponding jsonline format is as follows:

["nm015950","Stephen Chow"]
["nm0628806","Man-Tat Ng"]
["nm0156444","Cecilia Cheung"]
["nm2514879","Yuqi Zhang"]

roles.csvIt contains information about which role an actor played in which movie. Each row records a character played by a given actor in a given movie, corresponding to an edge in the database.SRC_ID and DST_ID are the source and target vertices of the edge, which are the 'primary' properties defined in 'actors.csv' and 'movies.csv', respectively.

  [roles.csv]
  actor, role, movie
  nm015950,Tianchou Yin,tt0188766
  nm015950,Steel Leg,tt0286112
  nm0628806,,tt0188766
  nm0628806,coach,tt0286112
  nm0156444,PiaoPiao Liu,tt0188766
  nm2514879,Ruolan Li,tt4701660

The corresponding jsonline format is as follows:

["nm015950","Tianchou Yin","tt0188766"]
["nm015950","Steel Leg","tt0286112"]
["nm0628806",null,"tt0188766"]
["nm0628806","coach","tt0286112"]
["nm0156444","PiaoPiao Liu","tt0188766"]
["nm2514879","Ruolan Li","tt4701660"]

'configuration file import.conf', notice that there are two HEADER lines in each file, so we need to specify the 'HEADER=2' option.

{
  "schema": [
    {
      "label": "actor",
      "type": "VERTEX",
      "properties": [
        { "name": "aid", "type": "STRING" },
        { "name": "name", "type": "STRING" }
      ],
      "primary": "aid"
    },
    {
      "label": "movie",
      "type": "VERTEX",
      "properties": [
        { "name": "mid", "type": "STRING" },
        { "name": "name", "type": "STRING" },
        { "name": "year", "type": "INT16" },
        { "name": "rate", "type": "FLOAT", "optional": true }
      ],
      "primary": "mid"
    },
    {
      "label": "play_in",
      "type": "EDGE",
      "properties": [{ "name": "role", "type": "STRING", "optional": true }],
      "constraints": [["actor", "movie"]]
    }
  ],
  "files": [
    {
      "path": "actors.csv",
      "header": 2,
      "format": "CSV",
      "label": "actor",
      "columns": ["aid", "name"]
    },
    {
      "path": "movies.csv",
      "header": 2,
      "format": "CSV",
      "label": "movie",
      "columns": ["mid", "name", "year", "rate"]
    },
    {
      "path": "roles.csv",
      "header": 2,
      "format": "CSV",
      "label": "play_in",
      "SRC_ID": "actor",
      "DST_ID": "movie",
      "columns": ["SRC_ID", "role", "DST_ID"]
    }
  ]
}

Using the import configuration file, we can now import data using the following command:

$ ./lgraph_import
        -c import.conf             # Read configuration information from import.conf
        --dir /data/lgraph_db      # Store the data in /data/lgraph_db
        --graph mygraph            # Import the graph named mygraph

Notice：

If a graph named 'mygraph' already exists, the import tool will print an error message and exit. To force a graph to be overwritten, use the '--overwrite true' option.
Configuration and data files must be stored in UTF-8 encoding (or normal ASCII encoding, which is a subset of UTF-8). If any file uses an encoding other than UTF-8 (for example, UTF-8 with BOM or GBK), the import will fail and output a profiler error.

Online incremental Import

The online import mode can be used to import a batch of files into an already running instance of TuGraph. This is handy for processing incremental batch updates that typically occur at fixed intervals. The 'lgraph_import --online true' option enables the import tool to work in online mode. Like offline mode, online mode has its own set of command-line options, which can be printed using the '-h, --help' options:

$ lgraph_import --online true -h
Available command line options:
    --online            Whether to import online.
    -h, --help          Print this help message. Default=0.

Available command line options:
    --log               Log file to use, empty means stderr. Default="".
    -v, --verbose       Verbose level to use, higher means more verbose.
                        Default=1.
    -c, --config_file   Config file path.
    -r, --url           DB REST API address.
    -u, --username      DB username.
    -p, --password      DB password.
    -i, --continue_on_error
                        When we hit a duplicate uid or missing uid, should we
                        continue or abort. Default=0.
    -g, --graph         The name of the graph to import into. Default=default.
    --skip_packages     How many packages should we skip. Default=0.
    --delimiter         Delimiter used in the CSV files
    --breakpoint_continue
                        When the transmission process is interrupted,whether
                        to re-transmit from zero package next time. Default=false
    -h, --help          Print this help message. Default=0.

The configuration related to the file specified in the configuration file, and its format is exactly the same as' offline mode '. However, instead of importing the data into a local database, we are now sending the data to a running TuGraph instance, which is typically running on a different computer than the client machine on which the import tool is running. Therefore, we need to specify the URL, DB user, and password for the HTTP address of the remote machine.

If the user and password are valid, and the specified graph exists, the import tool sends the data to the server, which then parses the data and writes it to the specified graph. The data will be sent in a packet of about 16 MB, interrupted at the nearest newline. Each package imported atomically, which means that if the package is successfully imported, all data is successfully imported; otherwise, none of the data enters the database. If '--continue_on_error true' is specified, data integrity errors are ignored and offending lines are ignored. Otherwise, the import stops at the first error package and prints out the number of packages that have been imported. In this case, the user can modify the data to eliminate errors, and then use '--skip_packages N' to redo the import to skip the imported packages.