This topic describes how to import data in direct load mode by using obloader. The following table describes the command-line options for direct load.
Notice
When you use obloader for direct load, you can connect to an OBServer node directly or through OceanBase Database Proxy (ODP). The version requirements are as follows:
- In the case of direct connection, the version of the OBServer node must be OceanBase Database V4.2.0 or later.
- In the case of connection through ODP, the version of ODP must be V4.3.0 or later, and that of the OBServer node must be OceanBase Database V4.2.1 or later.
| Command-line option | Description | ApsaraDB for OceanBase & ODP | OceanBase Database & ODP | OceanBase Database |
|---|---|---|---|---|
| --direct | Specifies to enable direct load. | Required | Required | Required |
| --parallel | The degree of parallelism (DOP) on the server. The default value is 1. We recommend that the value be consistent with the number of CPU cores of the tenant. We recommend that you specify this option to ensure performance stability. |
Optional | Optional | Optional |
| --rpc-port | The inner RPC port of the server. You can obtain the RPC port as follows:
|
Required | Required | Required |
| -u(--user) | The database username. | Required | Required | Required |
| -P(--port) | The SQL port number. | Required | Required | Required |
| -t(--tenant ) | The name of a tenant in the cluster. | Optional. If this option is not specified, partition calculation may be skipped. |
Required | Required |
| -c(--cluster) | The name of the cluster. | Optional | Required | - |
| --public-cloud | Indicates that the database environment is ApsaraDB for OceanBase. | Required | - | - |
| --no-sys | Specifies that the import does not rely on the sys tenant. This option applies only to OceanBase Database of a version earlier than V4.0.0. |
Optional | Optional | Optional |
| --sys-user | The user on which the import relies in the sys tenant. If this option is not specified, the default value root takes effect. This option applies only to OceanBase Database of a version earlier than V4.0.0. |
Optional This option is mutually exclusive with the --no-sys option. |
Optional This option is mutually exclusive with the --no-sys option. |
Optional This option is mutually exclusive with the --no-sys option. |
| --sys-password | The password of the user on which the import relies in the sys tenant. This option applies only to OceanBase Database of a version earlier than V4.0.0. |
Optional This option is mutually exclusive with the --no-sys option. |
Optional This option is mutually exclusive with the --no-sys option. |
Optional This option is mutually exclusive with the --no-sys option. |
Example
Run the following command to import data to OceanBase Database:
bin/obloader -h xx.x.x.xx -P 2883 -u TPCH -t oboracle -c OB4216 -p -D TPCH --table LINEITEM --external-data --csv -f /data/1/tpch/s4/bak/ --truncate-table --column-separator='|' --thread 16 --rpc-port=2885 --direct --parallel=16
In direct load mode, you connect to an SQL port (2881 by default) instead of an RPC port (2882 by default) of the OBServer node. If obloader connects to the OBServer node directly instead of using ODP, you can specify another RPC port for direct load. The RPC port for ODP V4.3.0 is 2885. However, we recommend that you do not get around ODP in a production environment, because the location of the leader of business data is transparent to the client. The client does not need to know the OBServer node where the data resides, and ODP is responsible for SQL routing. When you directly connect to an OBServer node to write data in direct load mode, a transaction that spans across servers is produced if the leader does not reside on this OBServer node.
The preceding command imports all supported CSV data files in the /data/1/tpch/s4/bak/ directory to the LINEITEM table. If the size of data files to be imported exceeds 1 TB, we recommend that you use direct load to improve the import efficiency. Direct load in combination with higher tenant resource specifications can achieve even higher import efficiency.
Direct load parameters
You can configure direct load parameters in the session.config.json file in the {ob-loader-dumper}/conf directory.
Here is an example:
"direct_path_load": {
"rpc_connect_timeout": "15000",
"rpc_execute_timeout": "20000",
"runtime_retry_times": "5",
"runtime_retry_intervals": "50",
"task_timeout": "2592000000000",
"heartbeat_timeout": "60000000"
}
rpc_connect_timeout: the timeout period for establishing an RPC connection, in milliseconds.rpc_execute_timeout: the timeout period for executing an RPC request, in milliseconds.runtime_retry_times: the maximum number of retries allowed. If an import operation fails, it will be retried based on this parameter.runtime_retry_intervals: the retry interval, namely, the amount of time to wait before a retry is triggered, in milliseconds.task_timeout: the timeout period of an import operation, in μs. If an import operation is not completed within the configured period, it is considered timed out. The default value is0, which indicates that the import does not time out.heartbeat_timeout: the heartbeat timeout period, which is used to detect the active status of an import operation, in milliseconds. The default value is0, which specifies not to enable heartbeat detection.
Considerations
In direct load, the RPC port instead of the SQL port is used to transmit data.
Data to be imported is submitted at the table level instead of the session or transaction level.
Retry or resumable transmission is not supported.
Bit data types are not supported.
Virtual generated columns are not supported.
We recommend that you do not import a small amount of data in direct load mode.
At present, write requests in direct load are not idempotent. In rare cases, packet retransmission may result in duplicate data when packets are finally sorted. Duplicate data may occur when you import a heap table without indexes.
The
--replace-dataoption cannot help address unique index conflicts.The differences between the
--threadand--paralleloptions are as follows:--threadspecifies the thread pool for connections from the client to the server, and is maintained on the client.--parallelspecifies the number of worker threads that can be called by the OBServer node for data writing and sorting.We recommend that you specify consistent values for
--threadand--parallel.