Introduction to DataX
DataX is an offline data synchronization tool and platform widely used within Alibaba Group. It efficiently synchronizes data between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, Hadoop Distributed File System (HDFS), Hive, ADS, HBase, Table Store (formerly known as OTS), MaxCompute (formerly known as ODPS), Distributed Relational Database Service (DRDS), and OceanBase Database.
As a data synchronization framework, DataX abstracts synchronization between different data sources into a Reader plug-in, which reads data from the source, and a Writer plug-in, which writes data to the destination. In theory, DataX supports synchronization between all types of data sources. The DataX plug-ins form an ecosystem: once a new data source joins the ecosystem, it can immediately exchange data with every existing data source.
The source code of DataX is available on GitHub at github.com/Alibaba/datax. The open-source version of DataX does not support OceanBase Database or DB2. The OceanBase product team provides the Reader and Writer plug-ins for OceanBase Database and DB2.
Instructions for DataX
The default installation directory of DataX is /home/admin/datax3. By default, the job folder in this directory stores the configuration files of data migration tasks. You can also store the configuration files in a custom directory.
The parameter file for each task is in JSON format and defines a reader and a writer. The job folder contains a sample configuration file, job.json.
[admin /home/admin/datax3/job]
$cat job.json
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {
                                "value": "DataX",
                                "type": "string"
                            },
                            {
                                "value": 19890604,
                                "type": "long"
                            },
                            {
                                "value": "1989-06-04 00:00:00",
                                "type": "date"
                            },
                            {
                                "value": true,
                                "type": "bool"
                            },
                            {
                                "value": "test",
                                "type": "bytes"
                            }
                        ],
                        "sliceRecordCount": 100000
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": false,
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}
The reader and writer for this task are a Stream Reader and a Stream Writer, which require no external data source. The task is intended to check whether DataX is properly installed. Make sure that a Java Development Kit (JDK) runtime environment is installed before you run job.json.
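A job that moves real data follows the same reader/writer structure, with the stream plug-ins replaced by plug-ins for actual data sources. The following is an illustrative sketch only: the mysqlreader and mysqlwriter plug-in names come from open-source DataX, and all connection values (usernames, passwords, table names, and JDBC URLs) are placeholders rather than real endpoints.

```json
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "test_user",
                        "password": "******",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["source_table"],
                                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "test_user",
                        "password": "******",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["target_table"],
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target_db"
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```

Such a configuration file can be stored in the job folder (or any custom directory) and run with bin/datax.py in the same way as the sample task below.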
[admin@h07g12092.sqa.eu95 /home/admin/datax3/job]
$cd ../
[admin@h07g12092.sqa.eu95 /home/admin/datax3]
$bin/datax.py job/job.json
The output is as follows:
2021-03-16 16:51:42.030 [job-0] INFO JobContainer -
=== total summarize info ===
1. all phase average time info and max time task info:
PHASE | AVERAGE USED TIME | ALL TASK NUM | MAX USED TIME | MAX TASK ID | MAX TASK INFO
TASK_TOTAL | 0.402s | 1 | 0.402s | 0-0-0 | null
READ_TASK_INIT | 0.001s | 1 | 0.001s | 0-0-0 | null
READ_TASK_PREPARE | 0.001s | 1 | 0.001s | 0-0-0 | null
READ_TASK_DATA | 0.091s | 1 | 0.091s | 0-0-0 | null
READ_TASK_POST | 0.000s | 1 | 0.000s | 0-0-0 | null
READ_TASK_DESTROY | 0.000s | 1 | 0.000s | 0-0-0 | null
WRITE_TASK_INIT | 0.001s | 1 | 0.001s | 0-0-0 | null
WRITE_TASK_PREPARE | 0.001s | 1 | 0.001s | 0-0-0 | null
WRITE_TASK_DATA | 0.290s | 1 | 0.290s | 0-0-0 | null
WRITE_TASK_POST | 0.000s | 1 | 0.000s | 0-0-0 | null
WRITE_TASK_DESTROY | 0.000s | 1 | 0.000s | 0-0-0 | null
WAIT_READ_TIME | 0.052s | 1 | 0.052s | 0-0-0 | null
WAIT_WRITE_TIME | 0.025s | 1 | 0.025s | 0-0-0 | null
2. record average count and max count task info :
PHASE | AVERAGE RECORDS | AVERAGE BYTES | MAX RECORDS | MAX RECORD`S BYTES | MAX TASK ID | MAX TASK INFO
READ_TASK_DATA | 100000 | 2.60M | 100000 | 2.60M | 0-0-0 | null
2021-03-16 16:51:42.030 [job-0] INFO MetricReportUtil - reportJobMetric is turn off
2021-03-16 16:51:42.031 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 2.48MB/s, 100000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.025s | All Task WaitReaderTime 0.052s | Percentage 100.00%
2021-03-16 16:51:42.032 [job-0] INFO LogReportUtil - report datax log is turn off
2021-03-16 16:51:42.032 [job-0] INFO JobContainer -
Time of task startup : 2021-03-16 16:51:40
Time of task end : 2021-03-16 16:51:42
Total time elapsed : 1s
Average traffic of the task : 2.48MB/s
Record writing speed : 100000rec/s
Total records read : 100000
Total read/write failures : 0
Note
No data records are displayed in the output because the print parameter of the Stream Writer is set to false in the default job.json, which disables data output.
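To see the generated records on standard output, you can set the print parameter of the Stream Writer in job.json to true, keeping the rest of the sample configuration unchanged:

```json
"writer": {
    "name": "streamwriter",
    "parameter": {
        "print": true,
        "encoding": "UTF-8"
    }
}
```

With this change, rerunning bin/datax.py job/job.json prints the 100,000 sample records defined by the Stream Reader in addition to the summary statistics.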