Introduction
DataX is an offline data synchronization tool and platform that is widely used within Alibaba Group. It efficiently synchronizes data between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, Hadoop Distributed File System (HDFS), Hive, ADS, HBase, Table Store (OTS), MaxCompute (formerly known as ODPS), Distributed Relational Database Service (DRDS), and OceanBase Database.
As a data synchronization framework, DataX abstracts the synchronization between different data sources into a Reader plug-in that reads data from the source and a Writer plug-in that writes data to the destination. In theory, DataX can synchronize data between any pair of supported data source types. The DataX plug-ins form an ecosystem: once a new data source joins the ecosystem, it can immediately exchange data with every data source already in it.
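For example, a job that copies a MySQL table into another MySQL database simply pairs the mysqlreader plug-in with the mysqlwriter plug-in. The following is a minimal sketch of such a job configuration; the credentials, table names, and JDBC URLs are placeholders, and the exact parameter layout may vary across plug-in versions:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "src_user",
                        "password": "******",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["t_source"],
                                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "dst_user",
                        "password": "******",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["t_target"],
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target_db"
                            }
                        ]
                    }
                }
            }
        ]
    }
}
Swapping in a different writer plug-in (for example, one that writes to HDFS or OceanBase) only changes the writer block; the reader side stays the same.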
The source code of DataX is available on GitHub at: github.com/Alibaba/datax. The open-source version does not provide plug-ins for OceanBase or DB2; the OceanBase product team provides the Reader and Writer plug-ins for these two databases.
Examples of using DataX
By default, DataX is installed in the /home/admin/datax3 directory. You can find a job folder in this directory, which contains configuration files for data migration tasks. You can also customize the installation directory of DataX.
The configuration file for each task is in JSON format and defines a reader and a writer. The job folder contains a sample configuration file named job.json.
[admin /home/admin/datax3/job]
$cat job.json
{
    "job": {
        "setting": {
            "speed": {
                "byte": 10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {
                                "value": "DataX",
                                "type": "string"
                            },
                            {
                                "value": 19890604,
                                "type": "long"
                            },
                            {
                                "value": "1989-06-04 00:00:00",
                                "type": "date"
                            },
                            {
                                "value": true,
                                "type": "bool"
                            },
                            {
                                "value": "test",
                                "type": "bytes"
                            }
                        ],
                        "sliceRecordCount": 100000
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": false,
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}
The reader and writer for this task are the Stream Reader and Stream Writer plug-ins. Running it verifies that DataX is properly installed and that the JDK runtime environment is available.
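In the setting section of this sample, speed.byte caps the synchronization throughput at 10,485,760 bytes per second (10 MiB/s), and errorLimit bounds the number (record) and ratio (percentage) of dirty records the job tolerates. According to the public DataX documentation, throughput can also be controlled by a channel (concurrency) count instead of a byte limit; the fragment below is an illustrative sketch of such a setting block, not part of the shipped sample:
    "setting": {
        "speed": {
            "channel": 3
        },
        "errorLimit": {
            "record": 0,
            "percentage": 0.02
        }
    }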
[admin@h07g12092.sqa.eu95 /home/admin/datax3/job]
$cd ../
[admin@h07g12092.sqa.eu95 /home/admin/datax3]
$bin/datax.py job/job.json
Result:
2021-03-16 16:51:42.030 [job-0] INFO JobContainer -
=== total summarize info ===
1. all phase average time info and max time task info:
PHASE | AVERAGE USED TIME | ALL TASK NUM | MAX USED TIME | MAX TASK ID | MAX TASK INFO
TASK_TOTAL | 0.402s | 1 | 0.402s | 0-0-0 | null
READ_TASK_INIT | 0.001s | 1 | 0.001s | 0-0-0 | null
READ_TASK_PREPARE | 0.001s | 1 | 0.001s | 0-0-0 | null
READ_TASK_DATA | 0.091s | 1 | 0.091s | 0-0-0 | null
READ_TASK_POST | 0.000s | 1 | 0.000s | 0-0-0 | null
READ_TASK_DESTROY | 0.000s | 1 | 0.000s | 0-0-0 | null
WRITE_TASK_INIT | 0.001s | 1 | 0.001s | 0-0-0 | null
WRITE_TASK_PREPARE | 0.001s | 1 | 0.001s | 0-0-0 | null
WRITE_TASK_DATA | 0.290s | 1 | 0.290s | 0-0-0 | null
WRITE_TASK_POST | 0.000s | 1 | 0.000s | 0-0-0 | null
WRITE_TASK_DESTROY | 0.000s | 1 | 0.000s | 0-0-0 | null
WAIT_READ_TIME | 0.052s | 1 | 0.052s | 0-0-0 | null
WAIT_WRITE_TIME | 0.025s | 1 | 0.025s | 0-0-0 | null
2. record average count and max count task info :
PHASE | AVERAGE RECORDS | AVERAGE BYTES | MAX RECORDS | MAX RECORD`S BYTES | MAX TASK ID | MAX TASK INFO
READ_TASK_DATA | 100000 | 2.60M | 100000 | 2.60M | 0-0-0 | null
2021-03-16 16:51:42.030 [job-0] INFO MetricReportUtil - reportJobMetric is turn off
2021-03-16 16:51:42.031 [job-0] INFO StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 2.48MB/s, 100000 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.025s | All Task WaitReaderTime 0.052s | Percentage 100.00%
2021-03-16 16:51:42.032 [job-0] INFO LogReportUtil - report datax log is turn off
2021-03-16 16:51:42.032 [job-0] INFO JobContainer -
Time of task startup : 2021-03-16 16:51:40
Time of task end : 2021-03-16 16:51:42
Total time elapsed : 1s
Average traffic of the task : 2.48MB/s
Record writing speed : 100000rec/s
Total records read : 100000
Total read/write failures : 0
Note
No data records are printed to the console because the sample configuration sets print to false in the Stream Writer parameters.
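To see the generated records on the console, you can set print to true in the Stream Writer parameters and run the job again. A minimal sketch of the modified writer block, based on the sample configuration above:
    "writer": {
        "name": "streamwriter",
        "parameter": {
            "print": true,
            "encoding": "UTF-8"
        }
    }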