Background information
Latin1 (officially known as ISO-8859-1) is an 8-bit character encoding standard that is part of the ISO/IEC 8859 series. The Latin1 character set has the following basic characteristics:
Encoding range: 0x00-0xFF (256 code points, one byte per character).
Supported languages: Western European languages (including English, French, German, Spanish, and Italian).
ASCII compatibility: 0x00-0x7F is the same as ASCII.
Why does the Latin1 character set cause garbled characters?
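The Latin1 character set is a single-byte encoding standard and does not support Chinese. However, a Latin1 field stores whatever bytes are written to it, so Chinese text can end up in such a field as the bytes of another encoding. If those bytes are later decoded as Latin1, the result is garbled. Therefore, when you migrate data of the Latin1 character set, you need to know which character set's byte array is actually stored in the Latin1 fields. It is usually UTF-8. The process of storing Chinese in the Latin1 character set and reading it back is as follows: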
Obtain the UTF-8 encoded byte array of the Chinese text.
Store the byte array in a field defined by the Latin1 character set.
When reading, obtain the value of the Latin1 field as a byte array.
Decode the byte array based on the UTF-8 encoding rules to obtain the correct Chinese text.
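The four steps above can be sketched in plain Java without a database. This is only an illustration: the class and method names (Latin1RoundTrip, storeAsLatin1, readFromLatin1) are hypothetical, and the Latin1 field is simulated with an ISO-8859-1 string rather than a real column.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Simulates what a Latin1 field holds after UTF-8 bytes are written into it:
    // every byte is kept unchanged and appears as one Latin1 character.
    static String storeAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original text: Latin1 characters -> raw bytes -> UTF-8 decode.
    static String readFromLatin1(String stored) {
        byte[] rawBytes = stored.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String stored = storeAsLatin1(original);
        // Read as Latin1, the value is six unrelated characters (garbled text).
        System.out.println(stored.length()); // 6
        // Decoding the same bytes as UTF-8 restores the original text.
        System.out.println(readFromLatin1(stored).equals(original)); // true
    }
}
```

The garbled text appears whenever the stored bytes are interpreted as Latin1 characters instead of being decoded with the character set they were written in.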
OMS Community Edition solution
Two new parameters, latin1byte and names, are added for the source and the sink. The supported data sources include MySQL, OceanBase, and TiDB.
latin1byte specifies the actual character set of Latin1.
names specifies whether to execute set names 'character set' when you connect to the database.
By default, automatic settings are used in the program. For an OceanBase database, you do not need to set this parameter. For a MySQL database, you need to set this parameter to utf8.
Set names="null". null is a string that indicates that set names is not executed.
The reason for not executing set names is that when a field of the Latin1 character set exists, even if set names is executed, you cannot obtain the original Latin1 character set byte array by using getBytes.
For other character sets, set names 'character set specified by names' is executed.
Read data
Specify the character set in which the data stored in Latin1 fields is actually encoded. When data is read, getBytes is called first, and then new String(bytes, "character set") is used to obtain the correct string.
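A minimal sketch of this read path is shown below. It is not the OMS implementation: the class Latin1Reader and the method decodeField are hypothetical, and the raw bytes are produced in memory here, whereas in a JDBC program they would come from ResultSet.getBytes. The charset is written as "UTF-8", the canonical Java name for the value configured as utf8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Reader {
    // Decodes the raw bytes of a Latin1 field with the character set the data
    // was actually written in (the value assumed to be configured as latin1byte).
    static String decodeField(byte[] rawBytes, String actualCharset) {
        if (rawBytes == null) {
            return null; // a SQL NULL stays null
        }
        return new String(rawBytes, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // Stand-in for resultSet.getBytes(columnIndex) on a Latin1 field.
        byte[] rawBytes = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        String text = decodeField(rawBytes, "UTF-8");
        System.out.println(text.equals("\u4e2d\u6587")); // true
    }
}
```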
Write data
This setting takes effect only when both the source and target fields are of the Latin1 character set.
Full migration
After the source reads data, it stores the data as a string and passes it to the sink. The sink obtains the byte array of the actual Latin1 character set by using getBytes("actual character set of Latin1") and then writes the byte array to the target by using setBytes.
Incremental synchronization
The incremental synchronization component consumes data from the store and directly uses setBytes to write the consumed bytes to the target.
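The full-migration write path described above might look roughly like the following. This is a sketch under assumptions, not the OMS code: Latin1Writer, encodeField, and writeField are hypothetical names, and only the standard JDBC call PreparedStatement.setBytes is taken as given. For incremental synchronization the encoding step would be skipped, because the consumed value is already a byte array.

```java
import java.nio.charset.Charset;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Latin1Writer {
    // Turns the string received from the source back into the bytes of the
    // actual character set (the value assumed to be configured as latin1byte).
    static byte[] encodeField(String value, String actualCharset) {
        if (value == null) {
            return null;
        }
        return value.getBytes(Charset.forName(actualCharset));
    }

    // Binds the encoded bytes to a statement parameter so that the target
    // Latin1 field receives them unchanged.
    static void writeField(PreparedStatement stmt, int index,
                           String value, String actualCharset) throws SQLException {
        stmt.setBytes(index, encodeField(value, actualCharset));
    }

    public static void main(String[] args) {
        byte[] bytes = encodeField("\u4e2d\u6587", "UTF-8");
        System.out.println(bytes.length); // 6: three UTF-8 bytes per character
    }
}
```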
Configure scenarios for data migration tasks
MySQL to OceanBase
Full migration
All data is migrated and verified normally with the default configurations.
Incremental synchronization
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
Reverse increment
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
OceanBase to MySQL
Full migration
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to both Source and Sink.
Incremental synchronization
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
Reverse increment
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
TiDB to OceanBase
Full migration
Garbled characters occur with the default configurations. When TiDB reads data, Latin1 character fields use getBytes and then new String(bytes,utf8) by default. You can add the latin1byte=utf8 configuration to Sink to resolve the garbled characters.
Full data verification fails with the default configurations. You can add the task.sourceImageSection.latin1byte=utf8 configuration.
Incremental synchronization
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
Reverse increment
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
OceanBase to OceanBase
Full migration
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to both Source and Sink.
Incremental synchronization
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
Reverse increment
The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.
Direct load
Direct load does not support writing data to the Latin1 character set. An exception will be reported.