Garbled characters in the Latin1 character set|V4.2.11|OceanBase Migration Service| docs|Distributed Database

Garbled characters in the Latin1 character set

Last Updated: 2026-02-13 03:11:02

Background information

Latin1 (officially known as ISO-8859-1) is an 8-bit character encoding standard that is part of the ISO/IEC 8859 series. The Latin1 character set has the following basic characteristics:

- Encoding range: 0x00-0xFF (256 characters).
- Supported languages: Western European languages (including English, French, German, Spanish, and Italian).
- ASCII compatibility: 0x00-0x7F is the same as ASCII.

Why does the Latin1 character set cause garbled characters?

The Latin1 character set is a single-byte encoding and does not support Chinese. Chinese text can be kept in a Latin1 field only as the raw bytes of another character set, so when you migrate Latin1 data, you need to know which character set those bytes actually belong to. It is usually UTF-8. The process of storing Chinese text in a Latin1 field and reading it back is as follows:

Obtain the UTF-8 encoded byte array of the Chinese text.
Store the byte array in a field defined by the Latin1 character set.
When reading, retrieve the content of the field as a byte array.
Convert the byte array based on the UTF-8 encoding rules to obtain the correct Chinese text.
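The round trip described above can be sketched in plain Java. This is a minimal illustration rather than OMS code: the class and method names are hypothetical, and it assumes the stored bytes are UTF-8. It relies on the fact that ISO-8859-1 maps every byte value to exactly one character, so no bytes are lost while the text sits in a Latin1 field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OMS.
class Latin1RoundTrip {

    // Steps 1-2: take the UTF-8 bytes of the text and view them as Latin1.
    // A Latin1-aware reader sees one (garbled) character per stored byte.
    static String storeAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Steps 3-4: get the raw bytes back, then decode them with the
    // character set they were really written in (assumed to be UTF-8).
    static String recover(String latin1View) {
        byte[] raw = latin1View.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, 3 UTF-8 bytes each
        String garbled = storeAsLatin1(original);
        System.out.println(garbled.length());                  // 6
        System.out.println(recover(garbled).equals(original)); // true
    }
}
```

Decoding the same bytes with any character set other than the one used to write them is what produces the garbled output.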

OMS Community Edition solution

Two new parameters, latin1byte and names, are added for the source and the sink. The supported data sources include MySQL, OceanBase, and TiDB.

latin1byte specifies the character set of the bytes that are actually stored in the Latin1 fields.
names specifies whether to execute set names 'character set' when OMS connects to the database.
- By default, the program sets this automatically. For an OceanBase database, you do not need to set this parameter. For a MySQL database, you need to set this parameter to utf8.
- If you set names="null", set names is not executed. Here, null is a string, not an empty value.
  
  set names is skipped because, when a field of the Latin1 character set exists, getBytes cannot return the original byte array of the Latin1 field even if set names is executed.
- For any other value, set names 'character set specified by names' is executed.
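As a rough illustration of the rules for an explicitly configured names value, the following sketch returns the statement that would be executed on connect. The class and method are hypothetical and do not reflect the actual OMS implementation. The automatic default (parameter not set) is deliberately not modeled.

```java
import java.util.Optional;

// Hypothetical helper, not part of OMS.
class SetNamesSketch {

    // Returns the statement to run after connecting, or empty when
    // names is the literal string "null" (set names is skipped).
    static Optional<String> setNamesStatement(String names) {
        if ("null".equals(names)) {
            return Optional.empty();
        }
        return Optional.of("SET NAMES '" + names + "'");
    }

    public static void main(String[] args) {
        System.out.println(setNamesStatement("null").isPresent()); // false
        System.out.println(setNamesStatement("utf8").get());       // SET NAMES 'utf8'
    }
}
```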

Read data

Specify the character set of the bytes that are actually stored in the Latin1 fields. When reading data, call getBytes and then new String(bytes, "character set") to obtain the correct string.
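A minimal JDBC sketch of this read path might look as follows. The class and method names are hypothetical, and it assumes the configured character set is given as a name that Java's Charset.forName accepts (for example UTF-8); it is not taken from the OMS source.

```java
import java.nio.charset.Charset;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of OMS.
class Latin1Reader {

    // Decode the raw bytes of a Latin1 field with the character set
    // the data was really written in.
    static String decode(byte[] raw, String actualCharset) {
        if (raw == null) {
            return null; // SQL NULL stays null
        }
        return new String(raw, Charset.forName(actualCharset));
    }

    // Read a Latin1 column as bytes instead of calling getString,
    // so that no driver-side character conversion is applied.
    static String readField(ResultSet rs, String column, String actualCharset)
            throws SQLException {
        return decode(rs.getBytes(column), actualCharset);
    }
}
```

For example, decode(new byte[] {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}, "UTF-8") yields the single character U+4E2D.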

Write data

This setting takes effect only when both the source and target fields are of the Latin1 character set.

Full migration

After reading data, the source stores it as a string and passes it to the sink. The sink obtains the byte array in the actual character set of the Latin1 data by calling getBytes("actual character set of Latin1"), and then writes the byte array to the target by using setBytes.

Incremental synchronization

The incremental synchronization component consumes data from the store and directly uses setBytes to write the consumed bytes to the target.
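The two write paths can be sketched with JDBC as follows. This is an illustration under assumptions, not the OMS implementation: the class and method names are hypothetical, and the character set name is assumed to be one that Charset.forName accepts.

```java
import java.nio.charset.Charset;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper, not part of OMS.
class Latin1Writer {

    // Re-encode the string into the bytes that belong in the Latin1 field.
    static byte[] encode(String value, String actualCharset) {
        return value == null ? null : value.getBytes(Charset.forName(actualCharset));
    }

    // Full migration: the sink receives a string, converts it to bytes
    // in the actual character set, and binds the bytes with setBytes.
    static void bindFull(PreparedStatement ps, int index, String value,
                         String actualCharset) throws SQLException {
        ps.setBytes(index, encode(value, actualCharset));
    }

    // Incremental synchronization: the consumed bytes are bound unchanged.
    static void bindIncremental(PreparedStatement ps, int index, byte[] consumed)
            throws SQLException {
        ps.setBytes(index, consumed);
    }
}
```

Binding bytes rather than a string keeps the driver from re-encoding the value as Latin1 on the way to the target.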

Configure scenarios for data migration tasks

The following configurations assume that the actual character set of the Latin1 data is UTF-8, that is, the Latin1 fields in the source store UTF-8 encoded bytes. Adjust the configurations to your actual situation, following the source and sink configuration principles described in the OMS Community Edition solution section.

MySQL to OceanBase

Full migration

All data is migrated and verified normally with the default configurations.

Incremental synchronization

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

Reverse increment

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

OceanBase to MySQL

Full migration

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to both Source and Sink.

Incremental synchronization

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

Reverse increment

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

TiDB to OceanBase

Full migration

- Garbled characters occur with the default configurations. When data is read from TiDB, Latin1 fields are by default read with getBytes and then decoded with new String(bytes, "utf8"). You can add the latin1byte=utf8 configuration to Sink to resolve the garbled characters.
- Full data verification fails with the default configurations. You can add the task.sourceImageSection.latin1byte=utf8 configuration.

Incremental synchronization

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

Reverse increment

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

OceanBase to OceanBase

Full migration

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to both Source and Sink.

Incremental synchronization

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

Reverse increment

The default configurations result in garbled characters. You can add the latin1byte=utf8 configuration to Sink.

Direct load

Direct load does not support writing data to fields of the Latin1 character set. An exception is reported if you attempt it.
