Character data types store character (alphanumeric) data, such as words and free-form text, in a database character set or national character set. Compared with other data types, character data types have fewer attributes. This topic provides an overview of the character data types supported in the current version of OceanBase Database and an introduction to character sets.
Overview of character data types
Character data is stored as strings whose byte values are interpreted according to one of the character sets specified when the database was created. OceanBase Database supports both single-byte and multi-byte character sets.
Note
Columns of character data types can store all alphanumeric values, but columns of the NUMBER data type can store only numeric values.
The following table describes the character data types supported in the current version of OceanBase Database.
| Data type | Length type | Usage | Length description |
|---|---|---|---|
| CHAR(size [BYTE \| CHAR]) | Fixed-length | High index efficiency. Use the trim function to remove extra spaces in the program. | The value of parameter size ranges from 1 to 2000, with a default value and minimum value of 1. The storage size is size bytes or characters. |
| NCHAR[(size)] | Fixed-length | Uses the Unicode character set (each character is represented by two bytes). | The value of parameter size ranges from 1 to 2000, with a default value and minimum value of 1. The storage size is twice the value of size. |
| NVARCHAR2(size) | Variable-length | Uses the Unicode character set (each character is represented by two bytes). | The value of parameter size ranges from 1 to 32767. The storage size is twice the number of input characters. |
| VARCHAR2(size [BYTE \| CHAR]) | Variable-length | Stores variable-length strings in the database character set. | The value of parameter size ranges from 1 to 32767. The storage size is the actual length of the input data in bytes or characters, not size bytes or characters. |
| VARCHAR(size [BYTE \| CHAR]) | Variable-length | In OceanBase Database, VARCHAR and VARCHAR2 are the same. | The value of parameter size ranges from 1 to 32767. The storage size is the actual length of the input data in bytes, not size bytes. |
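The difference between the fixed-length and variable-length rows can be sketched outside the database. The following Python snippet is only an illustration: `store_char` and `store_varchar2` are hypothetical helpers that imitate character-counted columns, not OceanBase APIs.

```python
def store_char(value: str, size: int) -> str:
    """Imitate a fixed-length CHAR(size CHAR) column (hypothetical helper)."""
    if len(value) > size:
        raise ValueError(f"value has {len(value)} characters, column holds {size}")
    # Fixed-length storage: pad on the right with spaces up to size characters.
    return value.ljust(size)


def store_varchar2(value: str, size: int) -> str:
    """Imitate a variable-length VARCHAR2(size CHAR) column (hypothetical helper)."""
    if len(value) > size:
        raise ValueError(f"value has {len(value)} characters, column holds {size}")
    # Variable-length storage: keep only what was entered.
    return value


fixed = store_char("abc", 10)
variable = store_varchar2("abc", 10)

print(len(fixed))             # 10
print(len(variable))          # 3
print(repr(fixed.rstrip()))   # 'abc'
```

The last line shows why the table suggests trimming CHAR values in the program: the padded value does not compare equal to the original string until the trailing spaces are removed.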
Note
For the CHAR and VARCHAR2 data types, you can specify the length semantics (BYTE or CHAR) explicitly. If you do not specify it, the default is determined by the system variable NLS_LENGTH_SEMANTICS.
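The effect of the length semantics can be illustrated with a small Python sketch. It assumes a UTF-8 database character set, which this topic does not state, and `fits` is a hypothetical helper, not a database function.

```python
def fits(value: str, size: int, semantics: str = "BYTE",
         encoding: str = "utf-8") -> bool:
    """Check whether value fits a column of the given size (hypothetical helper).

    encoding stands in for the database character set; UTF-8 is an assumption.
    """
    if semantics == "CHAR":
        # CHAR semantics: the limit counts characters.
        return len(value) <= size
    if semantics == "BYTE":
        # BYTE semantics: the limit counts encoded bytes.
        return len(value.encode(encoding)) <= size
    raise ValueError("semantics must be 'BYTE' or 'CHAR'")


english = "database"    # 8 characters, 8 bytes in UTF-8
chinese = "数据库系统"    # 5 characters, 15 bytes in UTF-8

print(fits(english, 10, "BYTE"))   # True
print(fits(chinese, 10, "CHAR"))   # True
print(fits(chinese, 10, "BYTE"))   # False
```

Under these assumptions, the same five-character Chinese string fits a column declared with a size of 10 characters but not one declared with a size of 10 bytes.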
Overview of character sets
Unicode character set
Unicode is a character set that assigns a code point to each character. Its encoding methods include UTF-8, UTF-16, and UTF-32, as well as compression schemes. The encoding method determines how many bytes a character occupies, so Chinese and English characters can require different amounts of space depending on the encoding method.
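As a quick illustration of how the encoding method changes the storage size, the following Python sketch encodes three sample characters with the standard codecs. The sample characters are chosen here for illustration only.

```python
# Explicit little-endian codecs are used so that no byte order mark is added.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le"]

SAMPLES = {
    "English letter": "A",                       # U+0041
    "Chinese character": "中",                    # U+4E2D
    "Supplementary character": "\U00020000",     # U+20000, outside the basic plane
}


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for label, ch in SAMPLES.items():
    print(label, encoded_sizes(ch))
# English letter {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# Chinese character {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# Supplementary character {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```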
The following table describes the three encoding methods.
| Encoding method | Number of encoding bytes | BOM | Advantages | Disadvantages |
|---|---|---|---|---|
| UTF-8 | Variable-length encoding. Single-byte (ASCII characters) or multi-byte (non-ASCII characters). The minimum code unit is 8 bits. | No byte order: If the beginning of a text contains the byte stream EF BB BF, it indicates that the text is encoded in UTF-8. | An ideal Unicode encoding method: fully compatible with ASCII encoding; no byte order; strong self-synchronization and error-correction capabilities, suitable for network transmission and communication; good extensibility. | Variable-length encoding is not convenient for internal program processing. |
| UTF-16 | Two or four bytes. The minimum code unit is 16 bits. | Byte order: UTF-16LE (little-endian) is represented by FF FE, and UTF-16BE (big-endian) is represented by FE FF. | The earliest Unicode encoding method, which has been widely used in many environments. Suitable for Unicode processing in memory. Many APIs of programming languages use this encoding method for the STRING type. | Not compatible with ASCII encoding. Supplementary plane code points are encoded using surrogate pairs, making the encoding complex. Poor extensibility. |
| UTF-32 | Fixed-length encoding. Four bytes. The minimum code unit is 32 bits. | Byte order: UTF-32LE (little-endian) is represented by FF FE 00 00, and UTF-32BE (big-endian) is represented by 00 00 FE FF. | Fixed-length encoding is convenient for reading and internal program processing. Each Unicode code point corresponds to a code unit. | All characters are encoded in fixed-length four-byte format, which wastes storage space and bandwidth. Not compatible with ASCII encoding. Poor extensibility. Rarely used in practice. |
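The byte order marks in the BOM column can be checked programmatically. The following Python sketch uses the BOM constants from the standard `codecs` module; `detect_bom` is a hypothetical helper written for this illustration.

```python
import codecs

# UTF-32 marks are checked before UTF-16 marks because the UTF-32LE mark
# (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),   # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32BE"),   # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))   # UTF-8
print(detect_bom(b"\xff\xfeh\x00"))       # UTF-16LE
print(detect_bom(b"\xfe\xff\x00h"))       # UTF-16BE
print(detect_bom(b"hello"))               # None
```

A missing mark does not prove anything about the encoding, so a check like this is only a hint. UTF-16LE text that starts with a NUL character right after its mark would also be reported as UTF-32LE by this simple prefix test.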
Database character set
The database character set is used for the following purposes:
To store data of the CHAR, VARCHAR2, and CLOB data types.
To identify database objects such as table names, column names, and PL/SQL variables.
To store SQL and PL/SQL code.
National character set
The national character set is used to store data of the NCHAR, NVARCHAR2, and NCLOB data types.
The national character set is an additional character set selected for OceanBase Database. It lets a database store data in the database character set through the CHAR data type and in the national character set through the NCHAR data type, which broadens its character processing capabilities.
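The two-bytes-per-character storage described for NCHAR and NVARCHAR2 can be sketched in Python. This is an illustration under an assumption: it treats the national character set as a UTF-16 encoding, which this topic does not name explicitly. Both functions are hypothetical helpers.

```python
def nchar_storage_bytes(value: str, size: int) -> int:
    """Bytes used by a blank-padded NCHAR(size) value, assuming UTF-16 storage."""
    if len(value) > size:
        raise ValueError(f"value has {len(value)} characters, column holds {size}")
    # Fixed-length: pad to size characters, then encode without a byte order mark.
    return len(value.ljust(size).encode("utf-16-be"))


def nvarchar2_storage_bytes(value: str) -> int:
    """Bytes used by an NVARCHAR2 value, assuming UTF-16 storage."""
    # Variable-length: only the characters actually entered are stored.
    return len(value.encode("utf-16-be"))


print(nchar_storage_bytes("数据", 5))    # 10, twice the declared size
print(nvarchar2_storage_bytes("数据"))   # 4, twice the number of input characters
```

The two-byte figure holds for characters in the basic multilingual plane. In UTF-16, supplementary characters are encoded as surrogate pairs and take four bytes each.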
