The editDistance(string1, string2) and editDistanceUTF8(string1, string2) functions calculate the minimum number of single-character insertions, deletions, or substitutions required to transform one string into the other, using the Levenshtein edit distance.
Note
These functions were introduced in V4.6.0 as compatible functions of ClickHouse. By default, they are disabled. To use them, set the tenant-level parameter sql_func_extension_mode to ClickHouse. For more information, see sql_func_extension_mode.
Syntax
editDistance(string1, string2)
editDistanceUTF8(string1, string2)
Parameters
Parameter |
Description |
|---|---|
string1、string2 |
Two strings to compare. |
Return type
BIGINT UNSIGNED type, indicating the edit distance between string1 and string2.
Behavior description
editDistancecalculates distances in bytes, which is suitable for ASCII or single-byte character sets. For multi-byte character sets (like UTF-8), a single Chinese character is counted as multiple bytes in the differences.editDistanceUTF8first decodes the string into characters (code points) using UTF-8, then calculates the edit distance based on characters (treating each multi-byte character as a single character).- If any input is
NULL, the result isNULL. - If either input string is an empty string, the result is the length of the other string (in bytes for
editDistanceand in characters foreditDistanceUTF8). - Each parameter has a maximum length of 65536 (the length of
editDistanceis in bytes and the length ofeditDistanceUTF8is in characters). If the length exceeds this value, an error of the formERROR 1153 (HY000): Got a packet bigger than 'max_allowed_packet' bytesis returned. The error number isOB_ERR_PARAMETER_TOO_LONGand the error message isstring length is too long for edit distance, should be at most 65536.
Examples
Enable the ClickHouse extension function mode.
obclient> ALTER SYSTEM SET sql_func_extension_mode = 'ClickHouse';Calculate the edit distance between the strings
BaseandOceanBase.obclient> SELECT editDistance('Base', 'OceanBase') AS d;The returned result is as follows:
+------+ | d | +------+ | 5 | +------+ 1 row in setDifference between byte-level and character-level. The
editDistancefunction counts the Chinese character as three UTF-8 bytes, while theeditDistanceUTF8function counts the character as one UTF-8 character.obclient> SELECT editDistance('OB', '中OB') AS bytes_diff, editDistanceUTF8('OB', '中OB') AS chars_diff;The returned result is as follows:
+------------+------------+ | bytes_diff | chars_diff | +------------+------------+ | 3 | 1 | +------------+------------+ 1 row in setNULL and empty string edge cases.
obclient> SELECT editDistance(NULL, 'abc') AS null_case, editDistance('', 'abc') AS empty_case;The result is as follows:
+-----------+------------+ | null_case | empty_case | +-----------+------------+ | NULL | 3 | +-----------+------------+ 1 row in set
