In OceanBase Database in MySQL mode, you can create a full-text index on a CHAR, VARCHAR, or TEXT column. You can create multiple full-text indexes on the same table or on the same column.
You can create a full-text index on a non-partitioned table or a partitioned table with or without a primary key. The following rules apply:
- A full-text index can be created only on a `CHAR`, `VARCHAR`, or `TEXT` column.
- Only a local (`LOCAL`) full-text index can be created in the current version.
- You cannot specify the `UNIQUE` keyword when you create a full-text index.
- If you want to create a full-text index that involves multiple columns, make sure that the columns have the same character set.
Within these syntax rules, OceanBase Database provides efficient search and retrieval capabilities for text data.
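For reference, the following minimal sketch (hypothetical table and column names) creates a table with a local full-text index on a TEXT column and then queries the column through the index, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Hypothetical example: a local full-text index on a TEXT column.
CREATE TABLE docs (
  id   INT PRIMARY KEY,
  body TEXT,
  FULLTEXT INDEX full_idx_body(body)
);

-- Query the column through the full-text index.
SELECT id FROM docs WHERE MATCH(body) AGAINST('oceanbase');
```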
DML operations
For tables with full-text indexes, the following complex DML operations are supported: INSERT INTO ... ON DUPLICATE KEY, REPLACE INTO, multi-table updates and deletes, and DML on updatable views.
Here are some examples:
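The INSERT INTO ON DUPLICATE KEY and REPLACE INTO examples below reference an articles table whose definition is not shown in this section; a hypothetical definition such as the following (the column types and primary key are assumptions) makes them runnable:

```sql
-- Hypothetical schema for the articles table used in the following examples.
CREATE TABLE articles (
  title   VARCHAR(200) PRIMARY KEY,
  context TEXT,
  FULLTEXT INDEX full_idx_context(context)
);
```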
- INSERT INTO ON DUPLICATE KEY:

  ```sql
  INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key') ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
  ```

- REPLACE INTO:

  ```sql
  REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
  ```

- Multi-table updates and deletes.

  Create a table named tbl1.

  ```sql
  CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
  ```

  Create a table named tbl2.

  ```sql
  CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
  ```

  Use the UPDATE statement for multiple tables.

  ```sql
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd', tbl2.b = 'eeee';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
  ```

  Use the DELETE statement for multiple tables.

  ```sql
  DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  ```

- DML operations on updatable views.

  Create a view named fts_view.

  ```sql
  CREATE VIEW fts_view AS SELECT * FROM tbl1;
  ```

  Use the INSERT statement for updatable views.

  ```sql
  INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
  ```

  Use the UPDATE statement for updatable views.

  ```sql
  UPDATE fts_view SET b = 'dddd';
  UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a SET fts_view.b = 'dddd', tbl2.b = 'eeee';
  ```

  Use the DELETE statement for updatable views.

  ```sql
  DELETE FROM fts_view WHERE b = 'dddd';
  DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
  ```
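As a quick sanity check, rows written to tbl1 directly, through multi-table statements, or through fts_view remain searchable through the full-text index; a minimal sketch, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Full-text query against tbl1 after the DML operations above.
SELECT a, b FROM tbl1 WHERE MATCH(b) AGAINST('dddd');
```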
Full-text index tokenizer
OceanBase Database supports multiple built-in tokenizers for full-text indexes, allowing you to select the optimal text tokenization strategy for your business scenario. The default tokenizer is the Space tokenizer; other tokenizers must be explicitly specified by using the WITH PARSER parameter.
Tokenizer list:
- Space tokenizer
- Basic English tokenizer
- IK tokenizer
- Ngram tokenizer
- jieba tokenizer
Configuration method:
When creating or modifying a table, use the CREATE TABLE/ALTER TABLE statement to specify the tokenizer type for the full-text index by setting the WITH PARSER tokenizer_option parameter.
```sql
-- Specify the tokenizer when creating a table (Ngram example)
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
  FULLTEXT INDEX full_idx1_tbl2(name, doc)
  WITH PARSER NGRAM
  PARSER_PROPERTIES=(ngram_token_size=3));

-- Alternatively, add a full-text index with the specified tokenizer to an existing table (Ngram example)
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx1_tbl2(name, doc)
  WITH PARSER NGRAM
  PARSER_PROPERTIES=(ngram_token_size=3);
```
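The index created above can then be used in full-text queries; a minimal sketch, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Query both indexed columns through the Ngram full-text index.
SELECT id FROM tbl2 WHERE MATCH(name, doc) AGAINST('oceanbase');
```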
Space tokenizer (default)
Concepts:
- Splits text based on spaces, punctuation (such as commas and periods), or non-alphanumeric characters (excluding underscores `_`).
- The tokenization result only includes valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
- Chinese characters are treated as single-character tokens.
Scenarios:
- Languages separated by spaces, such as English ("apple watch series 9").
- Chinese text with manually added separators, such as "Nanjing Yangtze River Bridge".
Tokenization effect:
OceanBase(rooteoceanbase)>select tokenize("Nanjing Yangtze River Bridge has 1 km, see details at www.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge has 1 km, see details at www.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|["see details at www", "1 square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 km", "email xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
Example explanation:
- Spaces, commas, periods, and other symbols are used as separators. Continuous Chinese characters are treated as words.
Basic English (Beng) tokenizer
Concepts:
- Similar to the Space tokenizer, but underscores `_` are not preserved and are treated as separators.
- Suitable for splitting English phrases, but may not effectively split terms without spaces (such as "iPhone15").
Scenarios:
- Basic retrieval of English documents, such as logs and comments.
Tokenization effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing Yangtze River Bridge has 1 km, see details at WWW.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge has 1 km, see details at WWW.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------+
|["see details at www", "1 square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 km", "email xx", "word", "hello", "name"] |
+-----------------------------------------------------------------------------------------------------------------------+
Example explanation:
- Underscores `_` are treated as separators and are split. This handling of underscores `_` is the core difference from the Space tokenizer.
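To build an index with this tokenizer, specify it with WITH PARSER when creating the table. The following sketch uses hypothetical table and column names and assumes that the parser name BENG matches the identifier used by tokenize() ('beng'):

```sql
-- Hypothetical example: a full-text index that uses the Basic English tokenizer.
CREATE TABLE logs (
  id  INT PRIMARY KEY,
  msg TEXT,
  FULLTEXT INDEX full_idx_msg(msg) WITH PARSER BENG
);
```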
Ngram tokenizer
Concepts:
- Fixed n-value tokenization: by default, `n=2`. It splits consecutive non-separator characters into subsequences of length `n`.
- Separator determination rules are the same as for the Space tokenizer (retaining `_` and alphanumeric characters).
- Length limit parameters are not supported. All possible `n`-length tokens are output.
Use Cases:
- Fuzzy matching for short texts (e.g., user IDs, order numbers).
- Scenarios requiring fixed-length feature extraction (e.g., password policy analysis).
Tokenization Effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge has 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge has 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["email", "ww", "Bridge", "ob", "me", "also", "or", "_n", "kilometer", "small", "meter long", "ll", "email x", "square kilometer", "see w", "co", "also", "1 thousand", "City", "lo", "Yangtze River", "el", "rd", "one square", "square kilometer", "he", "am", "Nanjing", "h_", "City", "wo", "xx", "Yangtze River", "has 1", "na", "see", "square", "om", "Bridge has" |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example Explanation:
- With the default `n=2`, all consecutive 2-character tokens, including overlapping parts, are output.
Ngram2 tokenizer
Concepts:
- Supports a dynamic n-value range: set the token length range by using `min_ngram_size` and `max_ngram_size`.
- Suitable for scenarios that require coverage with multiple token lengths.
Use Cases: Scenarios requiring multiple fixed-length tokens.
Note
When using the ngram2 tokenizer, be aware of its high memory usage. For example, if you set a large range for min_ngram_size and max_ngram_size, it will generate a large number of token combinations, which may lead to excessive resource consumption.
Tokenization Effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge 1 kilometer", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge 1 kilometer", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["Yangtze River Bridge", "Bridge 1 thousand", "River Bridge 1 thousand", "City Yangtze River", "Nanjing City Yangtze", "River Bridge 1", "Nanjing City", "City Yangtze River", "Bridge 1 kilometer", "River Bridge 1 kilometer", "City Yangtze River Bridge 1", "Yangtze River Bridge 1", "Nanjing City Yangtze", "Bridge 1 kilometer", "Nanjing City Yangtze River", "Yangtze River Bridge 1 thousand", "Nanjing City Yangtze River Bridge", "Nanjing City Yangtze River" |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example Explanation:
- Outputs all consecutive subsequences of lengths 4 to 6, and tokens can overlap.
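By analogy with the Ngram example earlier in this topic, the token length range could presumably also be set through PARSER_PROPERTIES when creating an index; the parser name NGRAM2 and the property names below mirror the tokenize() arguments and are assumptions, not confirmed by this section:

```sql
-- Hypothetical sketch: an ngram2 full-text index with token lengths from 2 to 3.
CREATE TABLE ids (
  id   INT PRIMARY KEY,
  code VARCHAR(64),
  FULLTEXT INDEX full_idx_code(code)
  WITH PARSER NGRAM2
  PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=3)
);
```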
IK tokenizer
Concepts:
A Chinese tokenizer based on the open-source tool IK Analyzer, supporting two modes:
- Smart mode: Prioritizes longer words to minimize segmentation (e.g., "Nanjing City" is not segmented into "Nanjing" and "City").
- Max Word mode: Outputs all possible short words (e.g., "Nanjing City" is segmented into "Nanjing" and "City").

Automatically recognizes formats such as English words, email addresses, URLs (excluding `://`), and IP addresses.
Scenarios: Chinese text segmentation
Business scenarios:
E-commerce product description search (e.g., precise matching of "Huawei Mate 60").
Social media content analysis (e.g., keyword extraction from user comments).
- Smart mode: Ensures that each character belongs to only one word, with no overlaps, and maximizes the length of each word while minimizing the total number of words. It also attempts to combine numerals and measure words into a single word.
OceanBase (root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Email", "hello_word", "192.168.1.1", "hello-word", "Yangtze River Bridge", "www.baidu.com", "www.xxx.com", "xx@ob.com", "Long", "http", "1 km", "See", "Nanjing City", "Have"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- Max Word mode: Includes the same character in different segments to provide as many possible words as possible.
OceanBase (root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Mile", "Yangtze River Bridge", "Mayor", "Dry", "Nanjing City", "Nanjing", "km", "xx", "www.xxx.com", "Long", "www", "xx@ob.com", "Yangtze", "ob", "XXX", "com", "See", "l", "Have", "Bridge", "Email"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
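To build an index with this tokenizer, WITH PARSER IK can be specified when creating the table; the table and column names below are hypothetical, and how to select the smart or max_word mode at index creation time is not shown in this section:

```sql
-- Hypothetical example: a full-text index that uses the IK tokenizer.
CREATE TABLE products (
  id    INT PRIMARY KEY,
  descr TEXT,
  FULLTEXT INDEX full_idx_descr(descr) WITH PARSER IK
);
```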
jieba tokenizer
Concepts: A tokenizer based on the open-source tool jieba in the Python ecosystem, supporting precise mode, full mode, and search engine mode.
Features:
- Precise mode: Segments strictly according to the dictionary (e.g., "cannot" is not segmented into "not" and "can").
- Full mode: Lists all possible segmentation combinations.
- Search engine mode: Balances precision and recall rate (e.g., "Nanjing City Yangtze River Bridge" is segmented into "Nanjing", "Mayor", and "Yangtze River Bridge").
- Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
Scenarios:
- Medical and scientific terminology analysis (e.g., precise segmentation of "artificial intelligence").
- Processing of mixed-language text (e.g., social media content with both Chinese and English).
To use the jieba tokenizer plugin, you must install it yourself. For instructions on installing it on the build machine, see Tokenizer plugin.
Notice
The current tokenizer plugin is an experimental feature and is not recommended for use in a production environment.
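If the plugin is installed, the segmentation could presumably be inspected with tokenize(), as with the built-in tokenizers; the 'jieba' identifier below is an assumption and is not confirmed by this section:

```sql
-- Hypothetical check, assuming the plugin registers under the name 'jieba'.
SELECT tokenize("Nanjing City Yangtze River Bridge", 'jieba');
```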
Tokenizer selection strategy
| Business scenario | Recommended tokenizer | Reason |
|---|---|---|
| Search for English product titles | Space or Basic English | Simple and efficient, aligning with English tokenization habits. |
| Retrieval of Chinese product descriptions | IK Tokenizer | Accurately identifies Chinese terms and supports custom dictionaries. |
| Fuzzy matching in logs (e.g., error codes) | Ngram Tokenizer | No dictionary required, covering fuzzy query needs for text without spaces. |
| Keyword extraction from scientific papers | jieba Tokenizer | Supports new word discovery and complex mode switching. |
References
For more information about how to create a full-text index, see the Create a Full-Text Index section in Create an Index.
