In MySQL mode of OceanBase Database, full-text indexes can be created on CHAR, VARCHAR, and TEXT columns. Additionally, OceanBase Database allows you to create multiple full-text indexes on a table, and multiple full-text indexes can be created on the same column.
You can create a full-text index on a non-partitioned table or a partitioned table, whether the table has a primary key or not. However, the following restrictions apply when you create a full-text index:
- Full-text indexes support only columns of the CHAR, VARCHAR, and TEXT types.
- The current version supports the creation of only local (LOCAL) full-text indexes.
- Do not specify the UNIQUE keyword when creating a full-text index.
- If you want to create a full-text index on multiple columns, you must ensure that the columns use the same character set.
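For example, the following sketch satisfies all of these restrictions and defines the articles table used in the DML examples below (the column definitions and index name are inferred from those examples and are illustrative):

CREATE TABLE articles(title VARCHAR(200) PRIMARY KEY, context TEXT,
    FULLTEXT INDEX full_idx_articles(context));  -- a local, non-unique full-text index on a TEXT column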
These syntax rules underpin the full-text search and retrieval capabilities of OceanBase Database.
DML operations
For tables that contain full-text indexes, complex DML operations are supported, such as INSERT INTO ... ON DUPLICATE KEY, REPLACE INTO, multi-table updates and deletes, and DML on updatable views.

Here are some examples:

INSERT INTO ... ON DUPLICATE KEY:

INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key') ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';

REPLACE INTO:

REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');

Update or delete data in multiple tables:
Create a table named tbl1:

CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));

Create a table named tbl2:

CREATE TABLE tbl2 (a int PRIMARY KEY, b text);

Multi-table UPDATE statements:

UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd', tbl2.b = 'eeee';
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;

Multi-table DELETE statements:

DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
DML operations on an updatable view:

Create a view named fts_view:

CREATE VIEW fts_view AS SELECT * FROM tbl1;

The INSERT statement is supported for updatable views:

INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');

The UPDATE statement is supported for updatable views:

UPDATE fts_view SET b = 'dddd';
UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a SET fts_view.b = 'dddd', tbl2.b = 'eeee';

The DELETE statement is supported for updatable views:

DELETE FROM fts_view WHERE b = 'dddd';
DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
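After these DML operations, you can query a full-text indexed column with MATCH ... AGAINST, the MySQL-mode full-text search syntax. A minimal sketch (the search term is illustrative):

SELECT * FROM tbl1 WHERE MATCH(b) AGAINST('dddd');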
Tokenizers supported for full-text indexes
OceanBase Database supports multiple built-in tokenizers for you to choose the optimal text tokenization strategy based on your business scenarios. The Space Tokenizer is the default tokenizer. You must specify other tokenizers by using the WITH PARSER clause.
Tokenizers:

- Space Tokenizer
- Basic English Tokenizer
- IK Tokenizer (the IK tokenizer is supported starting from OceanBase Database V4.3.5 BP1)
- Ngram Tokenizer (the NGRAM2 tokenizer is supported starting from OceanBase Database V4.3.5 BP2)
- jieba Tokenizer
Examples:
Specify the tokenizer by using the CREATE TABLE/ALTER TABLE statement when you create or modify a table and create a full-text index on the table. For more information about how to specify a tokenizer, see Create an index.
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
    FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));

-- Change the tokenizer of an existing full-text index by dropping and re-creating the index.
ALTER TABLE tbl2 DROP INDEX full_idx1_tbl2;
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3);
Space Tokenizer (default)
Concept:

- Splits text by spaces, punctuation marks (such as commas and periods), or non-alphanumeric characters other than underscores (_).
- The tokenization result contains only valid tokens whose length ranges from min_token_size (default value: 3) to max_token_size (default value: 84), as illustrated in the sketch after this list.
- Chinese characters are treated as single characters.
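For example, tokens that fall outside the valid length range are dropped. A quick sketch using the tokenize() function shown in this section (the expected output assumes the default min_token_size of 3):

select tokenize('a bb ccc dddd', 'space');
-- Expected: ["ccc", "dddd"] ("a" and "bb" are shorter than min_token_size)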
Scenarios:
- English and other languages that use spaces as separators (such as "apple watch series 9").
- English phrases with manually added separators (such as "Nanjing Changjiang Bridge").
Tokenization result:
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see www.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see www.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|["For more information", "One square kilometer is also very small", "www", "Nanjing Changjiang Bridge is 1 km long", "Contact xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
Example:
- Spaces, commas, and periods act as separators. Continuous runs of Chinese characters are treated as single words.
Basic English Tokenizer
Concept:
- Similar to the Space Tokenizer, but does not preserve underscores (_), treating them as separators (see the sketch after this list).
- Effective for English phrase separation but less effective for terms without spaces (such as "iPhone15").
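The difference from the Space Tokenizer shows up on identifiers that contain underscores, such as h_name in the example below. A quick sketch (the expected outputs follow the two tokenization results shown in this document and assume the default min_token_size of 3):

select tokenize('h_name', 'space');
-- Expected: ["h_name"] (the underscore is preserved)
select tokenize('h_name', 'beng');
-- Expected: ["name"] (the underscore is a separator, and "h" is shorter than min_token_size)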
Scenarios:
- Basic English document search (such as logs and comments).
Tokenization result:
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------+
|["For more information", "One square kilometer is also very small", "www", "Nanjing Changjiang Bridge is 1 km long", "Contact xx", "word", "hello", "name"] |
+-----------------------------------------------------------------------------------------------------------------------+
Example:
- Underscores (_) are treated as separators. The main difference from the Space Tokenizer is how the two tokenizers handle underscores (_).
Ngram Tokenizer
Concept:
- Fixed-n-value tokenization: By default, n=2. Splits consecutive non-separator characters into subsequences of length n (see the sketch after this list).
- The separator rule is the same as that of the Space Tokenizer, retaining underscores (_).
- Does not support the limit parameter; outputs all possible tokens of length n.
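For instance, with the default n=2, adjacent characters yield overlapping 2-grams. A quick sketch (the expected output assumes the default n=2):

select tokenize('abcd', 'ngram');
-- Expected: ["ab", "bc", "cd"]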
Scenarios:
- Fuzzy matching of short texts (such as user IDs and order numbers).
- Scenarios requiring fixed-length feature extraction (such as password strategy analysis).
Tokenization result:
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'ngram');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM. One square kilometer is also very small. hello-word h_name", 'ngram') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Contact", "www", "Bridge", "ob", "me", "also", "or", "_n", "km", "very", "small", "km", "Bridge", "xx", "square", "com", "also", "1", "Nanjing", "also", "Changjiang", "xx@", "OB", "Nanjing", "Changjiang", "Changjiang", "Bridge" |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example:
- With the default n=2, outputs all consecutive 2-character tokens, including overlapping parts.
Ngram2 Tokenizer
Concept:
- Supports a dynamic n-value range: Specify the token length range by using the min_ngram_size and max_ngram_size parameters, as sketched after this list.
- Suitable for scenarios requiring coverage of tokens of multiple lengths.
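A compact illustration of the dynamic range, reusing the argument format from the tokenize() example below (the expected output assumes that ngram2 emits every consecutive substring whose length falls between min_ngram_size and max_ngram_size):

select tokenize('abcde', 'ngram2', '[{"additional_args":[{"min_ngram_size": 2},{"max_ngram_size": 3}]}]');
-- Expected: ["ab", "bc", "cd", "de", "abc", "bcd", "cde"]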
Scenarios:
- Scenarios requiring tokens of multiple fixed lengths.
Note
When you use the ngram2 tokenizer, be aware of its memory usage: if you set the min_ngram_size and max_ngram_size parameters to large values, the tokenizer generates a large number of token combinations, which may result in excessive resource consumption.
Tokenization result:
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge 1 km", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["Nanjing Changjiang Bridge", "Changjiang Bridge 1 km", "Nanjing Changjiang Bridge 1 km", "Nanjing Changjiang", "Nanjing Changjiang Bridge", "Changjiang Bridge", "Nanjing", "Nanjing Changjiang", "Changjiang Bridge 1 km", "Nanjing Changjiang Bridge 1 km", "Nanjing Changjiang Bridge", "Changjiang Bridge", "Nanjing", "Changjiang Bridge", "Nanjing Changjiang Bridge", "Changjiang Bridge 1 km", "Nanjing Changjiang Bridge", "Nanjing Changjiang Bridge" |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example:
- Outputs all consecutive subsequences of 4 to 6 characters, with overlapping allowed.
IK Tokenizer
Concept:
- A Chinese tokenizer based on the open-source tool IK Analyzer, supporting two modes:
  - Smart mode: Prioritizes outputting long words to reduce the number of splits (e.g., "Nanjing" is not split into "Nan" and "jing").
  - Max Word mode: Outputs all possible short words (e.g., "Nanjing" is split into "Nan" and "jing").
- Automatically recognizes formats such as English words, email addresses, URLs (excluding ://), and IP addresses.
Scenarios:

- Chinese tokenization scenarios:
  - E-commerce product description search (e.g., accurately matching "Huawei Mate 60").
  - Social media content analysis (e.g., extracting keywords from user comments).

Tokenization result:

- Smart mode: Ensures that each character belongs to only one word, with no overlaps, and aims to form the longest words with the fewest splits. Numbers and units of measurement are combined into one word where possible.
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Contact", "hello_word", "192.168.1.1", "hello-word", "Changjiang Bridge", "www.baidu.com", "www.xxx.com", "xx@ob.com", "km", "www", "www.xxx.com", "xx@ob.com", "Changjiang", "ob", "XXX", "com", "For more information", "Nanjing"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- Max Word mode: Allows the same character to be part of different words, outputting as many candidate words as possible.
OceanBase (root@oceanbase)> select tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Changjiang Bridge is 1 km long. For more information, see WWW.XXX.com. Contact xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["km", "Changjiang Bridge", "Nanjing", "Nanjing", "km", "www.xxx.com", "www", "xx@ob.com", "Changjiang", "ob", "XXX", "com", "For more information", "Nanjing", "Changjiang", "Changjiang", "Bridge", "Contact"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
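To use the IK tokenizer in a full-text index rather than in ad-hoc tokenize() calls, specify it in the WITH PARSER clause. A minimal sketch (the table and index names are illustrative, and it assumes that DDL accepts the parser name IK just as the tokenize() calls above do):

CREATE TABLE docs_ik(id INT PRIMARY KEY, body TEXT,
    FULLTEXT INDEX full_idx_docs_ik(body)
    WITH PARSER IK);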
jieba Tokenizer
Concept:
- A tokenizer based on the open-source tool jieba in the Python ecosystem, supporting three modes:
  - Precise mode: Strictly splits text based on the dictionary (e.g., "can't" is not split into "can" and "not").
  - Full mode: Lists all possible splitting combinations.
  - Search engine mode: Balances precision and recall (e.g., "Nanjing Changjiang Bridge" is split into "Nanjing", "Changjiang", and "Bridge").
- Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
Scenarios:
- Medical and scientific term analysis (e.g., precise splitting of "artificial intelligence").
- Mixed-language text processing (e.g., social media content with both Chinese and English).

The jieba tokenizer plugin must be installed manually. For more information about how to install the plugin, see Tokenizer plugin.
Note
- Support for the tokenizer plugin was introduced in OceanBase Database V4.3.5 BP1.
- The tokenizer plugin is an experimental feature and is not recommended for use in production environments.
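Once the plugin is installed, the tokenizer should be callable through the same tokenize() function as the built-in tokenizers. A hypothetical sketch (the parser name 'jieba' is an assumption, not confirmed by this document):

select tokenize('OceanBase full-text search supports the jieba tokenizer', 'jieba');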
References
For more information about how to create a full-text index, see the Create a full-text index section of Create an index.
