In OceanBase Database in MySQL mode, you can create a full-text index on a CHAR, VARCHAR, or TEXT column. You can create multiple full-text indexes on the same table or on the same column.
You can create a full-text index on a non-partitioned table or a partitioned table with or without a primary key. The following rules apply:
- A full-text index can be created only on a `CHAR`, `VARCHAR`, or `TEXT` column.
- Only a local (`LOCAL`) full-text index can be created in the current version.
- You cannot specify the `UNIQUE` keyword when you create a full-text index.
- If you want to create a full-text index that involves multiple columns, make sure that the columns have the same character set.
Within these syntax rules, OceanBase Database provides efficient search and retrieval capabilities for text data.
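For reference, the following minimal sketch (hypothetical table and column names) creates a table with a local full-text index on a TEXT column and then queries the column through the index, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Hypothetical example: a local full-text index on a TEXT column.
CREATE TABLE docs (
  id   INT PRIMARY KEY,
  body TEXT,
  FULLTEXT INDEX full_idx_body(body)
);

-- Query the column through the full-text index.
SELECT id FROM docs WHERE MATCH(body) AGAINST('oceanbase');
```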
DML operations
For tables with full-text indexes, the following complex DML operations are supported: INSERT INTO ... ON DUPLICATE KEY, REPLACE INTO, multi-table updates and deletes, and DML on updatable views.
Here are some examples:
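The INSERT INTO ON DUPLICATE KEY and REPLACE INTO examples below reference an articles table whose definition is not shown in this section; a hypothetical definition such as the following (the column types and primary key are assumptions) makes them runnable:

```sql
-- Hypothetical schema for the articles table used in the following examples.
CREATE TABLE articles (
  title   VARCHAR(200) PRIMARY KEY,
  context TEXT,
  FULLTEXT INDEX full_idx_context(context)
);
```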
- INSERT INTO ON DUPLICATE KEY:

  ```sql
  INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key') ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
  ```

- REPLACE INTO:

  ```sql
  REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
  ```

- Multi-table updates and deletes.

  Create a table named tbl1.

  ```sql
  CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
  ```

  Create a table named tbl2.

  ```sql
  CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
  ```

  Use the UPDATE statement for multiple tables.

  ```sql
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd', tbl2.b = 'eeee';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
  ```

  Use the DELETE statement for multiple tables.

  ```sql
  DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  ```

- DML operations on updatable views.

  Create a view named fts_view.

  ```sql
  CREATE VIEW fts_view AS SELECT * FROM tbl1;
  ```

  Use the INSERT statement for updatable views.

  ```sql
  INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
  ```

  Use the UPDATE statement for updatable views.

  ```sql
  UPDATE fts_view SET b = 'dddd';
  UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a SET fts_view.b = 'dddd', tbl2.b = 'eeee';
  ```

  Use the DELETE statement for updatable views.

  ```sql
  DELETE FROM fts_view WHERE b = 'dddd';
  DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
  ```
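As a quick sanity check, rows written to tbl1 directly, through multi-table statements, or through fts_view remain searchable through the full-text index; a minimal sketch, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Full-text query against tbl1 after the DML operations above.
SELECT a, b FROM tbl1 WHERE MATCH(b) AGAINST('dddd');
```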
Full-text index tokenizer
OceanBase Database supports multiple built-in tokenizers for full-text indexes, allowing you to select the optimal text tokenization strategy for your business scenario. The default tokenizer is the Space tokenizer; other tokenizers must be explicitly specified by using the WITH PARSER parameter.
Tokenizer list:
- Space tokenizer
- Basic English tokenizer
- IK tokenizer
- Ngram tokenizer
- jieba tokenizer
Configuration method:
When creating or modifying a table, use the CREATE TABLE/ALTER TABLE statement to specify the tokenizer type for the full-text index by setting the WITH PARSER tokenizer_option parameter.
```sql
-- Specify the tokenizer when creating a table (Ngram example)
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
  FULLTEXT INDEX full_idx1_tbl2(name, doc)
  WITH PARSER NGRAM
  PARSER_PROPERTIES=(ngram_token_size=3));

-- Alternatively, add a full-text index with the specified tokenizer to an existing table (Ngram example)
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx1_tbl2(name, doc)
  WITH PARSER NGRAM
  PARSER_PROPERTIES=(ngram_token_size=3);
```
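The index created above can then be used in full-text queries; a minimal sketch, assuming the MySQL-compatible MATCH ... AGAINST syntax:

```sql
-- Query both indexed columns through the Ngram full-text index.
SELECT id FROM tbl2 WHERE MATCH(name, doc) AGAINST('oceanbase');
```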
Space tokenizer (default)
Concepts:
- Splits text based on spaces, punctuation (such as commas and periods), or non-alphanumeric characters (excluding underscores `_`).
- The tokenization result only includes valid tokens with lengths between `min_token_size` (default 3) and `max_token_size` (default 84).
- Chinese characters are treated as single-character tokens.
Scenarios:
- Languages separated by spaces, such as English ("apple watch series 9").
- Chinese text with manually added separators, such as "Nanjing Yangtze River Bridge".
Tokenization effect:
OceanBase(rooteoceanbase)>select tokenize("Nanjing Yangtze River Bridge has 1 km, see details at www.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'space');
+-------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge has 1 km, see details at www.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'space') |
+-------------------------------------------------------------------------------------------------------------+
|["see details at www", "1 square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 km", "email xx", "word", "hello", "h_name"] |
+-------------------------------------------------------------------------------------------------------------+
Example explanation:
- Spaces, commas, periods, and other symbols are used as separators. Continuous Chinese characters are treated as words.
Basic English (Beng) tokenizer
Concepts:
- Similar to the Space tokenizer, but underscores `_` are not preserved and are treated as separators.
- Suitable for splitting English phrases, but may not effectively split terms without spaces (such as "iPhone15").
Scenarios:
- Basic retrieval of English documents, such as logs and comments.
Tokenization effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing Yangtze River Bridge has 1 km, see details at WWW.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'beng');
+-----------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing Yangtze River Bridge has 1 km, see details at WWW.XXX.COM, email xx@OB.COM, 1 square kilometer is also small hello-word h_name", 'beng') |
+-----------------------------------------------------------------------------------------------------------------------+
|["see details at www", "1 square kilometer is also small", "xxx", "Nanjing Yangtze River Bridge has 1 km", "email xx", "word", "hello", "name"] |
+-----------------------------------------------------------------------------------------------------------------------+
Example explanation:
- Underscores `_` are treated as separators and are split. This handling of underscores `_` is the core difference from the Space tokenizer.
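To build an index with this tokenizer, specify it with WITH PARSER when creating the table. The following sketch uses hypothetical table and column names and assumes that the parser name BENG matches the identifier used by tokenize() ('beng'):

```sql
-- Hypothetical example: a full-text index that uses the Basic English tokenizer.
CREATE TABLE logs (
  id  INT PRIMARY KEY,
  msg TEXT,
  FULLTEXT INDEX full_idx_msg(msg) WITH PARSER BENG
);
```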
Ngram tokenizer
Concepts:
- Fixed n-value tokenization: by default, `n=2`. It splits consecutive non-separator characters into subsequences of length `n`.
- Separator determination rules are the same as for the Space tokenizer (retaining `_` and alphanumeric characters).
- Length limit parameters are not supported. All possible `n`-length tokens are output.
Use Cases:
- Fuzzy matching for short texts (e.g., user IDs, order numbers).
- Scenarios requiring fixed-length feature extraction (e.g., password policy analysis).
Tokenization Effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge has 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge has 1 kilometer long, see WWW.XXX.COM, email xx@OB.COM, one square kilometer is also small hello-word h_name", 'ngram') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["email", "ww", "Bridge", "ob", "me", "also", "or", "_n", "kilometer", "small", "meter long", "ll", "email x", "square kilometer", "see w", "co", "also", "1 thousand", "City", "lo", "Yangtze River", "el", "rd", "one square", "square kilometer", "he", "am", "Nanjing", "h_", "City", "wo", "xx", "Yangtze River", "has 1", "na", "see", "square", "om", "Bridge has" |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example Explanation:
- With the default `n=2`, all consecutive 2-character tokens, including overlapping parts, are output.
Ngram2 tokenizer
Concepts:
- Supports a dynamic n-value range: set the token length range by using `min_ngram_size` and `max_ngram_size`.
- Suitable for scenarios that require coverage with multiple token lengths.
Use Cases: Scenarios requiring multiple fixed-length tokens.
Note
When using the ngram2 tokenizer, be aware of its high memory usage. For example, if you set a large range for min_ngram_size and max_ngram_size, it will generate a large number of token combinations, which may lead to excessive resource consumption.
Tokenization Effect:
OceanBase(root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge 1 kilometer", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]');
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge 1 kilometer", 'ngram2', '[{"additional_args":[{"min_ngram_size": 4},{"max_ngram_size": 6}]}]') |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["Yangtze River Bridge", "Bridge 1 thousand", "River Bridge 1 thousand", "City Yangtze River", "Nanjing City Yangtze", "River Bridge 1", "Nanjing City", "City Yangtze River", "Bridge 1 kilometer", "River Bridge 1 kilometer", "City Yangtze River Bridge 1", "Yangtze River Bridge 1", "Nanjing City Yangtze", "Bridge 1 kilometer", "Nanjing City Yangtze River", "Yangtze River Bridge 1 thousand", "Nanjing City Yangtze River Bridge", "Nanjing City Yangtze River" |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Example Explanation:
- Outputs all consecutive subsequences of lengths 4 to 6, and tokens can overlap.
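By analogy with the Ngram example earlier in this topic, the token length range could presumably also be set through PARSER_PROPERTIES when creating an index; the parser name NGRAM2 and the property names below mirror the tokenize() arguments and are assumptions, not confirmed by this section:

```sql
-- Hypothetical sketch: an ngram2 full-text index with token lengths from 2 to 3.
CREATE TABLE ids (
  id   INT PRIMARY KEY,
  code VARCHAR(64),
  FULLTEXT INDEX full_idx_code(code)
  WITH PARSER NGRAM2
  PARSER_PROPERTIES=(min_ngram_size=2, max_ngram_size=3)
);
```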
IK tokenizer
Concepts:
A Chinese tokenizer based on the open-source tool IK Analyzer, supporting two modes:
- Smart mode: Prioritizes longer words to minimize segmentation (e.g., "Nanjing City" is not segmented into "Nanjing" and "City").
- Max Word mode: Outputs all possible short words (e.g., "Nanjing City" is segmented into "Nanjing" and "City").

Automatically recognizes formats such as English words, email addresses, URLs (excluding `://`), and IP addresses.
Scenarios: Chinese text segmentation
Business scenarios:
E-commerce product description search (e.g., precise matching of "Huawei Mate 60").
Social media content analysis (e.g., keyword extraction from user comments).
- Smart mode: Ensures that each character belongs to only one word, with no overlaps, and maximizes the length of each word while minimizing the total number of words. It also attempts to combine numerals and measure words into a single word.
OceanBase (root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM 192.168.1.1 http://www.baidu.com hello-word hello_word", 'IK', '[{"additional_args":[{"ik_mode": "smart"}]}]') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Email", "hello_word", "192.168.1.1", "hello-word", "Yangtze River Bridge", "www.baidu.com", "www.xxx.com", "xx@ob.com", "Long", "http", "1 km", "See", "Nanjing City", "Have"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- Max Word mode: Includes the same character in different segments to provide as many possible words as possible.
OceanBase (root@oceanbase)> select tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize("Nanjing City Yangtze River Bridge is 1 km long, see WWW.XXX.COM, email xx@OB.COM", 'IK', '[{"additional_args":[{"ik_mode": "max_word"}]}]') |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|["Mile", "Yangtze River Bridge", "Mayor", "Dry", "Nanjing City", "Nanjing", "km", "xx", "www.xxx.com", "Long", "www", "xx@ob.com", "Yangtze", "ob", "XXX", "com", "See", "l", "Have", "Bridge", "Email"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
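To build an index with this tokenizer, WITH PARSER IK can be specified when creating the table; the table and column names below are hypothetical, and how to select the smart or max_word mode at index creation time is not shown in this section:

```sql
-- Hypothetical example: a full-text index that uses the IK tokenizer.
CREATE TABLE products (
  id    INT PRIMARY KEY,
  descr TEXT,
  FULLTEXT INDEX full_idx_descr(descr) WITH PARSER IK
);
```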
jieba tokenizer
Concepts: A tokenizer based on the open-source tool jieba in the Python ecosystem, supporting precise mode, full mode, and search engine mode.
Features:
- Precise mode: Segments strictly according to the dictionary (e.g., "cannot" is not segmented into "not" and "can").
- Full mode: Lists all possible segmentation combinations.
- Search engine mode: Balances precision and recall rate (e.g., "Nanjing City Yangtze River Bridge" is segmented into "Nanjing", "Mayor", and "Yangtze River Bridge").
- Supports custom dictionaries and new word discovery, and is compatible with multiple languages (Chinese, English, Japanese, etc.).
Scenarios:
- Medical and scientific terminology analysis (e.g., precise segmentation of "artificial intelligence").
- Processing of mixed-language text (e.g., social media content with both Chinese and English).
To use the jieba tokenizer plugin, you must install it yourself. For instructions on installing it on the build machine, see Tokenizer plugin.
Notice
The current tokenizer plugin is an experimental feature and is not recommended for use in a production environment.
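If the plugin is installed, the segmentation could presumably be inspected with tokenize(), as with the built-in tokenizers; the 'jieba' identifier below is an assumption and is not confirmed by this section:

```sql
-- Hypothetical check, assuming the plugin registers under the name 'jieba'.
SELECT tokenize("Nanjing City Yangtze River Bridge", 'jieba');
```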
Tokenizer selection strategy
| Business scenario | Recommended tokenizer | Reason |
|---|---|---|
| Search for English product titles | Space or Basic English | Simple and efficient, aligning with English tokenization habits. |
| Retrieval of Chinese product descriptions | IK Tokenizer | Accurately identifies Chinese terms and supports custom dictionaries. |
| Fuzzy matching in logs (e.g., error codes) | Ngram Tokenizer | No dictionary required, covering fuzzy query needs for text without spaces. |
| Keyword extraction from scientific papers | jieba Tokenizer | Supports new word discovery and complex mode switching. |
References
For more information about how to create a full-text index, see the Create a Full-Text Index section in Create an Index.
