In the MySQL-compatible mode of OceanBase Database, you can create full-text indexes on columns of the `CHAR`, `VARCHAR`, and `TEXT` types. You can also create multiple full-text indexes on a single table, including multiple full-text indexes on the same column.
Full-text indexes can be created on both partitioned and non-partitioned tables, regardless of whether they have a primary key. The limitations for creating full-text indexes are as follows:
- You can create full-text indexes only on columns of the `CHAR`, `VARCHAR`, and `TEXT` types.
- In the current version of OceanBase Database, you can create only local full-text indexes.
- When you create a full-text index, you cannot specify the `UNIQUE` keyword.
- To create a full-text index across multiple columns, you must ensure that these columns use the same character set.
By following this syntax and these rules, you can use the full-text index feature of OceanBase Database to efficiently search and retrieve text data.
DML operations
On tables that contain full-text indexes, you can perform complex DML operations such as `INSERT INTO ... ON DUPLICATE KEY UPDATE`, `REPLACE INTO`, multi-table updates and deletions, and DML on updatable views.
Here are some examples:
- `INSERT INTO ... ON DUPLICATE KEY UPDATE`:

  ```sql
  INSERT INTO articles VALUES ('OceanBase', 'Fulltext search index support insert into on duplicate key')
  ON DUPLICATE KEY UPDATE title = 'OceanBase 4.3.3';
  ```

- `REPLACE INTO`:

  ```sql
  REPLACE INTO articles(title, context) VALUES ('Oceanbase 4.3.3', 'Fulltext search index support replace');
  ```

- Updates and deletions on multiple tables.

  Create a table named `tbl1`:

  ```sql
  CREATE TABLE tbl1 (a int PRIMARY KEY, b text, FULLTEXT INDEX(b));
  ```

  Create a table named `tbl2`:

  ```sql
  CREATE TABLE tbl2 (a int PRIMARY KEY, b text);
  ```

  Execute an `UPDATE` statement on multiple tables:

  ```sql
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd', tbl2.b = 'eeee';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl1.b = 'dddd';
  UPDATE tbl1 JOIN tbl2 ON tbl1.a = tbl2.a SET tbl2.b = tbl1.b;
  ```

  Execute a `DELETE` statement on multiple tables:

  ```sql
  DELETE tbl1, tbl2 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  DELETE tbl1 FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.a;
  ```

- DML operations on updatable views.

  Create a view named `fts_view`:

  ```sql
  CREATE VIEW fts_view AS SELECT * FROM tbl1;
  ```

  Execute an `INSERT` statement on the updatable view:

  ```sql
  INSERT INTO fts_view VALUES(3, 'cccc'), (4, 'dddd');
  ```

  Execute an `UPDATE` statement on the updatable view:

  ```sql
  UPDATE fts_view SET b = 'dddd';
  UPDATE fts_view JOIN tbl2 ON fts_view.a = tbl2.a SET fts_view.b = 'dddd', tbl2.b = 'eeee';
  ```

  Execute a `DELETE` statement on the updatable view:

  ```sql
  DELETE FROM fts_view WHERE b = 'dddd';
  DELETE tbl1 FROM fts_view JOIN tbl1 ON fts_view.a = tbl1.a AND 1 = 0;
  ```
Full-text index tokenizers
OceanBase Database supports multiple built-in tokenizers for full-text indexes, so you can choose the best text tokenization strategy for your business scenario. The default tokenizer is the Space tokenizer. Other tokenizers must be explicitly specified by using the WITH PARSER parameter.
Tokenizer list:
- Space tokenizer
- Basic English tokenizer
- IK tokenizer
- Ngram tokenizer
- jieba tokenizer
Configuration example:
When you create or modify a table, specify the tokenizer type by setting the `WITH PARSER tokenizer_option` clause on the full-text index in the `CREATE TABLE` or `ALTER TABLE` statement. For more information about tokenizer property parameters, see Create an index.
```sql
-- Specify the Ngram tokenizer when creating a full-text index
CREATE TABLE tbl2(id INT, name VARCHAR(18), doc TEXT,
  FULLTEXT INDEX full_idx1_tbl2(name, doc)
    WITH PARSER NGRAM
    PARSER_PROPERTIES=(ngram_token_size=3));

-- Add a full-text index with the Ngram tokenizer to an existing table.
-- To change the tokenizer of an existing index, drop the index and re-create it.
ALTER TABLE tbl2 ADD FULLTEXT INDEX full_idx2_tbl2(name, doc)
  WITH PARSER NGRAM
  PARSER_PROPERTIES=(ngram_token_size=3);
```
Space tokenizer (default)
Concept:
- Splits text by spaces, punctuation (such as commas and periods), or non-alphanumeric characters (except the underscore `_`).
- Tokenization results include only valid tokens whose length is between `min_token_size` (default: 3) and `max_token_size` (default: 84).
- Chinese characters are treated as individual characters.
Example description:
- Spaces, commas, periods, and other symbols act as separators. Consecutive Chinese characters are treated as a single word.
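As an illustrative sketch (not OceanBase's actual implementation, and omitting the per-character handling of Chinese text), the Space tokenizer's splitting rules can be approximated in Python: split on any character that is not alphanumeric or an underscore, then keep only tokens within the configured length bounds.

```python
import re

def space_tokenize(text, min_token_size=3, max_token_size=84):
    """Approximate the Space tokenizer: split on any character that is
    not alphanumeric or an underscore, then keep only tokens whose
    length falls within [min_token_size, max_token_size]."""
    tokens = re.split(r"[^0-9A-Za-z_]+", text)
    return [t for t in tokens if min_token_size <= len(t) <= max_token_size]

print(space_tokenize("full-text index, e.g. my_table"))
# → ['full', 'text', 'index', 'my_table']
# 'e' and 'g' are dropped by the default 3-character minimum,
# while 'my_table' survives because '_' is not a separator.
```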
Basic English (Beng) tokenizer
Concept:
- Similar to the Space tokenizer, but does not preserve underscores (`_`), which are treated as separators.
- Suitable for separating English phrases, but has limited effectiveness for terms without spaces (such as "iPhone15").
Use cases:
- Basic retrieval of English documents (such as logs and comments).
Example description:
- Underscores (`_`) act as separators and are split on. The main difference from the Space tokenizer is this handling of `_`.
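The underscore difference can be made concrete with a hedged Python sketch (hypothetical helper names, length filtering omitted; not OceanBase code):

```python
import re

def space_split(text):
    # Space tokenizer rule: underscore is part of a token
    return [t for t in re.split(r"[^0-9A-Za-z_]+", text) if t]

def beng_split(text):
    # Basic English tokenizer rule: underscore acts as a separator
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

print(space_split("user_id=42"))  # → ['user_id', '42']
print(beng_split("user_id=42"))   # → ['user', 'id', '42']
```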
Ngram tokenizer
Concept:
- Fixed n-gram tokenization: by default, `n=2`. Splits consecutive non-separator characters into sub-sequences of length `n`.
- Separator rules are the same as for the Space tokenizer (preserves `_` and alphanumeric characters).
- Does not support length limit parameters; outputs all possible n-length tokens.
Use cases:
- Fuzzy matching of short text (such as user IDs and order numbers).
- Scenarios that require fixed-length feature extraction (such as password policy analysis).
Example description:
- With the default `n=2`, outputs all consecutive 2-character tokens, including overlapping parts.
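The overlapping output within a run of non-separator characters can be sketched in a few lines of Python (an approximation for illustration, not OceanBase's implementation):

```python
def ngram_tokenize(text, n=2):
    """Emit all overlapping substrings of length n, as n-gram
    tokenization does within a run of non-separator characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngram_tokenize("abcd"))     # → ['ab', 'bc', 'cd']
print(ngram_tokenize("abcd", 3))  # → ['abc', 'bcd']
```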
Ngram2 tokenizer
Concept:
- Supports a dynamic n-gram range: use the `min_ngram_size` and `max_ngram_size` parameters to set the token length range.
- Suitable for scenarios that require coverage of multiple token lengths.
Use cases: Scenarios that require multiple fixed-length tokens at the same time.
Note
When you use the Ngram2 tokenizer, note that it consumes more memory than the Ngram tokenizer: when the range between min_ngram_size and max_ngram_size is large, it generates a large number of token combinations, which may lead to high resource consumption.
Example description:
- Outputs all consecutive sub-sequences with lengths between 4 and 6 characters. Tokens can overlap.
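A short Python sketch (illustrative only, not OceanBase code) shows how the token count grows with the configured range, which is the source of the memory note above:

```python
def ngram2_tokenize(text, min_ngram_size=4, max_ngram_size=6):
    """Emit all substrings whose length falls within
    [min_ngram_size, max_ngram_size]; tokens may overlap."""
    tokens = []
    for n in range(min_ngram_size, max_ngram_size + 1):
        tokens.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return tokens

tokens = ngram2_tokenize("abcdef")
print(len(tokens))  # → 6 (3 tokens of length 4, 2 of length 5, 1 of length 6)
```

Widening the range adds one more pass per extra length, so the output grows roughly linearly in the range width times the input length.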
IK tokenizer
Concept:
- A Chinese tokenizer based on the open-source IK Analyzer.
- Automatically recognizes English words, email addresses, URLs (without `://`), IP addresses, and similar formats.
Use cases: Chinese tokenization
Business scenarios:
Social media content analysis (for example, keyword extraction from user comments).
- Smart mode: Ensures that each character belongs to only one token (no overlap), that each token is as long as possible, and that the number of tokens is as small as possible. Attempts to combine numerals and measure words into a single token.
- max_word mode: Allows the same character to appear in different tokens and outputs as many candidate tokens as possible.
jieba tokenizer
Concept: A tokenizer based on the open-source Python tool jieba. Supports precise, full, and search-engine modes.
Features:
- Precise mode: Strictly splits according to the dictionary.
- Full mode: Lists all possible split combinations.
- Search-engine mode: Based on precise mode, further splits long words to improve recall.
Use cases:
- Multilingual mixed text processing (for example, Chinese-English mixed content on social media).
You must install the jieba tokenizer plugin yourself. For installation steps on your build machine, see Tokenizer plugin.
Notice
The tokenizer plugin is currently an experimental feature and is not recommended for production use.
Tokenizer selection strategy
| Business scenario | Recommended tokenizer | Reason |
|---|---|---|
| English product title search | Space or Basic English | Simple and efficient, conforms to English tokenization habits. |
| Chinese product description search | IK tokenizer | Accurately identifies Chinese terms and supports custom dictionaries. |
| Log fuzzy matching (e.g., error codes) | Ngram tokenizer | No dictionary required, covers fuzzy query needs for text without spaces. |
| Technology paper keyword extraction | jieba tokenizer | Supports new word discovery and complex mode switching. |
References
For more information about how to create a full-text index, see the Create a full-text index section in Create an index.