Tokenizer plugin

Last Updated: 2025-05-22 11:27:53

This article explains how to install and use the tokenizer plugin in OceanBase Database. The tokenizer plugin can be used for full-text indexing and supports custom tokenization logic to meet specific business requirements.

Notice

For OceanBase Database V4.3.5, the tokenizer plugin is supported starting from V4.3.5 BP1.
The tokenizer plugin is currently experimental and is not recommended for use in production environments.

Prerequisites

Before installing and using the tokenizer plugin, ensure the following conditions are met:

- The operating system supports the yum package management tool.

  Note

  If your system does not support yum, you can install the required dependencies through alternative methods, such as manually downloading RPM packages or using other package management tools.
- An OceanBase cluster is deployed and running properly.
- You have the privileges to modify configuration parameters and restart the cluster.

Procedure

Step 1: Install the development environment

Developing a tokenizer plugin requires a C/C++ compilation environment and the plugin development toolkit provided by OceanBase. Execute the following commands to install the C/C++ development environment and the OceanBase plugin development kit:

Install the basic compilation tools.

Install the essential tools and libraries required for compiling C/C++ programs.
```
yum install -y cmake make glibc-devel glibc-headers gcc gcc-c++
```
Configure the OceanBase software source.

Add the official OceanBase software source so that you can use yum to install OceanBase tools and dependencies later.
1. Install the yum-utils tool.
```
yum install -y yum-utils
```
2. Add the OceanBase software source.
```
yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
```
Install the plugin development toolkit.
```
yum install -y oceanbase-plugin-dev-kit
```
After installation, you can find the example code file for the OceanBase tokenizer plugin, space_ftparser.cpp, in the /usr/share/examples/ObPlugin/ftparser directory.
```
ls /usr/share/examples/ObPlugin/ftparser
```
The return result is as follows:
```
CMakeLists.txt  space_ftparser.cpp
```
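
Before moving on, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of the OceanBase tooling: `missing_tools` is a hypothetical helper name, and the tool list in the example call simply mirrors the packages installed in this step.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with OceanBase): print every command
# from the argument list that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call, mirroring the packages installed in this step:
#   missing_tools cmake make gcc g++
# Empty output means every listed tool was found.
```

If the call prints any names, install the corresponding packages before continuing with Step 2.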

Step 2: Obtain a development template

Copy the OceanBase tokenizer plugin example code space_ftparser.cpp to your own development directory. You can then modify the space_ftparser.cpp file to develop your custom tokenizer plugin.

Here is an example. The target directory is created first, because cp fails if it does not exist:
```
mkdir -p /home/admin/test_plugin_dev
cp /usr/share/examples/ObPlugin/ftparser/* /home/admin/test_plugin_dev
ls /home/admin/test_plugin_dev
```
The return result is as follows:
```
CMakeLists.txt  space_ftparser.cpp
```

The core file of the example code is space_ftparser.cpp. You can modify this file according to your business requirements to implement custom tokenization logic.
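
To get a feel for what a whitespace-based tokenizer produces before editing the C++ source, the sketch below splits a sentence on whitespace in plain shell. It is only an illustration, under the assumption that the example plugin splits text on whitespace (as its file name suggests). It does not use the plugin development kit, and `tokenize` is a hypothetical helper name.

```shell
#!/bin/sh
# Illustration only: split text into whitespace-separated tokens, one per line.
# This mimics the idea of whitespace tokenization; it is NOT the plugin code.
tokenize() {
    printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Example call:
#   tokenize 'Alice loves programming and enjoys long walks.'
# In this sketch, punctuation stays attached to the neighbouring word
# (for example "walks.").
```

Custom tokenization logic typically differs from this in exactly such details, for example how punctuation, case, or stop words are handled.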

Step 3: Compile and install the plugin

Modify the build configuration (CMakeLists.txt).

In the root directory of the sample code, you will find a CMakeLists.txt file. This file contains "TODO" markers indicating the sections that need to be modified:
- PLUGIN_NAME: the name of the current plugin, which will also be the name of the project and the generated dynamic library. Modify it to the desired name.
- SOURCES: the list of implementation files, which can include C or C++ source files. If you add new implementation files, include their paths here. Do not include header files in this list.
Compile the code.

Follow these steps to compile the code:
1. Switch to the working directory.
```
cd /your/work/path/ftparser
```
  Navigate to your tokenizer plugin development directory. Replace /your/work/path/ftparser with the actual path to your development directory.
2. Create a build directory.
```
mkdir -p build
```
  Create a directory named build to store intermediate files and final outputs generated during the compilation process. The -p option ensures no error occurs if the directory already exists.
3. Enter the build directory.
```
cd build
```
  Navigate to the build directory. Subsequent compilation steps will be performed here to keep the source directory clean.
4. Configure the build environment.
```
cmake ..
```
  Run the cmake command to read the CMakeLists.txt file from the parent directory and generate the required Makefile. The .. indicates that the CMakeLists.txt file is located in the parent directory.
5. Compile the source code.
```
make
```
  Run the make command to compile the source code based on the generated Makefile. This will create a dynamic library file (for example, libexample_ftparser.so).
  
  After a successful compilation, the dynamic library file will be generated in the current directory (the build directory).
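
The five steps above can be collected into one small function, which is convenient when rebuilding repeatedly during development. This is a sketch rather than an official script: `build_plugin` and the `DRY_RUN` switch are hypothetical names introduced here, and the commands are the same `cmake ..` and `make` calls shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the build steps above.
# Usage: build_plugin /path/to/plugin/source
# Set DRY_RUN=1 to print the cmake/make commands instead of running them
# (the build directory is still created in that case).
build_plugin() {
    src_dir="$1"
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$src_dir" &&
        mkdir -p build &&
        cd build &&
        ${DRY_RUN:+echo} cmake .. &&
        ${DRY_RUN:+echo} make
    )
}
```

For example, `build_plugin /home/admin/test_plugin_dev` would build the template copied in Step 2. The function stops at the first failing step and returns a non-zero status.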
Copy the compiled output.

Once the compilation is complete, the dynamic library file (such as libexample_ftparser.so) is located in the build directory. Copy this file to the plugin_dir directory on every OBServer node in the OceanBase cluster.
```
cp libexample_ftparser.so /path/to/plugin_dir/
```
Replace /path/to/plugin_dir/ with the actual path of plugin_dir. You can obtain it by querying the plugin_dir parameter, for example with SHOW PARAMETERS LIKE 'plugin_dir';.
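
Because the library must be present on every node, a loop over the nodes avoids missing one. The sketch below assumes SSH access with scp to each node and the same plugin_dir path on all of them. `deploy_plugin`, the `DRY_RUN` switch, and the host names in the usage example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy one plugin library to the same plugin_dir on
# several OBServer nodes over SSH.
# Usage: deploy_plugin <library.so> <plugin_dir> <host>...
# Set DRY_RUN=1 to print the scp commands instead of running them.
deploy_plugin() {
    lib="$1"
    plugin_dir="$2"
    shift 2
    for host in "$@"; do
        # Stop at the first node where the copy fails.
        ${DRY_RUN:+echo} scp "$lib" "$host:$plugin_dir/" || return 1
    done
}
```

A call might look like `deploy_plugin libexample_ftparser.so /path/to/plugin_dir node1 node2 node3`, where the host names are placeholders for your own nodes.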
Load the plugin.

Log in to OceanBase Database as the sys tenant and modify the configuration parameter plugins_load to load the plugin:
```
ALTER SYSTEM SET plugins_load='libexample_ftparser.so';
```
Restart the cluster to make the plugin take effect.
- For an OceanBase cluster managed by OceanBase Deployer (obd), you can run the following command to restart the cluster.
```
obd cluster restart <cluster_name>
```
  Replace <cluster_name> with the actual cluster name.
- For a cluster managed by OceanBase Cloud Platform (OCP), you can directly restart the cluster in OCP.

View the installed tokenizer plugin.

```
SELECT * FROM oceanbase.GV$OB_PLUGINS;
```

The return result is as follows:

```
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
| SVR_IP    | SVR_PORT | NAME             | STATUS | TYPE     | LIBRARY                | LIBRARY_VERSION | LIBRARY_REVISION | INTERFACE_VERSION | AUTHOR                | LICENSE       | DESCRIPTION                                 |
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
| 127.0.0.1 |    55801 | ngram            | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a ngram fulltext parser plugin.     |
| 127.0.0.1 |    55801 | beng             | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a basic english parser plugin.      |
| 127.0.0.1 |    55801 | space            | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a default whitespace parser plugin. |
| 127.0.0.1 |    55801 | example_ftparser | READY  | FTPARSER | libexample_ftparser.so | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PSL v2  | This is an example ftparser.                |
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
```

If LIBRARY is NULL, it indicates that the tokenizer is built in.
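
When running this check from a shell script, note that the `$` in GV$OB_PLUGINS is expanded by the shell unless it is quoted. The sketch below prints a query that lists only plugins loaded from a dynamic library, using the column names shown in the output above. `custom_plugins_sql` is a hypothetical helper name, and the `LIBRARY IS NOT NULL` filter assumes that built-in tokenizers report a real SQL NULL in that column.

```shell
#!/bin/sh
# Hypothetical helper: print a query that lists only plugins loaded from a
# dynamic library, that is, rows where LIBRARY is not NULL.
# The quoted heredoc delimiter ('SQL') stops the shell from expanding
# "$OB_PLUGINS" as a shell variable.
custom_plugins_sql() {
    cat <<'SQL'
SELECT SVR_IP, NAME, STATUS, LIBRARY
FROM oceanbase.GV$OB_PLUGINS
WHERE LIBRARY IS NOT NULL;
SQL
}
```

The output can be piped into whichever MySQL-compatible client you use to connect to the sys tenant, provided that client reads statements from standard input. Each OBServer node should report the custom plugin with a STATUS of READY.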

Step 4: Test the tokenizer plugin

Create a table and specify the example_ftparser tokenizer using the WITH PARSER clause.

```
CREATE TABLE t_example(
    c1 INT,
    c2 VARCHAR(200),
    c3 TEXT,
    FULLTEXT INDEX (c2, c3) WITH PARSER example_ftparser
);
```

Insert the test data into the table.

```
INSERT INTO t_example (c1, c2, c3) VALUES
    (1, 'Alice', 'Alice loves programming and enjoys long walks.'),
    (2, 'Bob', 'Bob is an avid reader and a coffee enthusiast.'),
    (3, 'Charlie', 'Charlie is a skilled musician who plays the guitar.'),
    (4, 'Diana', 'Diana is passionate about painting and arts.'),
    (5, 'Eve', 'Eve is a fitness coach and a healthy lifestyle advocate.');
```

Query records containing the keyword loves.

```
SELECT * FROM t_example WHERE MATCH(c2, c3) AGAINST ('loves') > 0;
```

The return result is as follows:

```
+------+-------+------------------------------------------------+
| c1   | c2    | c3                                             |
+------+-------+------------------------------------------------+
|    1 | Alice | Alice loves programming and enjoys long walks. |
+------+-------+------------------------------------------------+
1 row in set
```

Query records containing the keyword reader.

```
SELECT * FROM t_example WHERE MATCH(c2, c3) AGAINST ('reader') > 0;
```

The return result is as follows:

```
+------+------+------------------------------------------------+
| c1   | c2   | c3                                             |
+------+------+------------------------------------------------+
|    2 | Bob  | Bob is an avid reader and a coffee enthusiast. |
+------+------+------------------------------------------------+
1 row in set
```

Check the relevance score that each row receives for a multi-word query.

```
SELECT c1,
    MATCH (c2, c3) AGAINST ('he loves programming and reading') AS score,
    c2,
    c3
FROM t_example;
```

The return result is as follows:

```
+------+--------------------+---------+----------------------------------------------------------+
| c1   | score              | c2      | c3                                                       |
+------+--------------------+---------+----------------------------------------------------------+
|    1 |  2.665294094128556 | Alice   | Alice loves programming and enjoys long walks.           |
|    2 | 0.2849740932642488 | Bob     | Bob is an avid reader and a coffee enthusiast.           |
|    3 |                  0 | Charlie | Charlie is a skilled musician who plays the guitar.      |
|    4 | 0.2989130434782609 | Diana   | Diana is passionate about painting and arts.             |
|    5 | 0.2722772277227723 | Eve     | Eve is a fitness coach and a healthy lifestyle advocate. |
+------+--------------------+---------+----------------------------------------------------------+
5 rows in set
```
