This article explains how to install and use the tokenizer plugin in OceanBase Database. The tokenizer plugin can be used for full-text indexing and supports custom tokenization logic to meet specific business requirements.
Notice
- For OceanBase Database V4.3.5, the tokenizer plugin is supported starting from V4.3.5 BP1.
- The tokenizer plugin is currently experimental and is not recommended for use in production environments.
Prerequisites
Before installing and using the tokenizer plugin, ensure the following conditions are met:
- The operating system supports the yum package management tool.
  Note: If your system does not support yum, you can install the required dependencies through alternative methods, such as manually downloading RPM packages or using other package management tools.
- An OceanBase cluster is deployed and running properly.
- You have the privileges to modify configuration parameters and restart the cluster.
Procedure
Step 1: Install the development environment
Developing a tokenizer plugin requires a C/C++ compilation environment and the plugin development toolkit provided by OceanBase. Execute the following commands to install the C/C++ development environment and the OceanBase plugin development kit:
Install the basic compilation tools.
Install the essential tools and libraries required for compiling C/C++ programs.
yum install -y cmake make glibc-devel glibc-headers gcc gcc-c++
Configure the OceanBase software source.
Add the official OceanBase software source so that you can use yum to install OceanBase tools and dependencies later.
Install the yum-utils tool.
yum install -y yum-utils
Add the OceanBase software source.
yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
Install the plugin development toolkit.
yum install -y oceanbase-plugin-dev-kit
After installation, you can find the example code file for the OceanBase tokenizer plugin, space_ftparser.cpp, in the /usr/share/examples/ObPlugin/ftparser directory.
ls /usr/share/examples/ObPlugin/ftparser
The return result is as follows:
CMakeLists.txt space_ftparser.cpp
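Before moving on, it can help to confirm that the toolchain and the example sources are actually in place. The following is a minimal sketch, not part of the official kit: missing_tools is a hypothetical helper name, and the command names it checks (cmake, make, gcc, g++) are assumed to be the ones provided by the packages installed above.

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list
# that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report only; nothing is installed or changed.
missing=$(missing_tools cmake make gcc g++)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi

# Directory taken from the text above.
example_dir=/usr/share/examples/ObPlugin/ftparser
if [ -f "$example_dir/space_ftparser.cpp" ]; then
    echo "Example sources found in $example_dir"
else
    echo "Example sources not found in $example_dir"
fi
```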
Step 2: Obtain a development template
Copy the OceanBase tokenizer plugin example code, including space_ftparser.cpp, to your own development directory. The target directory must already exist before you run cp. You can then modify the space_ftparser.cpp file to develop your custom tokenizer plugin.
Here is an example:
[root@xxx packages]# cp /usr/share/examples/ObPlugin/ftparser/* /home/admin/test_plugin_dev
[root@xxx packages]# ls /home/admin/test_plugin_dev
The return result is as follows:
CMakeLists.txt space_ftparser.cpp
The core file of the example code is space_ftparser.cpp. You can modify this file according to your business requirements to implement custom tokenization logic.
Step 3: Compile and install the plugin
Modify the build configuration (CMakeLists.txt).
In the root directory of the sample code, you will find a CMakeLists.txt file. This file contains "TODO" markers indicating the sections that need to be modified:
- PLUGIN_NAME: the name of the current plugin, which will also be the name of the project and the generated dynamic library. Modify it to the desired name.
- SOURCES: the list of implementation files, which can include C or C++ source files. If you add new implementation files, include their paths here. Do not include header files in this list.
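To locate the markers quickly, a plain grep is usually enough. This sketch assumes only what the text states, namely that the markers contain the literal string TODO; list_todo_markers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every line of file $1 that contains "TODO",
# prefixed with its line number.
list_todo_markers() {
    grep -n 'TODO' "$1"
}

# Run from the directory that holds your copy of the example.
if [ -f CMakeLists.txt ]; then
    list_todo_markers CMakeLists.txt || echo "No TODO markers left."
fi
```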
Compile the code.
Follow these steps to compile the code:
Switch to the working directory.
cd /your/work/path/ftparser
Navigate to your tokenizer plugin development directory. Replace /your/work/path/ftparser with the actual path to your development directory.
Create a build directory.
mkdir -p build
Create a directory named build to store intermediate files and final outputs generated during the compilation process. The -p option ensures no error occurs if the directory already exists.
Enter the build directory.
cd build
Navigate to the build directory. Subsequent compilation steps will be performed here to keep the source directory clean.
Configure the build environment.
cmake ..
Run the cmake command to read the CMakeLists.txt file from the parent directory and generate the required Makefile. The .. indicates that the CMakeLists.txt file is located in the parent directory.
Compile the source code.
make
Run the make command to compile the source code based on the generated Makefile. This will create a dynamic library file (for example, libexample_ftparser.so).
After a successful compilation, the dynamic library file will be generated in the current directory (the build directory).
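The compile steps above can be collected into one function so that they are repeatable after every source change. This is a sketch under stated assumptions rather than an official script: build_plugin is a hypothetical name, and the function expects the directory that contains CMakeLists.txt as its only argument.

```shell
#!/bin/sh
# Hypothetical helper: out-of-source build of the plugin.
# $1 = directory that contains CMakeLists.txt.
build_plugin() {
    src_dir=$1
    if [ ! -f "$src_dir/CMakeLists.txt" ]; then
        echo "No CMakeLists.txt in $src_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (
        mkdir -p "$src_dir/build" &&
        cd "$src_dir/build" &&
        cmake .. &&
        make
    )
}
```

A call such as build_plugin /your/work/path/ftparser leaves the generated library in the build subdirectory. The function stops at the first failing step and returns that step's exit status.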
Copy the compiled output.
Once the compilation is complete, the dynamic library file (such as libexample_ftparser.so) will be located in the build directory. Copy this file to the plugin_dir directory on each OBServer node in the OceanBase cluster.
cp libexample_ftparser.so /path/to/plugin_dir/
Replace /path/to/plugin_dir/ with the actual path to the plugin_dir, which can be obtained by querying the system parameter plugin_dir.
Load the plugin.
Log in to OceanBase Database as the sys tenant and modify the configuration parameter plugins_load to load the plugin:
ALTER SYSTEM SET plugins_load='libexample_ftparser.so';
Restart the cluster to make the plugin take effect.
For an OceanBase cluster managed by OceanBase Deployer (obd), you can run the following command to restart the cluster.
obd cluster restart <cluster_name>
Replace <cluster_name> with the actual cluster name.
For a cluster managed by OceanBase Cloud Platform (OCP), you can directly restart the cluster in OCP.
View the installed tokenizer plugin.
SELECT * FROM oceanbase.GV$OB_PLUGINS;
The return result is as follows:
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
| SVR_IP    | SVR_PORT | NAME             | STATUS | TYPE     | LIBRARY                | LIBRARY_VERSION | LIBRARY_REVISION | INTERFACE_VERSION | AUTHOR                | LICENSE       | DESCRIPTION                                 |
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
| 127.0.0.1 |    55801 | ngram            | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a ngram fulltext parser plugin.     |
| 127.0.0.1 |    55801 | beng             | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a basic english parser plugin.      |
| 127.0.0.1 |    55801 | space            | READY  | FTPARSER | NULL                   | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PubL v2 | This is a default whitespace parser plugin. |
| 127.0.0.1 |    55801 | example_ftparser | READY  | FTPARSER | libexample_ftparser.so | 1.0.0           | NULL             | 0.1.0             | OceanBase Corporation | Mulan PSL v2  | This is an example ftparser.                |
+-----------+----------+------------------+--------+----------+------------------------+-----------------+------------------+-------------------+-----------------------+---------------+---------------------------------------------+
If LIBRARY is NULL, it indicates that the tokenizer is built in.
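On a multi-node cluster, the copy and verification steps are repeated for every node, so a small script can reduce mistakes. The sketch below rests on two assumptions that are not in the text: the nodes are reachable with scp under the placeholder names node1 and node2, and the query result is supplied as tab-separated rows (for example, the batch output of a MySQL-compatible client) with NAME in the third column and STATUS in the fourth, matching the column order shown above. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one scp command per node; nothing is executed.
# $1 = library file, $2 = remote plugin_dir, remaining arguments = node hosts.
print_copy_commands() {
    lib=$1
    plugin_dir=$2
    shift 2
    for host in "$@"; do
        printf 'scp %s %s:%s/\n' "$lib" "$host" "$plugin_dir"
    done
}

# Hypothetical helper: succeed only if plugin $1 appears in the rows read
# from stdin and every one of its rows has STATUS READY.
plugin_ready() {
    awk -F '\t' -v name="$1" '
        $3 == name { seen = 1; if ($4 != "READY") bad = 1 }
        END { exit !(seen && !bad) }
    '
}

# node1 and node2 are placeholders; replace them and the path with your own values.
print_copy_commands build/libexample_ftparser.so /path/to/plugin_dir node1 node2

# One sample row in the assumed tab-separated layout.
printf '127.0.0.1\t55801\texample_ftparser\tREADY\tFTPARSER\n' |
    plugin_ready example_ftparser && echo "example_ftparser is READY"
```

Printing the scp commands instead of running them keeps the sketch safe to try. Once the printed commands look right, they can be executed by piping the output to sh, provided the paths contain no spaces.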
Step 4: Test the tokenizer plugin
Create a table and specify the example_ftparser tokenizer using the WITH PARSER clause.
CREATE TABLE t_example(
  c1 INT,
  c2 VARCHAR(200),
  c3 TEXT,
  FULLTEXT INDEX (c2, c3) WITH PARSER example_ftparser
);
Insert the test data into the table.
INSERT INTO t_example (c1, c2, c3) VALUES
  (1, 'Alice', 'Alice loves programming and enjoys long walks.'),
  (2, 'Bob', 'Bob is an avid reader and a coffee enthusiast.'),
  (3, 'Charlie', 'Charlie is a skilled musician who plays the guitar.'),
  (4, 'Diana', 'Diana is passionate about painting and arts.'),
  (5, 'Eve', 'Eve is a fitness coach and a healthy lifestyle advocate.');
Query records containing the keyword loves.
SELECT * FROM t_example WHERE MATCH(c2, c3) AGAINST ('loves') > 0;
The return result is as follows:
+------+-------+------------------------------------------------+
| c1   | c2    | c3                                             |
+------+-------+------------------------------------------------+
|    1 | Alice | Alice loves programming and enjoys long walks. |
+------+-------+------------------------------------------------+
1 row in set
Query records containing the keyword reader.
SELECT * FROM t_example WHERE MATCH(c2, c3) AGAINST ('reader') > 0;
The return result is as follows:
+------+------+------------------------------------------------+
| c1   | c2   | c3                                             |
+------+------+------------------------------------------------+
|    2 | Bob  | Bob is an avid reader and a coffee enthusiast. |
+------+------+------------------------------------------------+
1 row in set
Test the tokenizer scores.
SELECT c1, MATCH (c2, c3) AGAINST ('he loves programming and reading') AS score, c2, c3 FROM t_example;
The return result is as follows:
+------+--------------------+---------+----------------------------------------------------------+
| c1   | score              | c2      | c3                                                       |
+------+--------------------+---------+----------------------------------------------------------+
|    1 |  2.665294094128556 | Alice   | Alice loves programming and enjoys long walks.           |
|    2 | 0.2849740932642488 | Bob     | Bob is an avid reader and a coffee enthusiast.           |
|    3 |                  0 | Charlie | Charlie is a skilled musician who plays the guitar.      |
|    4 | 0.2989130434782609 | Diana   | Diana is passionate about painting and arts.             |
|    5 | 0.2722772277227723 | Eve     | Eve is a fitness coach and a healthy lifestyle advocate. |
+------+--------------------+---------+----------------------------------------------------------+
5 rows in set