Efficiently retrieve large volumes of news data using full-text indexing

2026-04-02 06:23:56  Updated

Background

The full-text indexing feature of OceanBase Database effectively addresses various challenges in real-world applications, especially in scenarios such as system log analysis and user behavior and profile analysis. This feature enables efficient data filtering and screening, as well as high-quality relevance assessment. Additionally, by combining sparse and dense vectors in a multi-stage retrieval architecture, OceanBase Database achieves more efficient retrieval performance in RAG systems within specific knowledge domains.

This tutorial uses a news and information business scenario as an example. In this scenario, the retrieval system faces three core challenges:

  • Real-time requirements: Quickly locate target information from TB-level data.
  • Semantic complexity: Address challenges in synonym handling and phrase recognition in natural language processing.
  • Hybrid query demands: Enhance the joint optimization of text retrieval and structured queries.

This tutorial demonstrates how to use the full-text indexing feature to quickly find target news in a large amount of information. We will showcase the improvements in OceanBase Database's full-text indexing feature in terms of functionality, performance, and ease of use through keyword queries.

Overview

In the OceanBase storage engine, user documents (doc) and queries (query) are split into multiple keywords (word/token) using a tokenizer. These keywords, along with the statistical features of the documents, are stored in internal auxiliary tables (tablet) to facilitate relevance assessment (ranking) during the information retrieval phase. OceanBase employs the advanced BM25 algorithm, which more effectively calculates the relevance scores between keywords in user queries and stored documents, ultimately outputting the documents that meet the criteria along with their scores.

Prerequisites

To operate and experience the full-text indexing feature of OceanBase Database, make sure that the following conditions are met:

  1. Environment requirements: You have deployed an OceanBase cluster of V4.3.5 BP1 or later and created a MySQL-compatible tenant. For more information about how to deploy an OceanBase cluster, see Overview. After the deployment is complete, run the following SQL statements to verify the cluster and tenant information:

    -- Verify the cluster information.
    SELECT * FROM GV$OB_SERVERS;
    
    -- Verify the tenant information.
    SELECT * FROM oceanbase.DBA_OB_TENANTS;
    
  2. Privilege settings: The MySQL-compatible tenant that you created must have the privileges to insert and query data. For more information about how to set privileges, see Directly grant privileges.

  3. Database creation: Make sure that you have created a database. For more information about how to create a database, see Create a database.

Procedure

This topic describes how to use full-text indexes and views in OceanBase Database and how to query data.

Step 1: Import the dataset

OceanBase Database is pre-installed with built-in tokenizers and the boolean mode, which is more efficient than traditional natural language processing. We will use the Sports news dataset to import data into OceanBase Database. This dataset contains sports news articles from sources such as BBC Sport, covering football, cricket, rugby, tennis, and athletics. The CSV file should have three columns: event (sport category), date (publication date), and news (article content). You can prepare your own CSV from the BBC Sport dataset (737 articles in five categories). We will create a table named sport_data_whole with three columns of variable-length characters (event, date, and news). We will also create a full-text index on the news column using the space tokenizer, which is well-suited for English text.

Note

The space tokenizer splits text by whitespace, which works effectively for English.
OceanBase Database also includes other built-in tokenizers, such as the IK tokenizer for Chinese (with smart and max_word modes), beng for basic English, and ngram for splitting by character length.

-- Create a table and create a full-text index using the space tokenizer for English
CREATE TABLE sport_data_whole (
    event VARCHAR(64),
    date VARCHAR(16),
    news VARCHAR(65535),
    FULLTEXT INDEX ft_idx_news (news) WITH PARSER space
);

Import the news dataset into the table using the local file method. This process is expected to take about 30 seconds.

-- Import data
LOAD DATA /*+ PARALLEL(8) */ LOCAL INFILE '/home/sports_news_en.csv' INTO TABLE sport_data_whole
FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';

After importing the data, the table contains approximately 700 news articles, with an average length of about 350 words per article. The original data size is about 2 MB. After efficient compression, the total storage space required is less than 5 MB. This is mainly due to the large amount of data stored in the inverted and forward auxiliary tables of the full-text index, which contain a significant number of tokenization records.

-- Verify the number of imported records
SELECT AVG(LENGTH(news)), COUNT(*) FROM sport_data_whole;

The result is as follows:

+-------------------+----------+
| avg(length(news)) | count(*) |
+-------------------+----------+
|         1850.2400 |      737 |
+-------------------+----------+
1 row in set
-- Query the view to verify the result
SELECT * FROM oceanbase.DBA_OB_TABLE_SPACE_USAGE;

The result is as follows:

+----------+---------------+--------------------------------+-------------+---------------+
| TABLE_ID | DATABASE_NAME | TABLE_NAME                     | OCCUPY_SIZE | REQUIRED_SIZE |
+----------+---------------+--------------------------------+-------------+---------------+
|   500035 | test          | sport_data_whole               |      920000 |       1048576 |
|   500036 | test          | __idx_500035_news              |     1500000 |       2097152 |
|   500037 | test          | __idx_500035_fts_rowkey_doc    |       21058 |         28672 |
|   500038 | test          | __idx_500035_fts_doc_rowkey    |       23236 |         28672 |
|   500039 | test          | __idx_500035_news_fts_doc_word |     1600000 |       2097152 |
+----------+---------------+--------------------------------+-------------+---------------+

Step 2: Query using full-text index

With the stored news dataset and index, we can perform multi-condition combination or highly filtered searches. For example, as a fan, if you want to search for news containing both "Manchester" and "goal," you can use the Boolean mode.

Compared to string LIKE matching without an index, the Boolean mode syntax is more concise and queries faster.

-- Query using Boolean mode to find news containing both "Manchester" and "goal"
SELECT COUNT(*) FROM sport_data_whole
WHERE MATCH (news) AGAINST ('+Manchester +goal' IN BOOLEAN MODE);

The result is:

+----------+
| count(*) |
+----------+
|       12 |
+----------+
1 row in set (0.03 sec)

In contrast, using the LIKE syntax:

-- Query using LIKE syntax
SELECT COUNT(*) FROM sport_data_whole
WHERE news LIKE '%Manchester%' AND news LIKE '%goal%';

The result is also:

+----------+
| count(*) |
+----------+
|       12 |
+----------+
1 row in set (0.08 sec)

We can further rank the returned news articles based on the scores in the output to determine which one is more relevant to the query.

-- Return the event, date, and score of the news to help determine relevance
SELECT event, date, MATCH (news) AGAINST ('Manchester goal') AS score
FROM sport_data_whole
WHERE MATCH (news) AGAINST ('+Manchester +goal' IN BOOLEAN MODE);

The result is:

+---------+------------+---------------------+
| event   | date       | score               |
+---------+------------+---------------------+
| football| 2004-10-15 |  0.4657063867776557 |
| football| 2005-03-22 | 0.41760566608994765 |
| ...     | ...        | ...                 |
+---------+------------+---------------------+
12 rows in set

Additionally, Boolean mode allows us to exclude certain keywords. For example, fouls and cards are often mentioned in football match reports. If you want to find intense matches without mentions of yellow cards, red cards, or fouls, you can use the - operator in Boolean mode.

-- Query for intense matches without yellow cards, red cards, or fouls
SELECT COUNT(*) FROM sport_data_whole
WHERE MATCH (news) AGAINST ('+intense -yellow -red -foul' IN BOOLEAN MODE);

The result is:

+----------+
| count(*) |
+----------+
|       18 |
+----------+
1 row in set

Step 3: Tune

Tune the TOKENIZE function

When the query results of a full-text index are not as expected, it is usually because the tokenization results are not ideal. OceanBase Database provides a quick TOKENIZE function to help you test the tokenization effects. This function supports all tokenizers and their corresponding attributes. You can use the TOKENIZE function to verify the tokenization effects of tokenizers.

For example, when searching for compound or hyphenated terms such as "hat-trick" or "clean sheet," the tokenization result affects whether your query will match the intended documents. The space tokenizer splits text by whitespace; hyphenated terms may be split into separate tokens.

  1. Use the TOKENIZE function to verify the tokenization effects of tokenizers:

    -- Verify the tokenization effects of English sports news using the space tokenizer
    SELECT TOKENIZE('Wayne Rooney scored a brilliant hat-trick in the Premier League match', 'space', '[{"output": "default"}]');
    

    The returned result is as follows:

    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    | TOKENIZE('Wayne Rooney scored a brilliant hat-trick in the Premier League match', 'space', '[{"output": "default"}]')                                  |
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    | ["Wayne", "Rooney", "scored", "a", "brilliant", "hat-trick", "in", "the", "Premier", "League", "match"]                                                |
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set
    

    The above result shows that "hat-trick" is kept as a single token. If your tokenizer splits it into "hat" and "trick," you would need to search using +hat +trick in Boolean mode to find relevant documents.

  2. Execute the following statement to query whether the target documents are hit:

    -- Boolean mode search for news about specific players and achievements
    SELECT COUNT(*)
    FROM sport_data_whole
    WHERE MATCH (news) AGAINST ('+Rooney +hat-trick' IN BOOLEAN MODE);
    

    If the tokenizer treats "hat-trick" as one token and your dataset contains such articles, the query returns matching records. If "hat-trick" is split into two tokens, use +Rooney +hat +trick instead.

  3. When tokenization does not meet your needs (for example, for custom vocabularies or language-specific rules), you can update the system dictionary. After updating the system dictionary, you need to refresh the cache.

    Note

    To update the system dictionary, we recommend that you contact OceanBase Technical Support.

  4. Rebuild the full-text index after dictionary changes:

    -- Rebuild the full-text index to apply the new dictionary
    ALTER TABLE sport_data_whole
    DROP INDEX ft_idx_news,
    ADD FULLTEXT INDEX ft_idx_news (news)
    WITH PARSER space;
    
  5. Verify the optimization effect by running your search again. Using TOKENIZE to inspect tokenization helps you choose the correct Boolean mode query syntax (for example, +hat +trick versus +hat-trick) and improves retrieval accuracy.

Performance comparison between OceanBase Database and MySQL

To compare the full-text search performance of OceanBase Database with that of MySQL, we use the full-text search feature of MySQL as a reference. For a fair comparison, we selected the wikir1k dataset, which contains 369,721 rows with an average of 100 words per row, for the performance comparison.

The following table shows the comparison results under different scenarios in the natural language mode and the boolean mode. As shown in the table, OceanBase Database outperforms MySQL significantly in scenarios that require extensive text segmentation or return a large number of results. For small result sets, since the computational load is relatively small, the advantages of the query engine are not significant, and the performance of both engines is close.

Test environment: OceanBase Database tenant specifications are 8 cores and 16 GB of memory. MySQL version is 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL).

Natural language queries

-- q1: Find documents containing "and"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and');

-- q2: Find documents containing "and", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('and') LIMIT 10;

-- q3: Find documents containing "librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists');

-- q4: Find documents containing "librettists", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('librettists') LIMIT 10;

-- q5: Find documents containing "alleviating librettists"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists');

-- q6: Find documents containing "black spotted white yellow"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow');

-- q7: Find documents containing "black spotted white yellow", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('black spotted white yellow') LIMIT 10;

-- q8: Find documents containing "between up and down"
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down');

-- q9: Find documents containing "between up and down", limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('between up and down') LIMIT 10;

-- q10: Find long documents
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire');

-- q11: Find long documents, with "and" added
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and');

-- q12: Find long documents, limit to 10 results
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('alleviating librettists modifications retelling intangible hydrographic administratively berwickshire strathaven dumfriesshire lesmahagow transhumanist musselburgh prestwick cardiganshire montgomeryshire and') LIMIT 10;
Scenario OceanBase MySQL
q1 single token, high frequency words 3820458us 5718430us
q2 single token, high frequency words with limit 231861us 503772us
q3 single token, low frequency words 879us 672us
q4 single token, low frequency words with limit 720us 700us
q5 multiple tokens, small result set 1591us 1100us
q6 multiple tokens, medium result set 259700us 602221us
q7 multiple tokens, medium result set with limit 25502us 42620us
q8 multiple tokens, large result set 3842391us 6846847us
q9 multiple tokens, large result set with limit 301362us 784024us
q10 many tokens, small result set 22143us 10161us
q11 many tokens, large result set 3905829us 5929343us
q12 many tokens, large result set with limit 345968us 769970us

Boolean mode

-- q1: +high-frequency words -medium-frequency words
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -which -his' IN BOOLEAN MODE);

-- q2: +high-frequency words -low-frequency words
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+which  (+and -his)' IN BOOLEAN MODE);

-- q3: +medium-frequency words (+high-frequency words -medium-frequency words)
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and -carabantes -bufera' IN BOOLEAN MODE);

-- q4: +high-frequency words +low-frequency words
SELECT * FROM wikir1k WHERE MATCH (document) AGAINST ('+and +librettists'  IN BOOLEAN MODE);
Scenario OceanBase MySQL
q1: +high-frequency words -medium-frequency words 1586657us 2440798us
q2: +high-frequency words -low-frequency words 3726508us 7974832us
q3: +medium-frequency words (+high-frequency words -medium-frequency words) 3080644us 5612041us
q4: +high-frequency words +low-frequency words 230284us 357580us

Performance comparison summary

From the above data comparison, it can be seen that OceanBase outperforms MySQL in complex full-text search scenarios, whether in natural language mode or boolean mode. Especially in queries that require extensive tokenization or return a large number of results, OceanBase's advantages are even more evident. This provides developers and data analysts with a strong reference for choosing a database, particularly in applications that require efficient retrieval of massive data. OceanBase clearly demonstrates its powerful performance and flexible query capabilities in such scenarios.

OceanBase's full-text index can provide fast response times in complex queries, making it more suitable for real-world applications that require high concurrency and high-performance retrieval.

Contact Us