This topic describes how to use semantic indexes in OceanBase Database.
Note
This feature is an experimental feature in the current version. We recommend that you do not use it in a production environment.
Overview
The semantic index leverages the built-in embedding capabilities of OceanBase Database to greatly simplify the process of using vector indexes. It makes the concept of vectors transparent to users: you can directly write the raw data you want to store (such as text), and OceanBase Database will automatically convert it into vectors and build the index internally. During searches, you only need to provide the raw search content, and OceanBase Database will automatically perform the embedding and search the vector index, significantly enhancing usability.
Considering the performance overhead of embedding models, the semantic index offers both synchronous and asynchronous embedding options:
- Synchronous mode: Embedding and indexing occur immediately after data is written, ensuring real-time visibility.
- Asynchronous mode: Data embedding and indexing are performed in batches by background tasks, significantly improving write performance, though with delayed data visibility. You can flexibly set the trigger cycle for background tasks based on your requirements for data visibility.
Additionally, tables with semantic indexes can also perform brute-force searches. Brute-force search refers to searching by scanning the entire table to obtain the exact results of the top n closest rows.
Supported features
Notice
This feature is only supported for HNSW, HNSW_BQ, and HNSW_SQ indexes in the current version.
The syntax for updating, deleting, and searching using semantic indexes, as well as memory management, is identical to that of HNSW series indexes. Index monitoring and maintenance processes are supported, and in asynchronous mode, incremental refreshes will also trigger data embedding.
The supported features are as follows:
| Module | Feature | Description |
|---|---|---|
| DDL | Create a semantic index when creating a table | You can create a semantic index on a VARCHAR column when creating a table. |
| DDL | Create a semantic index after table creation | You can create a semantic index on a VARCHAR column in an existing table. |
| Search | semantic_distance function |
Use this function to perform vector search by providing the raw data. |
| Search | semantic_vector_distance function |
Use this function to perform vector search by providing the vector. There are two usage scenarios:
|
| DBMS_VECTOR | REFRESH_INDEX |
Use this procedure in the same way as for regular vector indexes to perform incremental refreshes and asynchronous embedding. |
| DBMS_VECTOR | REBUILD_INDEX |
Use this procedure in the same way as for regular vector indexes to perform full index rebuilds. |
Some usage considerations are as follows:
- In synchronous mode, write performance may be affected by embedding performance. In asynchronous mode, data visibility is delayed.
- For scenarios with repeated searches, it is recommended to use the AI Function Service to pre-fetch the query vector to avoid embedding during each search.
Prerequisites
Register the embedding model and endpoint
Before using the semantic index, you must register the embedding model and endpoint. Here is an example:
CALL DBMS_AI_SERVICE.DROP_AI_MODEL ('ob_embed');
CALL DBMS_AI_SERVICE.DROP_AI_MODEL_ENDPOINT ('ob_embed_endpoint');
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL(
'ob_embed', '{
"type": "dense_embedding",
"model_name": "BAAI/bge-m3"
}');
CALL DBMS_AI_SERVICE.CREATE_AI_MODEL_ENDPOINT (
'ob_embed_endpoint', '{
"ai_model_name": "ob_embed",
"url": "https://api.siliconflow.cn/v1/embeddings",
"access_key": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxx",
"provider": "siliconflow"
}');
Note
Please replace access_key with your actual API Key. The vector dimension of the BAAI/bge-m3 model is 1024, so you must use dim=1024 when creating a semantic index.
Manually enable the semantic index
You can enable the semantic index feature by setting the tenant-level parameter _enable_semantic_index to ON. By default, this parameter is OFF:
ALTER SYSTEM SET _enable_semantic_index = true;
Syntax and description
Creation
Semantic indexes support creation at table creation time and post-creation. When creating an index, note the following:
- You must specify the
VARCHARcolumn to create an index. - The
modelandsync_modeparameters are not supported for regular vector indexes. - The parameters and descriptions for post-creation are the same as those for creation at table creation time.
You can use the CREATE TABLE statement to create a semantic index. You can specify the index parameters to initiate a background task in synchronous or asynchronous mode. In synchronous mode, the VARCHAR data is automatically converted to vector data when data is inserted. In asynchronous mode, the data conversion is completed periodically or manually.
Syntax:
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 VARCHAR, -- Text column
...,
VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
You can create a semantic index on the VARCHAR column of an existing table. When creating an index, you can specify the index parameters to initiate a background task in synchronous or asynchronous mode. In synchronous mode, all existing VARCHAR data is embedded into vectors. In asynchronous mode, the embedding is completed periodically or manually.
Syntax:
CREATE VECTOR INDEX index_name
ON table_name(varchar_column_name)
WITH (param1=value1, param2=value2, ...);
The param parameters are described as follows:
| Parameter | Default value | Value range | Required | Description | Notes |
|---|---|---|---|---|---|
distance |
l2/inner_product/cosine |
Yes | Specifies the vector distance algorithm type. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. |
|
type |
hnsw / hnsw_bq / hnsw_sq |
Yes | Specifies the index algorithm type. | ||
lib |
vsag |
vsag |
No | Specifies the vector index library type. | Only the VSAG vector library is supported. |
model |
The name of a registered model | Yes | Specifies the name of the large language model used for embedding. | You must register the model with the AI function service before creating an index.
NoteThis parameter is not supported for regular vector indexes. |
|
dim |
A positive integer, up to 4096 | Yes | Specifies the dimension of the vector after embedding. | The value must match the dimension provided by the model. | |
sync_mode |
async |
immediate/manual/async |
No | Specifies the synchronization mode between data and the index. | immediate indicates the synchronous mode, manual indicates the manual mode, and async indicates the asynchronous mode.
NoteThis parameter is not supported for regular vector indexes. |
sync_interval |
10s |
A time interval, such as 10s, 1h, or 1d |
No | Specifies the trigger cycle of the background task in asynchronous mode. | The value must be a positive number. The unit can be seconds (s), hours (h), or days (d). |
Other vector index parameters (such as m, ef_construction, and ef_search) are used in the same way as in regular HNSW/HNSW_BQ indexes. For more information, see the HNSW series index documentation in the related topics at the end of this document.
Search
Semantic indexes support two search methods:
- Text search
- Vector search
- You can specify the
APPROXIMATE/APPROXclause to perform nearest neighbor search using the vector index. - You can also omit the
APPROXIMATE/APPROXclause to perform brute-force search using full table scan.
- You can specify the
Use the semantic_distance expression to perform vector search by passing in the original text.
Syntax:
SELECT ... FROM table_name
ORDER BY semantic_distance(column_name, 'query_text') [APPROXIMATE|APPROX]
LIMIT n;
Where:
column_name: the text column specified when the semantic index was created.query_text: the original text to search for.n: the number of rows to return.
With the
APPROXIMATEclause: Use thesemantic_vector_distanceexpression to perform vector search by passing in the vector. If the search statement contains theAPPROXIMATE/APPROXclause, the vector index is used for search. Syntax:SELECT ... FROM table_name ORDER BY semantic_vector_distance(column_name, 'query_vector') [APPROXIMATE|APPROX] LIMIT n;Where:
column_name: the text column specified when the semantic index was created.query_vector: the query vector.n: the number of rows to return.
Without the
APPROXIMATEclause: Use thesemantic_vector_distanceexpression to perform vector search by passing in the vector. If the search statement does not contain theAPPROXIMATE/APPROXclause, a brute-force search is performed by scanning the entire table to obtain the exact results of the top n rows with the smallest distances. During the search, thedistancetype is obtained from the table schema, and a full table scan is performed. The vector distance is calculated for each row to ensure the exact results. Syntax:SELECT ... FROM table_name ORDER BY semantic_vector_distance(column_name, 'query_vector') LIMIT n;The parameters are the same as those with the
APPROXIMATEclause.
For more information about the APPROXIMATE/APPROX clause, see the HNSW series index documentation in the related topics at the end of this document.
Create, update, search, and delete examples
The DML operations (INSERT, UPDATE, and DELETE) of a semantic index are consistent with those of a regular vector index. When you insert or update VARCHAR data, the system synchronously or asynchronously embeds the data into vectors based on the sync_mode parameter.
Create an index when you create a table
Create the vector_idx index when you create the items table:
-- Assume that the ob_embed model has been created (refer to the "Prerequisites" section for registration).
CREATE TABLE items (
id BIGINT PRIMARY KEY,
doc VARCHAR(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s)
);
Insert data into the items table. The system automatically embeds the vector:
INSERT INTO items(id, doc) VALUES(1, 'Rose');
Create an index after you create a table
Create the items table first, and then use the CREATE VECTOR INDEX statement to create the vector_idx index:
CREATE TABLE items (
id BIGINT PRIMARY KEY,
doc VARCHAR(100)
);
-- Assume that the ob_embed model has been created (refer to the "Prerequisites" section for registration).
CREATE VECTOR INDEX vector_idx
ON items (doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=async, sync_interval=10s);
Insert data into the items table. The system automatically embeds the vector:
INSERT INTO items(id, doc) VALUES(1, 'Rose');
Update
When you update VARCHAR data, the system re-embeds the data:
- In synchronous mode: The system re-embeds the data immediately after the update.
- In asynchronous mode: The system re-embeds the data in the background task triggered in the next cycle after the update.
Example:
UPDATE items SET doc = 'Lily' WHERE id = 1;
Delete
The delete operation is consistent with that of a regular vector index. You can directly delete data.
Example:
DELETE FROM items WHERE id = 1;
Search
-- Assume that the ob_emb model has been created.
CREATE TABLE items (
id INT PRIMARY KEY,
doc varchar(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);
INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Sunflower');
INSERT INTO items(id, doc) VALUES(3, 'Rose');
INSERT INTO items(id, doc) VALUES(4, 'Sunflower');
INSERT INTO items(id, doc) VALUES(5, 'Rose');
-- Search by using raw text.
SELECT id, doc FROM items
ORDER BY semantic_distance(doc, 'Sunflower')
APPROXIMATE LIMIT 3;
The returned result is as follows:
+----+-----------+
| id | doc |
+----+-----------+
| 2 | Sunflower |
| 4 | Sunflower |
| 5 | Rose |
+----+-----------+
3 rows in set
Use the semantic_vector_distance expression to search by using a vector. If the APPROXIMATE or APPROX clause is specified in the search statement, the system uses the vector index for the search.
-- Assume that the ob_emb model has been created (refer to the "Prerequisites" section for registration).
CREATE TABLE items (
id INT PRIMARY KEY,
doc varchar(100),
VECTOR INDEX vector_idx(doc)
WITH (distance=l2, lib=vsag, type=hnsw, model=ob_embed, dim=1024, sync_mode=immediate)
);
INSERT INTO items(id, doc) VALUES(1, 'Rose');
INSERT INTO items(id, doc) VALUES(2, 'Lily');
INSERT INTO items(id, doc) VALUES(3, 'Sunflower');
INSERT INTO items(id, doc) VALUES(4, 'Rose');
-- Obtain the query vector
SET @query_vector = AI_EMBED('ob_embed', 'Sunflower');
-- Search by using the vector.
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
APPROXIMATE LIMIT 3;
The returned result is as follows:
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 1 | Rose |
| 4 | Rose |
+----+-----------+
3 rows in set
-- Search by using the vector for exact results.
SELECT id, doc FROM items
ORDER BY semantic_vector_distance(doc, @query_vector)
LIMIT 3;
The returned result is as follows:
+----+-----------+
| id | doc |
+----+-----------+
| 3 | Sunflower |
| 4 | Rose |
| 1 | Rose |
+----+-----------+
3 rows in set
