This topic describes how to create, query, maintain, and drop a vector index in OceanBase Database.
Specifics of OceanBase vector indexes are as follows:
- HNSW indexes are supported. The maximum dimension of an indexed column is 2000. HNSW indexes are memory-based indexes. They must be fully loaded into memory. They support DML operations and real-time queries.
- L2 and inner product (IP) are supported as the index distance algorithm.
Prepare
Before you use vector indexes, you must enable the vector feature by setting the ob_vector_memory_limit_percentage parameter.
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
Overview
Vector index queries are approximate nearest neighbor queries, and they cannot guarantee 100% accuracy. The query accuracy is measured by recall. For example, if a query can return 9 correct results out of 10 nearest neighbors, the recall is 90%. The following table describes recall.
Recall is affected by the construction parameters and query parameters.
Query parameters are specified when you create an index and cannot be modified. However, you can specify the session variable
ob_hnsw_ef_searchto use a custom value.
The syntax for vector index queries is as follows:
SELECT ... FROM $table_name ORDER BY $distance_function($column_name, $vector_expr) [APPROXIMATE|APPROX] LIMIT $num [OFFSET $num];
The following table describes the syntax.
Query syntax requirements:
- You must specify the
APPROXIMATEorAPPROXkeyword to choose the vector index path. - You must include the
ORDER BYandLIMITclauses. - The
ORDER BYclause supports only a single vector condition. - The value of
LIMIT + OFFSETmust be greater than 0 and no more than 16,384.
- You must specify the
Distance function usage rules:
- If you do not specify the
APPROXIMATEorAPPROXkeyword, the query plan will not choose the vector index path when you callinner_product,negative_inner_product, orL2. However, the system does not return an error in these cases. - If you specify the
APPROXIMATEorAPPROXkeyword, the query plan will not choose the vector index path when the distance function does not match the vector index algorithm. However, the system does not return an error in this case. - If you specify the
APPROXIMATEorAPPROXkeyword and the expression isinner_product, the query plan will not choose the vector index path. - If you specify the
APPROXIMATEorAPPROXkeyword and the expression isnegative_inner_productorL2, the distance function must match the vector index algorithm. In this case, the query plan will choose the vector index path.
- If you do not specify the
Other descriptions:
- The
WHEREcondition will be used as a filter condition after the vector index query. - If you do not specify the
LIMITclause, the system returns an error.
- The
Examples
Considerations
OceanBase supports creating vector indexes when you create a table and after you create a table. When you create a vector index, take note of the following considerations:
You cannot modify a vector index after it is created.
You must specify the VECTOR keyword when you create a vector index.
The parameters and descriptions of a post-created index are the same as those when you create an index.
If you have a large amount of data, we recommend that you write the data to the database first and then create an index to ensure optimal query performance.
Create an index when you create a table
Create a test table.
CREATE TABLE t1(c1 INT, c0 INT, c2 VECTOR(10), c3 VECTOR(10), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw, lib=vsag), VECTOR INDEX idx2(c3) WITH (distance=l2, type=hnsw, lib=vsag));
Write test data.
INSERT INTO t1 VALUES(1, 1,'[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]', '[0.203846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
INSERT INTO t1 VALUES(2, 2, '[0.735541,0.670776,0.903237,0.447223,0.232028,0.659316,0.765661,0.226980,0.579658,0.933939]', '[0.213846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
INSERT INTO t1 VALUES(3, 3, '[0.327936,0.048756,0.084670,0.389642,0.970982,0.370915,0.181664,0.940780,0.013905,0.628127]', '[0.223846,0.205289,0.880265,0.824340,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833]');
Perform an approximate nearest neighbor query.
SELECT * FROM t1 ORDER BY l2_distance(c2, [0.712338,0.603321,0.133444,0.428146,0.876387,0.763293,0.408760,0.765300,0.560072,0.900498]) APPROXIMATE LIMIT 1;
The return result is as follows:
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| c1 | c0 | c2 | c3 |
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
| 3 | 3 | [0.327936,0.048756,0.08467,0.389642,0.970982,0.370915,0.181664,0.94078,0.013905,0.628127] | [0.223846,0.205289,0.880265,0.82434,0.615737,0.496899,0.983632,0.865571,0.248373,0.542833] |
+----+------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
1 row in set
The following table describes the parameters of the index.
| Parameter | Default value | Value range | Required | Description | Remarks |
|---|---|---|---|---|---|
| distance | l2/inner_product | Yes | The type of the vector distance function. | l2 indicates the Euclidean distance, and inner_product indicates the inner product distance. | |
| type | hnsw | Yes | The type of the index. | At present, only HNSW indexes are supported. | |
| lib | vsag | vsag | No | The type of the vector index library. | At present, only the VSAG vector library is supported. |
| m | 16 | [5,64] | No | The maximum number of neighbors for each node in HNSW. | The value indicates the tradeoff between index construction time and query performance. A larger value indicates slower index construction but better query performance. |
| ef_construction | 200 | [5,1000] | No | The candidate set size during HNSW index construction. | The value indicates the tradeoff between index construction time and index quality. A larger value indicates slower index construction but higher index quality. ef_construction must be greater than m. |
| ef_search | 64 | [1,1000] | No | The candidate set size during HNSW query. | The value indicates the tradeoff between query speed and recall. A larger value indicates slower queries but higher recall. |
Create an index after you create a table
Create a test table.
CREATE TABLE vec_table (id int, c2 vector(10));
Create an index.
CREATE VECTOR INDEX vec_idx1 ON vec_table(c2) WITH (distance=l2, type=hnsw);
View the schema of the test table.
SHOW CREATE TABLE vec_table;
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vec_table | CREATE TABLE `vec_table` (
`id` int(11) DEFAULT NULL,
`c2` VECTOR(10) DEFAULT NULL,
VECTOR KEY `vec_idx1` (`c2`) WITH (DISTANCE=L2, TYPE=HNSW, LIB=VSAG, M=16, EF_CONSTRUCTION=200, EF_SEARCH=64) BLOCK_SIZE 16384
) DEFAULT CHARSET = utf8mb4 ROW_FORMAT = DYNAMIC COMPRESSION = 'zstd_1.3.8' REPLICA_NUM = 2 BLOCK_SIZE = 16384 USE_BLOOM_FILTER = FALSE TABLET_SIZE = 134217728 PCTFREE = 0 |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set
View the index.
SHOW INDEX FROM vec_table;
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
| vec_table | 1 | vec_idx1 | 1 | c2 | A | NULL | NULL | NULL | YES | VECTOR | available | | YES | NULL |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+-----------+---------------+---------+------------+
1 row in set
Drop an index
Drop an index.
DROP INDEX vec_idx1 ON vec_table;
View the index.
SHOW INDEX FROM vec_table;
The return result is as follows:
Empty set
Index maintenance
When too much incremental data is generated, the query performance decreases. To reduce the volume of incremental data tables, OceanBase Database introduces the DBMS_VECTOR package for vector index maintenance.
Incremental refresh
If a large amount of data is written after the index is created, we recommend that you use incremental refresh.
The syntax is as follows:
PROCEDURE refresh_index(
IN idx_name VARCHAR(65535), ---- The index name.
IN table_name VARCHAR(65535), ---- The table name.
IN idx_vector_col VARCHAR(65535) DEFAULT NULL, ---- The vector column name.
IN refresh_threshold INT DEFAULT 10000, ---- The threshold of incremental data.
IN refresh_type VARCHAR(65535) DEFAULT NULL); ---- The refresh type. Valid value: FAST.
Incremental refresh is performed every 15 minutes. If more than 10,000 incremental data items are generated, incremental refresh is automatically triggered.
Full refresh
If a large amount of data is updated or deleted after the index is created, we recommend that you use full refresh.
The syntax is as follows:
PROCEDURE rebuild_index(
IN idx_name VARCHAR(65535), ---- The index name.
IN table_name VARCHAR(65535), ---- The table name.
IN idx_vector_col VARCHAR(65535) DEFAULT NULL, ---- The vector column name.
IN delta_rate_threshold FLOAT DEFAULT 0.2, ---- The threshold of incremental data.
IN idx_organization VARCHAR(65535) DEFAULT NULL, ---- The index type.
IN idx_distance_metrics VARCHAR(65535) DEFAULT 'EUCLIDEAN',---- The distance type.
IN idx_parameters LONGTEXT DEFAULT NULL, ---- The index parameters.
IN idx_parallel_creation INT DEFAULT 1); ---- The degree of parallelism for parallel index creation. Reserved. Supported only in syntax.
Full refresh is performed every 24 hours. If the amount of new data exceeds 20% of the original data amount, full refresh is automatically triggered. Full refresh is performed asynchronously in the background. It creates a new index and then replaces the old index. During the rebuild, the old index remains available, but the overall process is relatively slow.
Examples of index maintenance
Create a test table.
CREATE TABLE vector_index_test(c1 INT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx1(c2) WITH (distance=l2, type=hnsw, lib=vsag));
Write test data.
INSERT INTO vector_index_test VALUES(1, '[0.203846,0.205289,0.880265]');
INSERT INTO vector_index_test VALUES(2, '[0.484526,0.669954,0.986755]');
INSERT INTO vector_index_test VALUES(3, '[0.327936,0.048756,0.084670]');
INSERT INTO vector_index_test VALUES(4, '[0.148869,0.878546,0.028024]');
INSERT INTO vector_index_test VALUES(5, '[0.334970,0.857377,0.886132]');
INSERT INTO vector_index_test VALUES(6, '[0.117582,0.302352,0.471198]');
INSERT INTO vector_index_test VALUES(7, '[0.551185,0.231134,0.075354]');
INSERT INTO vector_index_test VALUES(8, '[0.185221,0.315131,0.558301]');
INSERT INTO vector_index_test VALUES(9, '[0.928764,0.254038,0.272721]');
Trigger an update.
-- Trigger incremental update.
call dbms_vector.refresh_index('idx1', 'vector_index_test', 'c2', 1, 'FAST');
-- Trigger full update.
call dbms_vector.rebuild_index('idx1','t1','c2');