This topic describes how to create, search, and use in-memory sparse indexes in OceanBase Database. An in-memory sparse index is a type of index that is stored in memory. OceanBase Database supports two types of in-memory sparse indexes: SINDI and SINDI_SQ.
Note
This feature is an experimental feature in the current version and is not recommended for production environments.
An in-memory sparse index is an efficient index type provided by OceanBase Database for sparse vectors (vectors with most elements as zero). It requires that the entire index be loaded into memory and supports DML operations and real-time searches. To enhance the query performance of sparse vectors, OceanBase Database integrates the SINDI index, which is based on the VSAG algorithm library. The SINDI index outperforms disk-based sparse vector indexes in terms of performance and is suitable for use when sufficient memory resources are available.
Supported features
The memory sparse vector index supports the following features:
| Module | Feature | Description |
|---|---|---|
| DDL | Create a sparse vector index when you create a table | You can create a sparse vector index on a SPARSEVECTOR column when you create a table. The maximum dimension is 500,000. |
| DDL | Create a sparse vector index after the table is created | You can create a sparse vector index on a SPARSEVECTOR column in an existing table. The maximum dimension is 500,000. |
| DML | Insert, update, and delete | The DML syntax for sparse vector indexes is the same as that for general vector indexes. |
| Search | Vector search | You can use SQL functions for vector search. |
| Search | Query parameters | You can set query-level parameters by using the parameters clause during search. |
| DBMS_VECTOR | REFRESH_INDEX |
Execute an incremental index refresh. |
| DBMS_VECTOR | REBUILD_INDEX |
Execute a full index rebuild. |
| Monitoring | Index monitoring | You can view the basic information and real-time status of an index by using views. For more information, see the related topics at the end of this topic. |
Syntax and description
Create
Memory sparse indexes support creation at table creation and post-creation. When creating, please note the following:
- Dimension limit: The maximum number of dimensions supported for creating a sparse vector index is 500,000.
- Column type: You must specify a
SPARSEVECTORcolumn when creating a sparse vector index. - Keywords: The
VECTORkeyword is required when creating an index. - Index types: Supported types are
sindiandsindi_sq.sindiindicates a full-precision memory sparse index, andsindi_sqindicates a scalar quantization sparse index. SINDI_SQ is recommended for large-scale data to reduce precision loss, and its logic is consistent with HNSW_SQ. - Distance algorithm: Only
inner_product(inner product) is supported. - Parameter consistency: The parameters and descriptions for post-creation indexes are consistent with those for creation at table creation.
- For SINDI_SQ indexes, it is recommended to create the index after writing data and rebuild the index after writing a large amount of incremental data:
- If created with the table, first create a SINDI index. After the data exceeds 10,000 rows, you can rebuild it as a SINDI_SQ index.
- If created later, create a SINDI index when the data does not exceed 10,000 rows. If the data exceeds 10,000 rows, you can directly create a SINDI_SQ index.
You can use the CREATE TABLE statement to create a sparse vector index.
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 SPARSEVECTOR,
...,
VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
You can create a sparse vector index on an existing SPARSEVECTOR column in a table.
CREATE VECTOR INDEX index_name ON table_name(column_name) WITH (param1=value1, param2=value2, ...);
The parameters are described as follows. The parameters for creation at table creation and post-creation are consistent. The build and query parameters for SINDI_SQ are consistent with those for SINDI:
| Parameter | Default Value | Value Range | Required | Description | Notes |
|---|---|---|---|---|---|
distance |
inner_product |
Yes | The vector distance algorithm type. | Both SINDI and SINDI_SQ indexes support only the inner product (inner_product) as the distance algorithm. |
|
type |
sindi / sindi_sq |
Yes | The index algorithm type. | sindi indicates a full-precision memory sparse index, and sindi_sq indicates a scalar quantization sparse index. |
|
lib |
vsag |
vsag |
No | The vector index library type. | Only the VSAG vector library is supported. |
prune |
false |
true/false |
No | Whether to prune the vector. | If prune is set to true, you must also set the refine and drop_ratio_build parameters. If prune is set to false, you can perform full-precision search. If you set refine to true or drop_ratio_build to a value other than 0, an error will be returned. |
refine |
false |
true/false |
No | Whether to rerank. | If set to true, high-precision distance calculation and reranking are performed on the search results. This parameter is valid only when prune is set to true. |
drop_ratio_build |
0 |
[0, 0.9] |
No | The pruning ratio for sparse vector data. | When a new sparse vector is inserted, the query_length * drop_ratio_build smallest values are pruned based on their values. If refine is set to true, the original vector data is retained. Otherwise, only the pruned data is retained. This parameter is valid only when prune is set to true. |
drop_ratio_search |
0 |
[0, 0.9] |
No | The pruning ratio for search sparse vector values. | A larger value indicates more pruning, lower accuracy, and higher performance. You can also set this parameter in the parameters clause during search. Query parameters have higher priority. |
refine_k |
4.0 |
[1.0, 1000.0] |
No | The reranking ratio. | The search results for limit_k * refine_k items are obtained, and the original vectors are used for reranking. This parameter is valid only when refine is set to true. You can also set this parameter in the parameters clause during search. Query parameters have higher priority. |
Search
The search syntax for sparse vector indexes is similar to that for dense vector indexes. You can use the APPROXIMATE/APPROX keyword for approximate nearest neighbor search. The syntax is as follows:
SELECT ... FROM table_name
ORDER BY inner_product(column_name, query_vector) [APPROXIMATE|APPROX]
LIMIT n [PARAMETERS(param1=value1, param2=value2)];
Where:
column_name: TheSPARSEVECTORcolumn specified when the sparse vector index was created.query_vector: The query vector, which can be a string in sparse vector format, such as'{1:2.4, 3:1.5}'.n: The number of rows to return.PARAMETERS: Optional query-level parameters for settingdrop_ratio_searchandrefine_k.
Syntax requirements:
- The
APPROXIMATE/APPROXkeyword must be specified for the vector index to be used instead of a full table scan. - The
ORDER BYandLIMITclauses must be included. - The
ORDER BYclause supports only a single vector condition. - The value of
LIMIT + OFFSETmust be in the range(0, 16384]. - If the
LIMITclause is not specified, an error will be returned.
Rules for using distance functions:
- If the
APPROXIMATE/APPROXkeyword is specified, the search will use the vector index if the distance function is supported and matches the vector index algorithm. - If the
APPROXIMATE/APPROXkeyword is specified, the search will not use the vector index if the distance function does not match the vector index algorithm, but no error will be returned. - If the
APPROXIMATE/APPROXkeyword is specified, the search will not use the vector index and an error will be returned if the distance function is not supported. - If the
APPROXIMATE/APPROXkeyword is not specified, the search will not use the vector index, but no error will be returned.
Other notes:
- The
WHEREclause will be used as a filter for the vector index search. - Recall rate is affected by the build and search parameters.
- Index search parameters can be set when creating the index. If you need to adjust them after the index is created, you can use DBMS_VECTOR.REBUILD_INDEX to modify the table-level attributes, or you can specify the query-level parameter
ef_searchduring the query.
Special requirements for memory sparse indexes:
- Priority of query parameters: Query-level parameters set in the
PARAMETERSclause > query parameters set during index creation > default values. drop_ratio_search: The value is in the range[0, 0.9], with a default value of0. This parameter specifies the pruning ratio for search sparse vector values. A larger value indicates more pruning, lower accuracy, and higher performance. Thequery_length * drop_ratio_searchsmallest values are pruned based on their values. Since pruning all values is meaningless, at least one value will be retained.refine_k: The value is in the range[1.0, 1000.0], with a default value of4.0. This parameter specifies the reranking ratio. The search results forlimit_k * refine_kitems are obtained, and the original vectors are used for reranking. This parameter is valid only whenrefineis set totrue.
Create, update, search, and delete examples
Create during table creation
SINDI example
Create a test table sparse_t1 and create a sparse vector index:
CREATE TABLE sparse_t1 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR,
VECTOR INDEX sparse_idx1(c2)
WITH (lib=vsag, type=sindi, distance=inner_product)
);
Insert sparse vector data into the test table:
INSERT INTO sparse_t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO sparse_t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO sparse_t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
Search the test table:
SELECT * FROM sparse_t1;
The result is as follows:
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
+----+---------------------+
3 rows in set
SINDI_SQ example
CREATE TABLE sparse_t1_sq (c1 INT AUTO_INCREMENT, c2 SPARSEVECTOR, PRIMARY KEY(c1), VECTOR INDEX sparse_idx1_sq(c2) WITH (distance=inner_product, type=sindi_sq, lib=vsag));
Create after table creation
SINDI example
Create a test table and then create a sparse vector index:
CREATE TABLE sparse_t2 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR
);
CREATE VECTOR INDEX sparse_idx2 ON sparse_t2(c2)
WITH (lib=vsag, type=sindi, distance=inner_product,
prune=true, refine=true, drop_ratio_build=0.1,
drop_ratio_search=0.5, refine_k=2.0);
Insert sparse vector data into the test table:
INSERT INTO sparse_t2 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
Search the test table:
SELECT * FROM sparse_t2;
The result is as follows:
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
1 row in set
SINDI_SQ example
Create a test table.
CREATE TABLE sparse_t2_sq (c1 INT AUTO_INCREMENT, c2 SPARSEVECTOR, PRIMARY KEY(c1));
Create a SINDI_SQ index.
CREATE VECTOR INDEX sparse_idx_sq ON sparse_table_sindi_sq(c2) WITH (distance=inner_product, type=sindi_sq, lib=vsag);
Update
When you update sparse vector data, the index is automatically maintained:
UPDATE sparse_t1 SET c2 = '{1:0.1}' WHERE c1 = 1;
Delete
The delete operation is the same as for a regular vector index. You can directly delete data:
DELETE FROM sparse_t1 WHERE c1 = 1;
Search
CREATE TABLE t1 (
c1 INT PRIMARY KEY,
c2 SPARSEVECTOR,
VECTOR INDEX idx1(c2)
WITH (lib=vsag, type=sindi, distance=inner_product)
);
INSERT INTO t1 VALUES(1, '{1:0.1, 2:0.2, 3:0.3}');
INSERT INTO t1 VALUES(2, '{3:0.3, 2:0.2, 4:0.4}');
INSERT INTO t1 VALUES(3, '{3:0.3, 4:0.4, 5:0.5}');
INSERT INTO t1 VALUES(4, '{5:0.5, 4:0.4, 6:0.6}');
INSERT INTO t1 VALUES(5, '{5:0.5, 6:0.6, 7:0.7}');
SELECT * FROM t1
ORDER BY negative_inner_product(c2, '{3:0.3, 4:0.4}')
APPROXIMATE LIMIT 4;
The result is as follows:
+----+---------------------+
| c1 | c2 |
+----+---------------------+
| 2 | {2:0.2,3:0.3,4:0.4} |
| 3 | {3:0.3,4:0.4,5:0.5} |
| 4 | {4:0.4,5:0.5,6:0.6} |
| 1 | {1:0.1,2:0.2,3:0.3} |
+----+---------------------+
SELECT *, negative_inner_product(c2, '{3:0.3, 4:0.4}')
AS score FROM t1
ORDER BY score APPROXIMATE LIMIT 4
PARAMETERS(drop_ratio_search=0.5);
The result is as follows:
+----+---------------------+---------------------+
| c1 | c2 | score |
+----+---------------------+---------------------+
| 4 | {4:0.4,5:0.5,6:0.6} | -0.1600000113248825 |
| 3 | {3:0.3,4:0.4,5:0.5} | -0.2500000149011612 |
| 2 | {2:0.2,3:0.3,4:0.4} | -0.2500000149011612 |
+----+---------------------+---------------------+
3 rows in set
References
- For more information about the sparse vector data type, see Vector data types.
- For more information about memory estimation and actual usage of the sparse index, see Index memory management.
- For more information about distance functions, see Use SQL functions.
- For more information about view monitoring and rebuilding of the sparse index, see Index monitoring and maintenance.
