This topic describes how to create, search, and delete IVF series dense indexes in OceanBase Database, including syntax details and practical examples.
Index syntax and descriptions
Create indexes
The IVF series includes two types of indexes: IVF and IVF_PQ. You can create these either when you create the table or after the table is created. Keep the following points in mind:
- You must specify the
VECTORkeyword when you create a vector index. - The parameters and descriptions of an index created after you create a table are the same as those of an index created when you create a table.
- If the data volume is large, we recommend that you write all the data and then create an index to obtain the best search performance.
- We recommend that you create an IVF or IVF_PQ index after you write data to the table and rebuild the index after you write a large amount of incremental data. For more information about how to create an index, see the specific examples in the following sections.
- When you create an IVF index, the index name must be less than 33 characters in length. Otherwise, an error may occur because the index auxiliary table name exceeds the
index_namelimit. In later versions, the index name can be longer than 33 characters.
Syntax for creating an index during table creation:
CREATE TABLE table_name (
column_name1 data_type1,
column_name2 VECTOR(dim),
...,
VECTOR INDEX index_name (column_name2) WITH (param1=value1, param2=value2, ...)
);
Syntax for creating an index after table creation:
-- You can set the parallelism to improve the index construction performance. The maximum value of the parallelism is the number of CPU cores multiplied by 2.
CREATE [/*+ paralell $value*/] VECTOR INDEX index_name ON table_name(column_name2) WITH (param1=value1, param2=value2, ...);
param description:
| Parameter | Default value | Value range | Required | Description | Remarks |
|---|---|---|---|---|---|
| distance | l2/inner_product/cosine | Yes | The vector distance algorithm. | l2 indicates the Euclidean distance, inner_product indicates the inner product distance, and cosine indicates the cosine distance. | |
| type | ivf_flat/ivf_pq | Yes | The type of the IVF index. | ||
| lib | ob | ob | No | The type of the vector index library. | |
| nlist | 128 | [1,65536] | No | The number of cluster centers. |
data_volume is the data volume of the partition with the largest data volume.
|
| sample_per_nlist | 256 | [1,int64_max] | Yes | The number of data samples for each cluster center. This parameter is used for an index created after you create a table. | V4.3.5 BP5: We recommend that you keep the parameter value as the default value and do not modify it. |
| nbits | 8 | [1,24] | No | The number of bits for quantization.
NoticeThis parameter is supported in V4.3.5 BP3 and later. You can specify this parameter only when you create an IVF_PQ index. |
We recommend that you set the value to 8 and keep the value in the range of [8, 10]. A larger value indicates a higher quantization accuracy and a higher search accuracy, but may affect the search performance. |
| m | No default value. Must be specified. | [1,65536] | Yes | The dimension of the quantized vector.
NoticeThis parameter is supported in V4.3.5 BP3 and later. You can specify this parameter only when you create an IVF_PQ index. |
A larger value indicates a higher search accuracy, but may affect the search performance. |
For IVF and IVF_PQ indexes (V4.3.5 BP5 only), the following additional notes apply:
- Limitations:
- You cannot create an IVF or IVF_PQ index in a scenario where a heap table and a partitioned table are used together.
- Recommendations:
- We strongly recommend that you create an IVF or IVF_PQ index after you import data. We do not recommend that you create an IVF index when you create a table. If no data is available when you create an index, the index cannot be clustered and cannot be used. In this case, you must manually execute the
REBUILD INDEXorDROPstatement and then execute theCREATE INDEXstatement again. - We recommend that you estimate and query the memory usage before you create an index. For more information, see Manage index memory. This way, you can avoid the memory shortage issue during index creation.
- We recommend that you execute the
ALTER SYSTEM major freezestatement and wait for the major freeze to complete before you create an index. - In a multi-node cluster, we recommend that you set the memory of the SYS tenant to more than 1 GB.
- We recommend that you set the parallel_servers_target parameter to
max_cpu * 10before you create an index. - If the data volume of a single partition is more than 10 million and the CPU resources of the tenant exceed 10 cores, we recommend that you modify the sampling ratio by executing the
ALTER SYSTEM SET _px_object_sampling = 5000statement before you create an index. This way, the index construction efficiency can be improved.
- We strongly recommend that you create an IVF or IVF_PQ index after you import data. We do not recommend that you create an IVF index when you create a table. If no data is available when you create an index, the index cannot be clustered and cannot be used. In this case, you must manually execute the
Search for indexes
The IVF series indexes use approximate nearest neighbor (ANN) search, which means results are not guaranteed to be 100% accurate. The main metric for accuracy is recall. For example, if you request the 10 nearest neighbors and consistently get 9 correct, the recall rate is 90%.
SELECT ... FROM table_name
ORDER BY distance_function(column_name, vector_expr) [APPROXIMATE|APPROX]
LIMIT num (OFFSET num)
[PARAMETERS (param1=value1, ...)];
Search syntax requirements:
- You must include the
APPROXIMATEorAPPROXkeyword for the query to use the vector index instead of scanning the entire table. - Both
ORDER BYandLIMITclauses are required. - Only one vector condition is supported in the
ORDER BYclause. - The sum of
LIMITandOFFSETmust be within the range(0, 16384]. - Omitting the
LIMITclause will cause an error. - The
PARAMETERSclause is used to specify search parameters. For an IVF index, thenprobesparameter is supported. For example,PARAMETERS(nprobes=200). This parameter is supported starting from V4.3.5 BP5 and specifies the number of cluster centers to be used during the search.
Rules for using distance functions:
- If you specify
APPROXIMATE/APPROX, use a distance function supported by the current version, and it matches the vector index algorithm, the search will use the vector index. - If you specify
APPROXIMATE/APPROXbut the distance function does not match the vector index algorithm, the search will not use the vector index, but no error will occur. - If you specify
APPROXIMATE/APPROXwith an unsupported distance function, the search will not use the vector index and you will get an error. - If you do not specify
APPROXIMATE/APPROX, even supported distance functions will not use the vector index, but no error will occur.
Additional notes:
- The
WHEREclause acts as a filter during vector index searches. - Recall rate is affected by both index build parameters and search parameters.
- Search parameters for the index are set when the index is created and cannot be changed later. However, you can use the session variable
ob_ivf_nprobesto set the number of cluster centers for searching. If this session variable is set, its value takes precedence. For details, see ob_ivf_nprobes. - Tips for optimizing IVF/IVF_PQ index searches (V4.3.5 BP5 only):
- Using compressed row format (
ROW_FORMAT=COMPRESSED) when you create the table can improve IVF search performance by 20–30%. For example:CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 VECTOR(4)) ROW_FORMAT=COMPRESSED; - Parallel search is not recommended, as it may reduce performance or return incorrect results.
- Using compressed row format (
Notice
Starting with V4.3.5 BP5, you can use the PARAMETERS(similarity=$value) clause to specify a similarity threshold for searches.
- Similarity threshold parameter (
similarity):- Syntax: Add
PARAMETERS (similarity=$value)after theLIMITclause, where$valueis the similarity threshold (range:[0, 1]). In most cases, a higher similarity value means greater similarity and a smaller distance. The exception isinner_product, where a larger distance indicates greater similarity. - Feature: When you specify a similarity threshold, only results with similarity greater than or equal to that threshold will be returned.
- Supported IVF index types: IVF, IVF_PQ
- Supported distance types:
- For searches, you can specify the similarity threshold for indexes using
cosine_distanceandl2_distance. Forl2_distance, it’s recommended to apply L2 normalization to vectors for more accurate similarity searches. For details and examples, see Using SQL functions - vector normalization. - Similarity thresholds are not supported for indexes using
inner_productdistance (an error will be returned).
- For searches, you can specify the similarity threshold for indexes using
- Relationship between similarity and distance:
cosine_distance=2 - 2 * similarityl2_distance=sqrt(1 / similarity - 1)
- Syntax: Add
Delete an index
The syntax for deleting a vector index is as follows:
DROP INDEX index_name ON table_name;
Examples of creating, searching, and deleting indexes
Create an index during table creation
IVF example
CREATE TABLE ivf_vecindex_suite_table_test (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1), VECTOR INDEX idx2(c2) WITH (distance=l2, type=ivf_flat));
Create an index after table creation
IVF example
Create a test table.
CREATE TABLE vec_table_ivf (c1 INT, c2 VECTOR(3), PRIMARY KEY(c1));
Create an IVF index.
CREATE VECTOR INDEX vec_idx3 ON vec_table_ivf(c2) WITH (distance=l2, type=ivf_flat);
Delete
DROP INDEX vec_idx1 ON vec_table;
Check the deleted index.
SHOW INDEX FROM vec_table;
The return result is as follows:
Empty set