OceanBase Database provides various vector index algorithms. You can select the appropriate index type based on your use case.
HNSW or IVF?
OceanBase Database supports two types of dense indexes:
- Graph-based HNSW series indexes: HNSW, HNSW_SQ, and HNSW_BQ.
- Disk-based IVF series indexes: IVF and IVF_PQ.
Each type of index has its own strengths. HNSW series indexes typically deliver higher query performance but require more memory. IVF series indexes perform well when the cache is sufficient and do not require the entire index to be resident in memory. However, the choice between HNSW and IVF is not based solely on memory considerations: you must also evaluate your business scenario, data size, performance requirements, and resource constraints. The following sections compare their core differences and provide recommendations.
Core differences
| Dimension | HNSW series | IVF series |
|---|---|---|
| Storage method | In-memory graph-based index | Disk-based index |
| Memory usage | High, as the index must be fully loaded into memory | Low, as the index does not need to be fully loaded into memory |
| Query performance (QPS) | Extremely high, with sub-millisecond response times | High, approaching HNSW performance when cache is sufficient |
| Recall rate | High, up to 99% | High, slightly lower than HNSW, and can be optimized with parameters |
| Construction speed | Slower, as it requires building a graph structure | Faster, based on clustering algorithms |
| Data volume | Millions to hundreds of millions of vectors | Millions to billions of vectors |
| Cost | High memory cost | Low storage cost, suitable for large-scale data |
| Real-time capability | Supports real-time DML operations | Supports real-time DML operations |
Decision flowchart
Note
- Before selecting an index type, you need to estimate the memory usage based on the information in Manage memory usage of vector indexes.
- The recommendations in the decision flowchart are based on 1024-dimensional vectors. If your vectors have a different dimension, you can scale the estimated resources proportionally to the dimension ratio.
- The decision flowchart primarily considers memory costs to help you decide on the index type.
- Even if a tenant has sufficient memory, it is not always optimal to choose the highest specification index such as HNSW. If you require extremely high performance, consider more cost-effective options such as HNSW_SQ.
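The dimension-based scaling mentioned in the note can be sketched as a short helper. This is illustrative only (not an OceanBase API): raw vector storage scales linearly with dimension, while graph links and other index overhead do not, so treat the result as a rough approximation.

```python
# Hypothetical helper: rescale a memory estimate made for
# 1024-dimensional vectors to a different vector dimension.
def scale_memory_estimate(mem_gb_at_1024: float, target_dim: int) -> float:
    """Approximate memory (GB) for target_dim, given an estimate for 1024 dims."""
    return mem_gb_at_1024 * target_dim / 1024

# e.g. an 80 GB estimate for 1024-dim vectors, rescaled to 768 dims
print(scale_memory_estimate(80.0, 768))  # 60.0
```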

Notice
This section only lists some common scenarios. If you cannot determine the index type based on the above flowchart or have other requirements, contact OceanBase Technical Support.
Should I use a partitioned table?
The primary reason for using a partitioned table is to handle large volumes of data. Additionally, if your query conditions can be used as partition keys, partition pruning can enhance query performance. Consider using a partitioned table in the following scenarios:
- When the data volume reaches tens of millions or even hundreds of millions of records: With a large amount of data, partitioning allows data to be distributed across multiple partitions. Each partition can have its own index, reducing the load on individual queries and improving overall query performance.
- When your query conditions include specific scalar columns that can be used for partition pruning: For example, if the label field is frequently used in WHERE clauses, you can consider using it as the partition key to create a partitioned table. This way, partition pruning reduces the number of partitions that need to be queried.
Here are some specific recommendations:
Partitioning
When using vector indexes, it's best to avoid having too many partitions. Unlike scalar indexes, vector indexes like HNSW don't see a significant increase in query TopK computation costs when the index size grows from 1 million to 2 million vectors under the same configuration. Therefore, if partition pruning isn't utilized, having too many partitions could actually degrade performance. Additionally, overly large partitions not only increase the time required for index rebuilding but also impact the efficiency of joint queries with scalar conditions.
In summary, it's recommended to keep the data volume in each partition below 20 million and prioritize selecting a field that supports partition pruning as the partition key.
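The per-partition sizing guideline above can be expressed as a small sketch. This is not an OceanBase function, just the arithmetic behind the recommendation: pick the smallest partition count that keeps each partition below the suggested 20 million-row ceiling.

```python
import math

# Suggested ceiling from the text: keep each partition below 20 million rows.
MAX_ROWS_PER_PARTITION = 20_000_000

def suggested_partition_count(total_rows: int) -> int:
    """Smallest partition count that keeps every partition under the ceiling."""
    return max(1, math.ceil(total_rows / MAX_ROWS_PER_PARTITION))

print(suggested_partition_count(100_000_000))  # 5
print(suggested_partition_count(15_000_000))   # 1
```

If a scalar column (such as a frequently filtered label) can serve as the partition key, prefer it so that partition pruning can apply on top of this sizing.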
Algorithm selection
For large datasets, we recommend using the HNSW_BQ or IVF_PQ index. If you want to use other indexes, please refer to the Memory usage section for estimation.
Memory usage
HNSW_BQ
For the HNSW_BQ index, it is recommended that Tenant memory > Total memory usage during HNSW_BQ queries + Memory usage for a single HNSW_SQ index partition. The memory estimation process is as follows:
- Use the INDEX_VECTOR_MEMORY_ADVISOR function to calculate the recommended memory for HNSW_BQ and HNSW_SQ indexes during construction and queries.
- Based on the recommended memory values obtained from this function, calculate the total memory required for the tenant.
For example, if you have 100 million 1024-dimensional vector data and use 10 partitions (approximately 10 million vectors per partition), the calculation process is as follows:
-- Specify REFINE_TYPE=SQ8 to directly and accurately obtain the recommended memory for the HNSW_BQ index during construction and queries.
-- You do not need to separately calculate the memory usage for the HNSW_SQ index. This is because the HNSW_BQ index construction process by default uses the SQ8 quantization algorithm.
-- After specifying refine_type=sq8, the function automatically includes the memory required for the SQ8 quantized vectors in the calculation.
-- The recommended memory for the HNSW_BQ index is 74.6 GB, and the memory usage during queries is 57.4 GB. We use the recommended memory value.
SELECT DBMS_VECTOR.INDEX_VECTOR_MEMORY_ADVISOR('HNSW_BQ',100000000,1024,'FLOAT32','M=32,DISTANCE=COSINE,REFINE_TYPE=SQ8', 10000000);
+------------------------------------------------------------------------------------------------------------------------------+
| DBMS_VECTOR.INDEX_VECTOR_MEMORY_ADVISOR('HNSW_BQ',100000000,1024,'FLOAT32','M=32,DISTANCE=COSINE,REFINE_TYPE=SQ8', 10000000) |
+------------------------------------------------------------------------------------------------------------------------------+
| Suggested minimum vector memory is 74.6 GB, memory consumption when providing search service is 57.4 GB |
+------------------------------------------------------------------------------------------------------------------------------+
1 row in set
-- By default, the vector memory occupies 50% of the tenant memory. Therefore, the total tenant memory is
SELECT 74.6/0.5;
+----------+
| 74.6/0.5 |
+----------+
| 149.2000 |
+----------+
1 row in set
-- To account for the possibility of new data not being compressed in the actual environment, it is recommended to reserve some extra memory.
SELECT 149.2 * 1.2;
+-------------+
| 149.2 * 1.2 |
+-------------+
|      179.04 |
+-------------+
1 row in set
-- Therefore, the recommended tenant memory configuration is 179 GB.
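The tenant-memory arithmetic in this example can be checked with a short script. It assumes the default vector-memory ratio (50% of tenant memory) and the 20% headroom factor used above; both are taken from the example, not computed by the database.

```python
# Reproduce the tenant-memory arithmetic from the HNSW_BQ example above.
VECTOR_MEMORY_RATIO = 0.5   # vector index may use up to 50% of tenant memory
HEADROOM = 1.2              # reserve 20% for uncompressed fresh data

def tenant_memory_gb(recommended_vector_mem_gb: float) -> float:
    """Tenant memory needed for a given recommended vector-memory value."""
    return recommended_vector_mem_gb / VECTOR_MEMORY_RATIO * HEADROOM

print(round(tenant_memory_gb(74.6), 2))  # 179.04
```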
IVF series
For IVF and IVF_PQ indexes, it is recommended to set tenant memory > memory required for building a single partition + total memory required for all partitions to be resident. For example, with 100 million data points divided into 10 partitions, each containing approximately 10 million vectors, building a single partition requires about 2.7 GB of memory. When querying, the total memory required for all 10 partitions of the IVF_PQ index is approximately 1.1 GB (110 MB per partition). Therefore, a minimum of around 3 GB of memory is needed. It is recommended to configure at least 6 GB of memory to allow for some buffer. This method can be used to estimate memory requirements for other data volume scenarios.
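The IVF sizing rule above can be sketched as follows. The numbers are the example's values (2.7 GB to build one partition, about 110 MB resident per partition, 10 partitions); the text then roughly doubles the minimum to the recommended 6 GB for buffer.

```python
# Sketch of the IVF/IVF_PQ sizing rule: tenant memory should exceed
# (memory to build one partition) + (resident memory of all partitions).
def ivf_min_memory_gb(build_one_partition_gb: float,
                      resident_per_partition_gb: float,
                      partitions: int) -> float:
    return build_one_partition_gb + resident_per_partition_gb * partitions

minimum = ivf_min_memory_gb(2.7, 0.11, 10)  # 2.7 GB build + 1.1 GB resident
print(round(minimum, 2))  # 3.8
```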
Build and query parameters
Index build and query parameters are recommended to be set based on the maximum data volume of a single partition. For more information, see the Suggested index parameters section.
Performance and recall rate
- If the query can be pruned to a single partition: the performance and recall rate will be the same as in the single-partition scenario, as detailed in the section on different data volume scales below.
- If the query cannot be pruned to a single partition: QPS can be estimated proportionally based on the number of partitions per OBServer node (e.g., if a single node has 3 partitions, the QPS would be approximately 1/3 of the single-partition performance). Since cross-partition queries combine more candidate results, the actual recall rate is usually higher than in the single-partition scenario.
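The proportional QPS estimate described above amounts to a simple division. This is a rough model, not a measured figure: without partition pruning, each query fans out to every partition on a node, so throughput divides by the number of partitions per OBServer node.

```python
# Rough throughput model for queries that cannot be pruned to one partition.
def estimated_qps(single_partition_qps: float, partitions_per_node: int) -> float:
    """Approximate QPS when every query scans all partitions on a node."""
    return single_partition_qps / partitions_per_node

# e.g. a hypothetical 3000 QPS single-partition baseline with 3 partitions per node
print(estimated_qps(3000.0, 3))  # 1000.0
```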
Suggested index parameters
HNSW series
The recommended index construction and query parameters vary depending on the data volume for indexes within the same series. This section provides suggested configurations for 768-dimensional vectors with 1 million and 10 million data points using HNSW, HNSW_SQ, and HNSW_BQ indexes. For vector data in the billions, refer to subsequent sections for IVF_PQ or HNSW_BQ indexes within partitioned tables.
Notice
For scenarios where the data volume is expected to grow, it is recommended to set parameters based on the final data volume.
Notice
HNSW_BQ is a high-compression quantization algorithm. Its recall rate may be limited in low-dimensional vectors. To avoid performance loss, it is recommended to use HNSW_BQ on vectors of 512 dimensions or higher.
| Scenario | Index Type | Parameter Recommendations |
|---|---|---|
| Maximum Recall (Highest memory usage) | HNSW | 1 million data points: m = 16, ef_construction = 200, ef_search = 100, other parameters default |
| Best Performance (Lowest memory usage) | HNSW_SQ | 1 million data points: m = 16, ef_construction = 200, ef_search = 100, other parameters default; 10 million data points: m = 32, ef_construction = 400, ef_search = 350, other parameters default |
| Best Cost-Performance Ratio (Low memory usage, good performance) | HNSW_BQ | 1 million data points: m = 16, ef_construction = 200, ef_search = 100, other parameters default; 10 million data points: m = 32, ef_construction = 400, ef_search = 1000, refine_k = 10, other parameters default; 1 billion data points: use partitioned tables, m = 32, ef_construction = 400, ef_search = 1000, refine_k = 10, other parameters default |
Detailed explanation for millions of data
Using the parameters in the preceding table, this section provides additional details on memory usage and recall rate optimization for 1 million 768-dimensional vectors.
Memory usage:
| Index type | Recommended tenant memory | Description |
|---|---|---|
| HNSW | 15 GB | The recommended size for the vector index memory is 7.3 GB. If the tenant memory exceeds 8 GB, the vector index can use up to 50% of the tenant memory by default. If the tenant memory is 8 GB or less, the vector index can use up to 40% of the tenant memory by default. Therefore, approximately 15 GB of tenant memory is required. |
| HNSW_SQ | 6 GB | The recommended size for the vector index memory is 2.1 GB. |
| HNSW_BQ | 6 GB | The HNSW_BQ index requires high-precision vectors during construction, so the memory requirements during construction are the same as those for HNSW_SQ, which is 6 GB. After the index is built, the memory usage decreases significantly to about 405 MB. The preceding description applies to non-partitioned tables. If you use partitioned tables, OceanBase Database dynamically adjusts the number of partitions that can be built at the same time based on the memory allocated to the tenant. When configuring, we recommend that the tenant memory should include the memory required for HNSW_BQ queries and the memory required for HNSW_SQ during the construction of a partition, with some redundancy. For more information, see the Memory usage section. |
Recall rate optimization:
You can improve the recall rate by increasing the number of vector calculations, but this will reduce query performance. You can set the parameters to the recommended values in the following table for different TopN values. If you want to further improve the recall rate, you can increase the parameter values.
Notice
The recall rate is directly related to the data characteristics. The following table provides the recommended values for a standard dataset with 768 dimensions, where the recall rate reaches approximately 0.95.
| TopN | ef_search | refine_k (only for HNSW_BQ) |
|---|---|---|
| Top10 | 64 | 4 |
| Top100 | 240 | 4 |
Maximum recall rate:
The maximum recall rates vary among different indexing algorithms. When using the recommended construction parameters in this section, setting ef_search to 1000 may improve the query recall rate, but it will reduce the QPS to one-third of that at a 0.95 recall rate. Additionally, increasing ef_search does not infinitely improve the recall rate for all types of indexes. Only the HNSW index can achieve a recall rate of 0.99 or higher. The HNSW_BQ index can further improve the recall rate by increasing refine_k, but this will also further reduce performance.
Maximum recall rate reference values (ef_search = 1000):
- HNSW index: recall rate of 0.991
- HNSW_SQ index: recall rate of 0.9786
- HNSW_BQ index: recall rate of 0.9897 (ef_search = 1000, refine_k = 10)
Detailed explanation for tens of millions of data
Let's take an example of 10 million vectors with 768 dimensions (using the construction parameters from the table above) to explain the memory usage and recall rate optimization:
Memory usage:
| Index Type | Recommended Tenant Memory | Description |
|---|---|---|
| HNSW | 160 GB | The recommended memory size for the vector index is 76.3 GB. |
| HNSW_SQ | 48 GB | The recommended memory size for the vector index is 22.6 GB. |
| HNSW_BQ | 48 GB | The HNSW_BQ index requires high-precision vectors during construction. By default, it uses HNSW_SQ as a cache during index construction, so for non-partitioned tables, the memory required for HNSW_BQ is the same as for HNSW_SQ. After construction, the HNSW_BQ index only occupies about 5.4 GB of memory. |
Recall rate optimization:
By adjusting the ef_search and refine_k (only for HNSW_BQ) parameters, you can increase the number of vector calculations to improve the recall rate, but this will also reduce query performance. For different TopN values, you can set the parameters to the recommended values in the table below. If you need to further improve the recall rate, you can set the parameter values to be larger.
Notice
The recall rate is directly related to the data characteristics. The recommended values in the table below are for a standard dataset with 768 dimensions, where the recall rate reaches approximately 0.95.
| TopN | ef_search | refine_k (only for HNSW_BQ) |
|---|---|---|
| Top10 | 100 | 4 |
| Top100 (HNSW/HNSW_SQ) | 350 | - |
| Top100 (HNSW_BQ) | 1000 | 10 |
IVF series
| Scenario | Index type | Parameter recommendations |
|---|---|---|
| Low-dimensional (384 dimensions and below) | IVF | Tens of millions of data: use partitioned tables with nlist=3000; hundreds of millions of data: use partitioned tables with nlist=3000 |
| Low-cost (minimal memory usage) | IVF_PQ | Millions of data: nlist=1000, m=vector dimension/2; tens of millions of data: nlist=3000, m=vector dimension/2; hundreds of millions of data: use partitioned tables with nlist=3000, m=vector dimension/2 |
To balance the number of cluster centers and the amount of data per center, it is generally recommended to set nlist to the square root of the data volume. For example, for 10 million data points, nlist should be around 3000. When using IVF_PQ, the m parameter is recommended to be set to half of the vector dimension (dim).
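These rules of thumb can be written as a short sketch. The functions below are illustrative helpers, not OceanBase APIs: nlist near the square root of the (per-partition) row count, and IVF_PQ's m at half the vector dimension.

```python
import math

def suggested_nlist(rows_per_partition: int) -> int:
    """Rule of thumb: nlist near the square root of the row count."""
    return round(math.sqrt(rows_per_partition))

def suggested_m(dim: int) -> int:
    """Rule of thumb for IVF_PQ: m at half the vector dimension."""
    return dim // 2

print(suggested_nlist(10_000_000))  # 3162
print(suggested_m(768))             # 384
```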
Notice
IVF_PQ is a high-compression quantization algorithm, and its recall rate may be relatively low in low-dimensional vectors. To avoid performance loss, it is recommended to use IVF_PQ for vectors with 128 dimensions or more.
Notice
For scenarios where the data volume is expected to grow, it is recommended to set the parameters based on the final data volume.
Handling tens of millions of data
Let's take an example of 10 million 768-dimensional vector data (using the parameters from the table above) to explain how to optimize memory usage and recall rate:
Memory Usage:
| Index Type | Index Parameters | Memory Usage (Build / Resident) |
|---|---|---|
| IVF | distance=l2, nlist=3000 | 2.7 GB / 10.5 MB |
| IVF_PQ | distance=l2, nlist=3000, m=384 | 4.0 GB / 1.3 GB |
| IVF_PQ | distance=cosine, nlist=3000, m=384 | 2.7 GB / 11.4 MB |
In the table above, Build refers to the temporary memory usage during index creation, which is released after the index is built. Resident refers to the ongoing memory usage after the index is built.
For IVF_PQ indexes, if you choose distance = l2, it will use more resident memory due to the need to store precomputed results. In contrast, using distance = inner_product or cosine consumes less resident memory. Therefore, in practical applications, it is recommended to prioritize inner_product or cosine as the distance type to optimize memory resources.
Recall Rate Optimization:
By adjusting the nprobes parameter, you can increase the number of vector computations to improve the recall rate, but this will also reduce query performance. For different TopN values, you can set the parameter to the recommended values in the table below. If you need to further improve the recall rate, you can set the parameter to a larger value.
Notice
The recall rate is directly related to the data characteristics. The recommended values in the table below are for a standard dataset with 768 dimensions, where the recall rate reaches approximately 0.95.
| TopN | nprobes |
|---|---|
| Top10 | 1 |
| Top100 | 20 |
Handling hundreds of millions of vector data (with partitioned tables)
When dealing with vector data in the hundreds of millions, it's highly recommended to use partitioned tables in conjunction with IVF-type indexes. As the dataset size and the nlist parameter increase, the query overhead of a single IVF index significantly increases. By splitting the data into multiple partitions and building smaller-scale IVF indexes for each partition, the query load per partition is effectively reduced. Additionally, parallel queries across partitions further enhance overall query performance and recall rate.
In a multi-partitioned table scenario, since each IVF index is local to its partition, each partition independently constructs its own IVF index. Therefore, it's advisable to calculate the nlist value based on the average data volume per partition. For example, with 100 million 768-dimensional vectors divided into 10 partitions, each partition contains about 10 million vectors. In this case, nlist is recommended to be set to sqrt(10 million) = 3162.
Let's take the example of 100 million 768-dimensional vectors (using the construction parameters from the table above) to explain memory usage and recall rate optimization:
Memory Usage:
In a multi-partitioned scenario, since each partition independently constructs and maintains its own IVF index, the total memory usage is the sum of the memory usage across all partitions. For instance, if a single IVF index consumes 10.5 MB of memory, with 10 partitions, the total memory usage would be approximately 10.5 × 10 = 105 MB.
| Index Type | Index Parameters | Memory Usage (Construction / Resident) |
|---|---|---|
| IVF | distance=l2, nlist=3000 | 2.7 GB / 10.5 MB × 10 ≈ 105 MB |
| IVF_PQ | distance=l2, nlist=3000, m=384 | 4.0 GB / 1.3 GB × 10 = 13 GB |
| IVF_PQ | distance=cosine, nlist=3000, m=384 | 2.7 GB / 11.4 MB × 10 = 114 MB |
In the table above, the Construction column indicates the temporary memory usage during index creation, which is released after the index is built. The Resident column shows the continuous memory usage of the IVF vector index after the index is built.
Recall Rate Optimization:
Adjusting the nprobes parameter allows for more vector computations, thereby improving the recall rate, but this comes at the cost of reduced query performance. You can set the parameter to the recommended values in the table below for different TopN scenarios. If you need further improvements in recall rate, you can set the parameter to a higher value.
In a partitioned table scenario, since each partition independently executes an IVF index query, when querying across multiple partitions, the TopN results are retrieved from each partition and then aggregated and re-ranked. This not only enhances the overall search accuracy but also typically results in a higher actual recall rate compared to a single-partition scenario. Therefore, in a multi-partitioned table, you can appropriately reduce the nprobes parameter to achieve a recall rate comparable to that of a single-partitioned table.
Notice
The recall rate is directly related to the data characteristics. The values in the table below are recommended for a standard dataset of 768 dimensions to achieve a recall rate of around 0.95.
| TopN | nprobes |
|---|---|
| Top10 | 1 |
| Top100 | 10 |
Related topics
- For more information about parameters and tuning, see HNSW Index and IVF Index.