This topic describes the principles of the HNSW series and IVF series vector indexes.
HNSW series vector index
The HNSW vector index in OceanBase Database consists of a static baseline index and a dynamic incremental index. The static baseline index is read-only and is not modified after it is generated. The dynamic incremental index supports both read and write operations. DML operations (insert, update, and delete) are first written to the dynamic incremental index. When the incremental data reaches a certain scale, it is persisted to disk. During queries, the static baseline index and the dynamic incremental index are queried separately, and the results are merged and returned to the SQL layer. When the incremental data reaches a certain scale, an incremental-to-baseline merge is triggered to control the number of segments and ensure query performance.
Index storage
Data organization
The HNSW vector index introduces the concept of segments and organizes data by segment within partitions. After the introduction of persistent incremental data, four types of segments exist:
| Segment | Description | Storage location | Read/write |
|---|---|---|---|
| Baseline index | A read-only index generated by a full rebuild | Disk | Read-only |
| Active incremental | A writeable incremental index in memory | Memory | Read/write |
| Frozen incremental | An incremental index that is frozen after the active incremental index is full | Memory | Read-only |
| Persisted incremental | An incremental index that is persisted to disk after it is frozen | Disk | Read-only |
Other notes:
- A single partition contains only one baseline segment and can have multiple incremental segments.
- Incremental segments may contain uncommitted or rolled-back data, which needs to be filtered using a visibility bitmap.
Dumping and merging
Dumping
The incremental data in memory is persisted to disk, and the log table is shrunk. The main process is as follows:
- Freeze: When the active incremental index reaches a threshold (such as a certain number of rows in memory), it is frozen (no longer written to) → generates a frozen incremental index.
- Persist: The frozen incremental index is serialized and written to disk storage.
- Update metadata: The index metadata (such as version number and segment list) is updated.
Merging
| Type | Trigger condition | Purpose |
|---|---|---|
| Incremental merge | The number of persisted incremental indexes is ≥ 2 | Merges multiple incremental indexes to reduce the merge overhead during queries |
| Baseline rebuild | The number of incremental vectors exceeds the number of baseline vectors | Rebuilds the baseline index from scratch to generate a new read-only baseline index |
Note
During merging, you can perform quantization compression (such as FP32 → SQ/BQ) to save storage space.
Filters
During queries, you can use two types of filters:
| Scenario | Description |
|---|---|
| Pre_filter | Filters the main table data first, and then queries the vector index (suitable for scenarios with strong filtering conditions) |
| Post_filter | Queries the vector index first, and then filters the candidate results (suitable for scenarios where most vector results are valid) |
Common questions
Q: Why is the HNSW vector index built in two stages? A: Building an HNSW vector index has high costs and is not suitable for frequent updates. Instead:
- The baseline index is read-only and is rebuilt periodically to support queries.
- The incremental index is writeable in memory and supports real-time DML operations. It is periodically merged into the baseline index.
IVF series vector index
The IVF series vector index is an inverted index based on clustering. It divides the high-dimensional vector space into nlist subregions defined by cluster centers (centroids). Each cluster center is associated with an inverted bucket that stores vectors (or their quantized compressed codes) belonging to that subregion. During search, only the buckets closest to the query vector are accessed, achieving a balance between computational overhead and retrieval accuracy.
Internal structure
The IVF series vector index can be logically divided into two layers:
- Cluster center layer (coarse clustering layer): Maintains
nlistcluster centers to determine which buckets to prioritize during queries. The search parameternprobespecifies the number of closest cluster centers to consider. - Inverted bucket data layer: Each cluster center corresponds to a bucket, and the bucket stores vectors that fall within its cluster region (which can be compressed using techniques like product quantization).
The cluster center layer (coarse clustering layer) has two main physical organization methods, each suitable for different scales of nlist:
| Organization method | Suitable for | Search method | Features |
|---|---|---|---|
| ARRAY | Small nlist |
Linear distance calculation and top selection | Simple implementation, low constant overhead |
| HGRAPH | Large nlist |
Graph traversal | Supports large numbers of cluster centers and is efficient |
In practice, after the cluster centers are loaded into memory, the system automatically selects the optimal organization method based on the scale of nlist. If nlist is below a threshold, the ARRAY structure is used. If nlist exceeds the threshold, an HGraph is built. The system can also release the original large contiguous array and access the cluster centers through a unified interface.
References
- For syntax, parameters, and examples of creating, searching, and deleting HNSW series vector indexes, see HNSW series vector index.
- For the principles of IVF series vector indexes, see IVF series vector index.
- For information about memory and persistence control of vector indexes, see Memory management of vector indexes.
