Vector similarity search|V4.4.2|OceanBase Database| docs|Distributed Database

Vector similarity search

Last Updated：2026-04-02 06:23:56 Updated

Vector similarity search, also known as nearest neighbor search, is a search method based on distance metrics in vector space. Its core objective is to find the set of vectors most similar to a given query vector. Although specific distance metrics are used during computation, the final output is the Top K nearest vectors, sorted in ascending order of distance.

This topic describes two vector search methods in OceanBase Database: exact nearest neighbor search based on full-scan and approximate nearest neighbor search based on vector index. It also provides examples to illustrate how to use these methods.

Notice

For readability, this document refers to vector nearest neighbor search as "vector search," exact nearest neighbor search as "exact search," and approximate nearest neighbor search as "approximate search."

Perform exact search

Exact search uses a full scan strategy, calculating the distance between the query vector and all vectors in the dataset to perform an exact search. This method ensures complete accuracy of the search results, but because it requires calculating the distance for all data, the search performance decreases significantly as the dataset grows.

When executing an exact search, the system calculates and compares the distances between the query vector vₑ and all other vectors in the vector space. After completing the full distance calculations, the system selects the k vectors closest to the query as the search results.

Example: Euclidean search

Euclidean similarity search is used to retrieve the top-k vectors in the vector space that are closest to the query vector, using Euclidean distance as the metric. The following example demonstrates how to use exact search to retrieve the top 5 vectors from a table that are closest to the query vector:

-- Create a test table
CREATE TABLE t1 (
    id INT PRIMARY KEY,
    c1 VECTOR(3)
);

-- Insert data
INSERT INTO t1 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform an exact search
SELECT c1
FROM t1
ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;

The return result is as follows:

+---------------+
| c1            |
+---------------+
| [0.1,0.2,0.3] |
| [0.2,0.3,0.4] |
| [0.3,0.4,0.5] |
| [0.4,0.5,0.6] |
| [0.5,0.6,0.7] |
+---------------+
5 rows in set

Analyze the execution plan

Obtain the execution plan of the preceding example:

EXPLAIN SELECT c1
        FROM t1
        ORDER BY l2_distance(c1, '[0.1, 0.2, 0.3]') LIMIT 5;

The return result is as follows:

+---------------------------------------------------------------------------------------------+
| Query Plan                                                                                  |
+---------------------------------------------------------------------------------------------+
| =================================================                                           |
| |ID|OPERATOR         |NAME|EST.ROWS|EST.TIME(us)|                                           |
| -------------------------------------------------                                           |
| |0 |TOP-N SORT       |    |5       |3           |                                           |
| |1 |└─TABLE FULL SCAN|t1  |10      |3           |                                           |
| =================================================                                           |
| Outputs & filters:                                                                          |
| -------------------------------------                                                       |
|   0 - output([t1.c1]), filter(nil), rowset=16                                               |
|       sort_keys([l2_distance(t1.c1, cast('[0.1, 0.2, 0.3]', ARRAY(18, -1))), ASC]), topn(5) |
|   1 - output([t1.c1]), filter(nil), rowset=16                                               |
|       access([t1.c1]), partitions(p0)                                                       |
|       is_index_back=false, is_global_index=false,                                           |
|       range_key([t1.id]), range(MIN ; MAX)always true                                       |
+---------------------------------------------------------------------------------------------+
14 rows in set

The analysis is as follows:

Execution method:
- The full-table scan method is used, which requires traversing all data in the table. The TABLE FULL SCAN operation in the execution plan scans all data in the t1 table.
- The system calculates the vector distance for each record and then sorts the records by distance. The TOP-N SORT operation in the execution plan calculates the vector distance using the l2_distance function and sorts the records by distance in ascending order.
- The system returns the five records with the smallest distances. The topn(5) setting in the execution plan indicates that only the first five records of the sorted list are returned.
Performance characteristics:
- Advantages: The search results are completely accurate and ensure that the true nearest neighbors are returned.
- Disadvantages: The system must scan all data in the table and calculate the distance between all vectors, leading to a significant drop in performance as the data volume increases.
Applicable scenarios:
- Scenarios with a small amount of data.
- Scenarios where high result accuracy is required.
- Scenarios where real-time queries are not suitable for large datasets.

Perform approximate search by using vector indexes

Vector index search uses an approximate nearest neighbor (ANN) strategy, accelerating the search process through pre-built index structures. While it cannot guarantee 100% result accuracy, it can significantly improve search performance, allowing for a good balance between accuracy and efficiency in practical applications.

Example: Approximate search by using the HNSW index

-- Create a HNSW vector index for the table.
CREATE TABLE t2 (
    id INT PRIMARY KEY,
    vec VECTOR(3),
    VECTOR INDEX idx(vec) WITH (distance=l2, type=hnsw, lib=vsag)
);

-- Insert test data.
INSERT INTO t2 VALUES
(1, '[0.1, 0.2, 0.3]'),
(2, '[0.2, 0.3, 0.4]'),
(3, '[0.3, 0.4, 0.5]'),
(4, '[0.4, 0.5, 0.6]'),
(5, '[0.5, 0.6, 0.7]'),
(6, '[0.6, 0.7, 0.8]'),
(7, '[0.7, 0.8, 0.9]'),
(8, '[0.8, 0.9, 1.0]'),
(9, '[0.9, 1.0, 0.1]'),
(10, '[1.0, 0.1, 0.2]');

-- Perform approximate search and return the five most similar data records.
SELECT id, vec
FROM t2
ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
APPROXIMATE
LIMIT 5;

The return result is as follows. The result is the same as that of the exact search because the data volume is small:

+------+---------------+
| id   | vec           |
+------+---------------+
|    1 | [0.1,0.2,0.3] |
|    2 | [0.2,0.3,0.4] |
|    3 | [0.3,0.4,0.5] |
|    4 | [0.4,0.5,0.6] |
|    5 | [0.5,0.6,0.7] |
+------+---------------+
5 rows in set

Execution plan analysis

The execution plan of the preceding example is as follows:

EXPLAIN SELECT id, vec
        FROM t2
        ORDER BY l2_distance(vec, '[0.1, 0.2, 0.3]')
        APPROXIMATE
        LIMIT 5;

The return result is as follows:

+--------------------------------------------------------------------------------------------------------------------+
| Query Plan                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------+
| ====================================================                                                               |
| |ID|OPERATOR         |NAME   |EST.ROWS|EST.TIME(us)|                                                               |
| ----------------------------------------------------                                                               |
| |0 |VECTOR INDEX SCAN|t2(idx)|10      |29          |                                                               |
| ====================================================                                                               |
| Outputs & filters:                                                                                                 |
| -------------------------------------                                                                              |
|   0 - output([t2.id], [t2.vec]), filter(nil), rowset=16                                                            |
|       access([t2.id], [t2.vec]), partitions(p0)                                                                    |
|       is_index_back=true, is_global_index=false,                                                                   |
|       range_key([t2.__vid_1750162978114053], [t2.__type_17_1750162978114364]), range(MIN,MIN ; MAX,MAX)always true |
+--------------------------------------------------------------------------------------------------------------------+
11 rows in set

The analysis is as follows:

Execution method:
- The vector index scan method is used. The HNSW index is used to directly locate similar vectors. The VECTOR INDEX SCAN operation in the execution plan is performed by using the index t2(idx).
- The graph structure of the index is used to quickly locate the nearest neighbors. The distance between all vectors is not calculated. The is_index_back=true setting in the execution plan indicates that the index is used to retrieve complete data.
- The five records that the index considers to be the most similar are returned. The output([t2.id], [t2.vec]) in the execution plan indicates that the id and vector data are returned.
Performance characteristics:
- Advantage: The search performance is high and remains stable as the data volume increases.
- Disadvantage: A small amount of error may exist in the results, and 100% accuracy is not guaranteed.
Applicable scenarios:
- Real-time search for large-scale datasets.
- Scenarios with high requirements for search performance.
- Scenarios that can tolerate a small amount of result error.

Summary

The following table describes the two search methods.

Item	Exact search	Approximate search
Execution method	Full-table scan (`TABLE FULL SCAN`) and sorting	Direct search by using the vector index (`VECTOR INDEX SCAN`)
Execution plan	Contains two operators: `TABLE FULL SCAN` and `TOP-N SORT`	Contains only one operator: `VECTOR INDEX SCAN`
Performance characteristics	Full-table scan and sorting are required, and the performance significantly decreases as the data volume increases.	The index is used to directly locate the target data, and the performance is stable.
Result accuracy	100% accurate, and the real nearest neighbors are returned.	Approximately accurate, and a small amount of error may exist.
Applicable scenarios	Scenarios with a small amount of data and high accuracy requirements.	Scenarios with large-scale datasets and high performance requirements.

For more information about SQL functions, see Use SQL functions.
For more information about vector indexes, see Vector index overview.
To perform large-scale performance tests, we recommend using the VectorDBBenchmark tool to generate a test dataset to better compare the performance differences between exact search and approximate search.

OceanBase

Customer Stories

Documentation

Vector similarity search

Notice

Perform exact search

Example: Euclidean search

Analyze the execution plan

Perform approximate search by using vector indexes

Example: Approximate search by using the HNSW index

Execution plan analysis

Summary

OceanBase

Customer Stories

Documentation

Vector similarity search

Notice

Perform exact search

Example: Euclidean search

Analyze the execution plan

Perform approximate search by using vector indexes

Example: Approximate search by using the HNSW index

Execution plan analysis

Summary

Related topics