This topic describes how to store unstructured data (vector embeddings), semi-structured data, and structured data in OceanBase Database. Storing these data types together builds on the existing capabilities of OceanBase Database and provides the foundation for hybrid search.
Storage principle
OceanBase Database can store different types of data. Data such as text, images, and videos is converted into vectors, and information is then retrieved by calculating the distances between these vectors. Hybrid search can be performed in two ways: simple search based on the similarity of a single vector, and complex search that involves both vectors and scalars.
Vector search is inherently approximate. In practical applications, you therefore need to combine it with other methods to improve accuracy, because the value of search to a business depends on how accurate the results are.
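The principle described above, ranking rows by vector distance and optionally filtering on a scalar value first, can be sketched in plain Python. This is an illustration only: the sample rows, the `category` field, and the function names are hypothetical, and the sketch does not reflect how OceanBase Database implements search internally.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rows: (id, category, embedding).
rows = [
    (1, "doc", [0.1, 0.2, 0.3]),
    (2, "img", [0.9, 0.8, 0.7]),
    (3, "doc", [0.2, 0.2, 0.2]),
]

def vector_search(query, k):
    """Simple search: rank every row by its distance to the query vector."""
    ranked = sorted(rows, key=lambda r: l2_distance(r[2], query))
    return [r[0] for r in ranked[:k]]

def hybrid_search(query, category, k):
    """Search with vectors and scalars: filter on a scalar, then rank by distance."""
    candidates = [r for r in rows if r[1] == category]
    ranked = sorted(candidates, key=lambda r: l2_distance(r[2], query))
    return [r[0] for r in ranked[:k]]
```

This sketch scans every row and is therefore exact. A vector index trades some of that exactness for speed, which is where the approximation mentioned above comes from.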
Prerequisites
Before you can perform vector search, estimate how much memory the index data will use in the MySQL user tenant, and configure the parameter accordingly. The following command sets the memory available to vector indexes to 30% of the tenant memory:
ALTER SYSTEM SET ob_vector_memory_limit_percentage = 30;
The default value of ob_vector_memory_limit_percentage is 0, which means that no memory is allocated for vector indexes. With this default, an attempt to create a vector index returns an error.
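As a rough aid for the estimate mentioned above, the following Python sketch compares the raw size of the vector data with the share of tenant memory granted by ob_vector_memory_limit_percentage. It rests on assumptions that are not stated in this topic: each vector element is taken to occupy 4 bytes, and the index's own overhead is ignored, so the result is only a lower bound and not the sizing formula of OceanBase Database. All function names are hypothetical.

```python
def raw_vector_bytes(row_count, dimension, bytes_per_value=4):
    """Raw payload of the vectors alone (assumes 4 bytes per element)."""
    return row_count * dimension * bytes_per_value

def vector_memory_budget(tenant_memory_bytes, limit_percentage):
    """Bytes made available to vector indexes by the percentage setting."""
    return tenant_memory_bytes * limit_percentage // 100

def fits_budget(row_count, dimension, tenant_memory_bytes, limit_percentage):
    """Lower-bound check: index overhead is NOT included in this comparison."""
    needed = raw_vector_bytes(row_count, dimension)
    return needed <= vector_memory_budget(tenant_memory_bytes, limit_percentage)

# Hypothetical workload: 1,000,000 vectors of dimension 768 in a 16 GiB tenant.
tenant_memory = 16 * 1024 ** 3
print(raw_vector_bytes(1_000_000, 768))                  # raw payload in bytes
print(fits_budget(1_000_000, 768, tenant_memory, 30))    # with the 30% setting
print(fits_budget(1_000_000, 768, tenant_memory, 0))     # with the default of 0
```

Because the check ignores index structures, a result that only just fits should be treated as too tight.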
Create a vector column
The following example shows a table that stores vector data, spatial data, and relational data. The data type of the vector column is VECTOR, and it requires you to specify the dimension when you create it. The maximum supported dimension is 16,000. The data type of the spatial column is GEOMETRY.
CREATE TABLE t (
    -- Store structured data.
    id INT PRIMARY KEY,
    -- Store spatial data.
    g GEOMETRY,
    -- Store vector data.
    vec VECTOR(3)
);
Use the INSERT statement to insert vector data
After you create a table with a VECTOR column, you can use the INSERT statement to insert vectors directly into the table. The dimension of each inserted vector must match the dimension specified in the table definition, which keeps the data consistent and queries efficient. A vector is written as an array of floating-point numbers, and each element must be a valid floating-point number. Here is a simple example:
INSERT INTO t (id, g, vec) VALUES (
    -- Insert structured data.
    1,
    -- Insert spatial (semi-structured) data.
    ST_GeomFromText('POINT(1 1)'),
    -- Insert vector (unstructured) data.
    '[0.1, 0.2, 0.3]'
);
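The insertion rules above (matching dimension, valid floating-point elements, array format) can be checked on the client before the statement is sent. The following Python sketch builds a vector literal in the format used in the example. The function name `to_vector_literal` is hypothetical, and rejecting NaN and infinite values is an assumption about what counts as a valid floating-point number here.

```python
import math

def to_vector_literal(values, dimension):
    """Return a string such as '[0.1, 0.2, 0.3]' for a VECTOR(dimension) column."""
    if len(values) != dimension:
        raise ValueError(
            f"expected {dimension} elements, got {len(values)}"
        )
    floats = []
    for v in values:
        f = float(v)  # raises ValueError for non-numeric strings
        if not math.isfinite(f):
            # Assumption: NaN and infinity are treated as invalid elements.
            raise ValueError(f"invalid vector element: {v!r}")
        floats.append(f)
    return "[" + ", ".join(repr(f) for f in floats) + "]"

# Matches the literal used for the vec VECTOR(3) column above.
print(to_vector_literal([0.1, 0.2, 0.3], 3))
```

The resulting string can then be supplied as the value of the vec column in an INSERT statement.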