This topic describes the concept of vector embeddings in vector search and how to generate vector embeddings.
What is vector embedding?
Vector embedding is a technique that converts unstructured data into numerical vectors. These vectors capture the semantic information of unstructured data, allowing computers to "understand" and process the meaning of unstructured data. Specifically:
Vector embedding maps unstructured data such as text, images, or audio and video to points in a high-dimensional vector space.
In this vector space, semantically similar unstructured data is mapped to nearby locations.
Vectors typically consist of hundreds of numbers (such as 512-dimensional or 1024-dimensional vectors).
You can calculate the similarity between vectors using mathematical methods such as cosine similarity (see the sketch after this list).
Common text embedding models include Word2Vec, BERT, and BGE.
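The following is a minimal sketch of the cosine similarity calculation mentioned above. The vectors are made-up toy values for illustration, and NumPy (pip install numpy) is assumed to be available:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of two vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings usually have hundreds or thousands of dimensions
v1 = np.array([0.1, 0.3, -0.2, 0.8])
v2 = np.array([0.1, 0.2, -0.1, 0.9])
v3 = np.array([-0.7, 0.1, 0.5, -0.3])

print(cosine_similarity(v1, v2))  # Close to 1: the two vectors point in similar directions
print(cosine_similarity(v1, v3))  # Much lower: the two vectors are dissimilar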
For example, when developing a RAG application, you usually need to embed the text data into vectors and store them in a vector database, while storing other structured data in a relational database.
Starting from V4.3.3, OceanBase Database supports vectors as a field type in relational tables, which allows vectors and conventional scalar data to be stored efficiently together in OceanBase Database.
Common vector embedding methods
Preparations
You need to install pip in advance.
Use offline and local pre-trained embedding models
Embedding text with a pre-trained model that runs locally is the most flexible approach, but it requires significant computational resources. Common tools for doing this include Sentence Transformers and Hugging Face Transformers, which are described below.
Use Sentence Transformers
Sentence Transformers is a Python framework for natural language processing (NLP) that converts sentences or paragraphs into vector embeddings. It is built on deep learning, specifically the transformer architecture, and effectively captures the semantic information of text. Because direct access to the Hugging Face domain name may time out in the Chinese mainland, set the Hugging Face mirror address in advance by running export HF_ENDPOINT=https://hf-mirror.com, and then execute the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
Use Hugging Face Transformers
Hugging Face Transformers is an open-source library that provides a wide range of pre-trained deep learning models, particularly for natural language processing (NLP) tasks. Because direct access to the Hugging Face domain name may time out in the Chinese mainland, set the Hugging Face mirror address in advance by running export HF_ENDPOINT=https://hf-mirror.com, and then execute the following code:
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare the input
texts = ["This is sample text."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate the embedding
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0]  # Use the output of the [CLS] token
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
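To compare two texts with the [CLS] embeddings produced above, you can L2-normalize them and take their dot product, which is equivalent to cosine similarity. The following is a minimal sketch that reuses the tokenizer and model loaded above; the example sentences are arbitrary:
import torch.nn.functional as F

texts = ["That is a happy person", "That is a very happy person"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# L2-normalize the [CLS] embeddings so that the dot product equals cosine similarity
embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(float(embeddings[0] @ embeddings[1]))  # A value close to 1 indicates high semantic similarity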
Ollama
Ollama is an open-source tool that allows users to easily run, manage, and use various large language models locally. In addition to supporting open-source language models such as Llama 3 and Mistral, Ollama also supports embedding models such as bge-m3.
Deploy Ollama
On macOS and Windows, you can download the installation package from the official website of Ollama and install it; see the official website for the detailed installation procedure. After the installation is complete, Ollama runs in the background as a service.
To install Ollama on Linux, run the following command:
curl -fsSL https://ollama.ai/install.sh | sh
Pull an embedding model
Ollama supports the use of the bge-m3 model for text embedding:
ollama pull bge-m3
Use Ollama for text embedding
You can use the embedding capability of Ollama through its HTTP API or Python SDK.
HTTP API
import requests

def get_embedding(text: str) -> list:
    """Use the HTTP API of Ollama to obtain the text embedding."""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    return response.json()['embedding']

# Example usage
text = "This is a sample text."
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]
Python SDK
First, install the Python SDK of Ollama:
pip install ollama
Then, use it as follows:
import ollama

# Example usage
texts = ["Sentence 1", "Sentence 2"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
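As a quick sanity check, you can compare the two embeddings returned above with cosine similarity. This is a minimal sketch that assumes NumPy (pip install numpy) is available:
import numpy as np

v1, v2 = np.array(embeddings[0]), np.array(embeddings[1])
# Cosine similarity between the two sentence embeddings returned by Ollama
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))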
Advantages and disadvantages of Ollama
Advantages:
- Fully local deployment, no need for internet connection
- Open-source and free, no API key required
- Supports multiple models, easy to switch and compare
- Relatively low resource usage
Disadvantages:
- Limited embedding model options
- Performance may not match commercial services
- Requires self-maintenance and updates
- No enterprise-level support
When deciding whether to use Ollama, weigh these factors. If your application scenario prioritizes privacy or offline operation, Ollama is a good choice. However, if you need higher service stability and better performance, you may want to opt for commercial services.
Use online and remote embedding services
Deploying a local, offline embedding model places high demands on the specifications of the deployment machine and on the management of tasks such as model loading and unloading. Therefore, many users prefer online embedding services. Currently, many AI inference service providers offer text embedding services. Take the text embedding service of Tongyi Qianwen as an example: first, register an account on Alibaba Cloud BaiLian and obtain an API key. Then, you can call its public interface to obtain text embedding results.
HTTP call
After you obtain the API key, you can use the following code to try embedding text. If the requests package is not installed in your Python environment, run pip install requests to install it first.
import requests
from typing import List
class RemoteEmbedding():
    """
    OpenAI compatible embedding API. Tongyi, Baichuan, Doubao, etc.
    """
    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.

        Args:
            texts: List of text to embed.

        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.

        Args:
            text: Text to embed.

        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # You can refer to https://bailian.console.aliyun.com/#/model-market/detail/text-embedding-v3?tabKey=sdk
    api_key="your-api-key",  # Fill in with your API Key
    model="text-embedding-v3",
)

print("Embedding result:", embedding.embed_query("Today is a nice day"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
print("Embedding results:", embedding.embed_documents(["Today is a nice day", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
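Online embedding services typically limit how many texts you can send in a single request; the exact limit depends on the provider and model, so check the service documentation. The following is a minimal sketch that reuses the RemoteEmbedding class above and splits a long document list into batches; the batch size of 10 is only an assumed placeholder, not a documented limit:
from typing import List

def embed_in_batches(embedder: RemoteEmbedding, texts: List[str], batch_size: int = 10) -> List[List[float]]:
    """Embed a long list of texts by sending them to the remote service in small batches."""
    all_embeddings: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        all_embeddings.extend(embedder.embed_documents(batch))
    return all_embeddings

# Example usage with the embedding instance created above
docs = [f"Document {i}" for i in range(25)]
vectors = embed_in_batches(embedding, docs)
print(len(vectors))  # 25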
Use the Tongyi Qianwen SDK
Tongyi Qianwen provides an SDK named dashscope for quickly calling model capabilities. After you install it by running pip install dashscope, you can use it to obtain text embeddings.
import dashscope
from dashscope import TextEmbedding
# Set the API key.
dashscope.api_key = "your-api-key"
# Prepare the input text.
texts = ["This is the first sentence.", "This is the second sentence."]
# Call the embedding service.
response = TextEmbedding.call(
model="text-embedding-v3",
input=texts
)
# Obtain the embedding result.
if response.status_code == 200:
    print(response.output['embeddings'])
    # [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
Common image embedding methods
Use an offline and local pre-trained embedding model
Use CLIP
Contrastive Language-Image Pretraining (CLIP) is a model proposed by OpenAI for multimodal learning by combining images and text. CLIP can understand and process the relationships between images and text, enabling it to excel in various tasks such as image classification, image retrieval, and text generation.
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Prepare the input image and texts
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]
# Preprocess the inputs and run the model locally
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Obtain the embedding results: CLIP projects the image and the texts into the same vector space
image_embeddings = outputs.image_embeds  # Shape: [1, 512] for clip-vit-base-patch32
text_embeddings = outputs.text_embeds    # Shape: [2, 512], one embedding per input text
print(image_embeddings)
print(text_embeddings)