This topic describes the concept of text embedding in vector search and provides some examples.
What is text embedding?
Text embedding is a technique for converting text into numerical vectors that capture its semantics, so that computers can "understand" and process the meaning of the text. Specifically:
Text embedding maps words or sentences to points in a high-dimensional vector space.
In this vector space, semantically similar texts are mapped to nearby positions.
Vectors usually consist of hundreds or thousands of dimensions (for example, 512 or 1,024).
You can calculate the similarity between vectors by using methods such as cosine similarity, as shown in the sketch after this list.
Common text embedding models include Word2Vec, BERT, and BGE.
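As an illustration of the similarity computation mentioned above, the following minimal sketch computes cosine similarity with NumPy. The two 4-dimensional vectors are made-up toy values, not real model output:

import numpy as np

# Two toy 4-dimensional vectors standing in for real embeddings,
# which typically have hundreds or thousands of dimensions.
a = np.array([0.1, 0.3, -0.2, 0.7])
b = np.array([0.2, 0.1, -0.1, 0.8])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # Closer to 1 means more semantically similar.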
When you develop a RAG application, you usually need to embed the text data and store the resulting vectors in a vector database, while storing the structured data in a relational database. Starting from OceanBase Database V4.3.3, you can store vectors as a field type in a relational table, which allows you to store both vectors and conventional scalar data in an orderly and efficient manner in OceanBase Database.
Common text embedding methods
Preparation
You need to install pip in advance.
Use an offline or local pre-trained embedding model
Embedding text locally with a pre-trained model provides the most flexibility but requires substantial computing resources. Several commonly used approaches are described below.
Use Sentence Transformers
Direct access to the Hugging Face domain usually times out in China. Therefore, set the environment variable export HF_ENDPOINT=https://hf-mirror.com in advance. If the sentence-transformers package is not installed, run pip install sentence-transformers to install it first. Then, run the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
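To inspect the scores rather than just the shape, print the matrix itself:

print(similarities)
# The matrix is symmetric, with values near 1.0 on the diagonal (each sentence
# compared with itself). You should expect "That is a happy person" and
# "That is a very happy person" to be the closest off-diagonal pair, and
# "Today is a sunny day" to score lower against all the others.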
Use Hugging Face Transformers
Direct access to the Hugging Face domain usually times out in China. Therefore, set the environment variable export HF_ENDPOINT=https://hf-mirror.com in advance. If the transformers and torch packages are not installed, run pip install transformers torch to install them first. Then, run the following code:
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare the input
texts = ["This is a sample text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate the embedding
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0]  # Use the output of the [CLS] token
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
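Using the [CLS] token is one common pooling strategy; another is mean pooling over all token embeddings. Here is a minimal sketch that reuses the inputs and outputs variables from the code above; whether it matches [CLS] pooling in quality depends on how the model was trained:

# Alternative: mean pooling over token embeddings. The attention mask ensures
# that padding tokens do not contribute to the average.
mask = inputs["attention_mask"].unsqueeze(-1)  # shape: [batch, seq_len, 1]
mean_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)
# torch.Size([1, 1024])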
Ollama
Ollama is an open-source model runtime that allows you to easily run, manage, and use various large language models on your local machine. In addition to supporting open-source language models such as Llama 3 and Mistral, it also supports the bge-m3 embedding model.
- Deploy Ollama
On macOS and Windows, you can download and run the installer directly. For more information, see the official website of Ollama. After the installation is complete, Ollama runs in the background as a service.
To install Ollama on Linux, run the following command:
curl -fsSL https://ollama.ai/install.sh | sh
- Pull the embedding model
Ollama supports using the bge-m3 model for text embedding:
ollama pull bge-m3
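After the pull completes, you can run ollama list to confirm that the model is available locally.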
- Use Ollama to embed text
You can use Ollama's embedding capability through the HTTP API or Python SDK:
Use the HTTP API
import requests

def get_embedding(text: str) -> list:
    """Use the HTTP API of Ollama to obtain the text embedding"""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    return response.json()['embedding']

# Example usage
text = "This is a sample text"
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]

Use the Python SDK
First, install the Ollama Python SDK:
pip install ollama

Then, use it as follows:
import ollama

# Example usage
texts = ["Sentence 1", "Sentence 2"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
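Both approaches assume that the Ollama service is running locally and listening on its default port 11434. As the example shows, ollama.embed accepts a list of texts, so you can embed a batch in a single call.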
- Advantages and limitations of Ollama
The advantages are as follows:
- Runs entirely on your local machine, with no network connection required
- Open source and free, with no API key required
- Supports multiple models, making it easy to switch and compare them
- Relatively small resource footprint
The limitations are as follows:
- Limited model selection
- Lower performance compared to commercial services
- Requires manual maintenance and updates
- No enterprise-level support
Weigh these factors when deciding whether to use Ollama. If your scenario requires high privacy or must run offline, Ollama is a good choice; otherwise, a commercial service may offer better service quality and performance.
Call an online or remote embedding service
Deploying an embedding model on a local server requires high-specification hardware, and you must manage the loading and unloading of the model yourself. Because many users prefer to consume embedding as an online service, many AI inference service providers offer hosted embedding APIs. For example, you can register an account on Alibaba Cloud Model Studio, obtain an API Key, and then call its public API to obtain text embeddings.
Call the API through HTTP
After you obtain the API Key, you can call the API to embed texts. If the requests package is not installed in your Python environment, run pip install requests to install it first.
import requests
from typing import List

class RemoteEmbedding:
    """
    OpenAI-compatible embedding API. Works with Tongyi, Baichuan, Doubao, etc.
    """

    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.

        Args:
            texts: List of texts to embed.

        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.

        Args:
            text: Text to embed.

        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # For more information, see https://bailian.console.aliyun.com/#/model-market/detail/text-embedding-v3?tabKey=sdk
    api_key="your-api-key",  # Enter your API Key.
    model="text-embedding-v3",
)

print("Embedding result:", embedding.embed_query("Today is a nice day"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]

print("Embedding results:", embedding.embed_documents(["Today is a nice day", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
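Because RemoteEmbedding only assumes an OpenAI-compatible /embeddings endpoint, you can point base_url at any provider that exposes this interface, such as the Tongyi, Baichuan, or Doubao services mentioned in its docstring, and adjust model accordingly. The dimensions parameter is both sent to the service and used to truncate each returned vector.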
Call the API using the Tongyi SDK
Tongyi provides the dashscope SDK to quickly call model capabilities. After you install the dashscope package using the pip install dashscope command, you can call its public API to obtain text embeddings.
import dashscope
from dashscope import TextEmbedding
# Set the API Key.
dashscope.api_key = "your-api-key"
# Prepare the input text.
texts = ["This is the first sentence.", "This is the second sentence."]
# Call the embedding service.
response = TextEmbedding.call(
    model="text-embedding-v3",
    input=texts
)

# Obtain the embedding results.
if response.status_code == 200:
    print(response.output['embeddings'])
    # [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
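If the call fails, the status code will not be 200; in that case, inspect the response object for the error details returned by the service before retrying.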