Vector embedding technology|V4.4.2|OceanBase Database| docs|Distributed Database

This topic introduces vector embedding technology for vector search.

What is vector embedding?

Vector embedding is a technique for converting unstructured data into numerical vectors. These vectors can capture the semantic information of unstructured data, enabling computers to "understand" and process the meaning of such data. Specifically:

- Vector embedding maps unstructured data such as text, images, audio, or video to points in a high-dimensional vector space.
- In this vector space, semantically similar data is mapped to nearby locations.
- Vectors typically have hundreds to a few thousand dimensions (for example, 512 or 1024).
- Mathematical measures such as cosine similarity can be used to calculate the similarity between vectors.

Common vector embedding models include Word2Vec, BERT, and BGE. For example, when developing RAG applications, text data is often embedded into vectors and stored in a vector database, while other structured data is stored in a relational database.
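
To make the cosine similarity measure mentioned above concrete, the following is a minimal sketch in plain Python. The function name `cosine_similarity` and the three-dimensional toy vectors are illustrative assumptions, not output from any real embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (illustrative values, not real model output).
happy_person = [0.9, 0.1, 0.0]
very_happy_person = [0.8, 0.2, 0.0]
sunny_day = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(happy_person, very_happy_person))
# Vectors pointing in unrelated directions score close to 0.
print(cosine_similarity(happy_person, sunny_day))
```

Real embeddings work the same way, only with many more dimensions.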

Starting from OceanBase Database V4.3.3, vector data can be stored as a data type in relational tables, so vectors and traditional scalar data can be stored together in OceanBase.
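
As a rough sketch of how an application might hand an embedding to a relational table, the helper below formats a Python list as a bracketed text literal. It assumes the vector column accepts a string such as `[0.25,-0.5,1.0]`; check the OceanBase vector data type documentation for the exact accepted format. The table name `documents`, its columns, and the helper name `to_vector_literal` are hypothetical.

```python
from typing import Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as a bracketed, comma-separated string."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical table and column names; the values are bound by the database driver.
sql = "INSERT INTO documents (id, content, embedding) VALUES (%s, %s, %s)"
params = (1, "That is a happy person", to_vector_literal([0.25, -0.5, 1.0]))
print(sql, params)
```

Passing the literal as a bound parameter, rather than concatenating it into the SQL text, keeps the statement safe to reuse for every row.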

Generate vector embeddings in OceanBase with the AI function service

Starting from OceanBase V4.4.1, you can generate vector embeddings using the AI function service. No extra dependencies are required: register your model and you can generate embeddings inside the database. For details, see AI function syntax and examples.

Common text embedding methods

This section describes text embedding methods.

Preparations

Install the pip command if it is not already available.

Use offline, local pre-trained embedding models

Using pre-trained models locally is flexible but needs more compute. Common options include:

Sentence Transformers

Sentence Transformers turns sentences or paragraphs into vector embeddings. It uses deep learning (including the Transformer architecture) to capture text semantics. If access to Hugging Face is slow from your region, set the mirror first: export HF_ENDPOINT=https://hf-mirror.com, then run the code below.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016  0.00884024 -0.05844684 ...  0.00750248 -0.04790139
#   0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ...  0.00899352 -0.02396279
#   0.02985837]
# [-0.01356584  0.01900942 -0.05800966 ...  0.00523864 -0.05689549
#   0.00077098]
# [-0.02149693  0.02998871 -0.05638731 ...  0.01443702 -0.02131325
#  -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
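
The 4 x 4 result above holds one score for every pair of sentences. The sketch below shows, in plain Python, roughly what such a pairwise matrix contains and how to read it. It assumes the score is cosine similarity (believed to be the default in recent Sentence Transformers releases, but worth verifying for your version), and it uses three-dimensional toy vectors as stand-ins for real model output. The helper names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Cosine similarity between every pair of vectors."""
    unit = [normalize(v) for v in vectors]
    return [[sum(x * y for x, y in zip(a, b)) for b in unit] for a in unit]


def most_similar(matrix: List[List[float]], i: int) -> int:
    """Index of the row most similar to row i, excluding i itself."""
    candidates = [j for j in range(len(matrix)) if j != i]
    return max(candidates, key=lambda j: matrix[i][j])


# Toy 3-dimensional stand-ins for the four sentences (not real model output).
toy_embeddings = [
    [0.9, 0.1, 0.0],  # "That is a happy person"
    [0.6, 0.6, 0.0],  # "That is a happy dog"
    [0.8, 0.2, 0.0],  # "That is a very happy person"
    [0.0, 0.1, 0.9],  # "Today is a sunny day"
]
matrix = similarity_matrix(toy_embeddings)
print(len(matrix), len(matrix[0]))  # 4 4
print(most_similar(matrix, 0))      # 2
```

Each diagonal entry is 1 because every vector is identical to itself, and the matrix is symmetric.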

Hugging Face Transformers

Hugging Face Transformers provides many pre-trained models for NLP and other tasks. If access to Hugging Face is slow from your region, set the mirror first: export HF_ENDPOINT=https://hf-mirror.com, then run the code below.

from transformers import AutoTokenizer, AutoModel
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")

# Prepare the input
texts = ["This is an example text."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0]  # Use the [CLS] token's output
    print(embeddings)
    # tensor([[-1.4136,  0.7477, -0.9914,  ...,  0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
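
The example above pools by taking the first token's hidden state. Another common strategy is to average the hidden states of all non-padding tokens. The sketch below contrasts the two using nested lists instead of tensors so that it runs without any model; the function names and toy values are illustrative assumptions, and which strategy is appropriate depends on how the model was trained.

```python
from typing import List


def cls_pool(hidden: List[List[float]]) -> List[float]:
    """Return the first token's hidden state (what `[:, 0]` selects per sequence)."""
    return list(hidden[0])


def mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the hidden states of real tokens, skipping padding (mask == 0)."""
    kept = [row for row, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy hidden states for one sequence: 3 tokens x 2 dimensions; the last token is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(cls_pool(hidden_states))                   # [1.0, 2.0]
print(mean_pool(hidden_states, attention_mask))  # [2.0, 3.0]
```

Ignoring the mask would let padding tokens distort the average, which is why the mask is applied before dividing.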

Ollama

Ollama is an open-source runtime for running, managing, and using large language models locally. It supports both LLMs (e.g. Llama 3, Mistral) and embedding models such as bge-m3.

Deploy Ollama.

On macOS and Windows, download and install from the Ollama website. After installation, Ollama runs as a background service.

On Linux:
```
curl -fsSL https://ollama.ai/install.sh | sh
```
Pull an embedding model.

Ollama supports the bge-m3 model for text embedding:
```
ollama pull bge-m3
```

Use Ollama for text embedding.

You can call Ollama's embedding API over HTTP or via the Python SDK:

HTTP API

import requests

def get_embedding(text: str) -> list:
    """Use the HTTP API of Ollama to obtain text embeddings."""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    response.raise_for_status()  # Fail early if the Ollama server returns an error
    return response.json()['embedding']

# Example usage
text = "This is an example text."
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]

Python SDK

First, install the Python SDK for Ollama:

pip install ollama

Then, you can use it like this:

import ollama

# Example usage
texts = ["First sentence", "Second sentence"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]

Advantages and limitations:

Advantages:
- Fully local deployment; no internet connection is needed once the models are downloaded
- Open-source and free, with no API key required
- Supports multiple models, making it easy to switch and compare
- Relatively low resource usage

Limitations:
- Limited selection of embedding models
- Performance may not match commercial services
- Requires self-maintenance and updates
- Lacks enterprise-level support

Consider these trade-offs when choosing Ollama. It is a good fit when you need strong privacy or fully offline operation. For higher stability and performance, a commercial embedding service may be better.

Use online, remote embedding services

Local embedding models often need powerful hardware and careful management of loading and unloading. Many users prefer an online embedding service. Several AI providers offer text embedding APIs. For example, with Alibaba Cloud Model Studio (Bailian), you register, obtain an API key, and call the public API to get embeddings.

HTTP API

After you have an API key, you can run the code below to get text embeddings. Install the requests package first if needed: pip install requests.

import requests
from typing import List

class RemoteEmbedding():
    """Client for OpenAI-compatible embedding APIs (Tongyi, Baichuan, Doubao, etc.)."""

    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.

        Args:
            texts: List of text to embed.

        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.

        Args:
            text: Text to embed.

        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # See https://bailian.console.aliyun.com for API details
  api_key="your-api-key",  # Replace with your API key
  model="text-embedding-v3",
)

print("Embedding result:", embedding.embed_query("The weather is nice today"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
print("Embedding results:", embedding.embed_documents(["The weather is nice today", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
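
Remote embedding APIs usually cap how many texts a single request may carry, so larger document sets are typically sent in batches. The following sketch shows one way to do that. The default `batch_size` of 10 is an assumed placeholder (check your provider's documented limit), and `fake_embed` is a hypothetical stand-in for a real call such as `embed_documents` so that the example runs offline.

```python
from typing import Callable, Iterator, List


def batched(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 10,  # Assumed placeholder; use your provider's limit.
) -> List[List[float]]:
    """Call `embed_fn` once per batch and concatenate the results in order."""
    results: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise RuntimeError("embedding service returned an unexpected number of vectors")
        results.extend(vectors)
    return results


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Offline stand-in for a remote call: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Checking that each batch returns as many vectors as texts catches partial failures before mismatched rows reach the database.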

Qwen (Dashscope) SDK

Qwen provides the Dashscope SDK to call its models. Install it with pip install dashscope, then you can get text embeddings as follows:

import dashscope
from dashscope import TextEmbedding

# Set the API key.
dashscope.api_key = "your-api-key"

# Prepare the input text.
texts = ["This is the first sentence", "This is the second sentence"]

# Call the embedding service.
response = TextEmbedding.call(
    model="text-embedding-v3",
    input=texts
)

# Retrieve the embeddings.
if response.status_code == 200:
    print(response.output['embeddings'])
# [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
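
The examples above place the API key directly in the source for brevity. A common alternative is to read it from an environment variable so that the key never lands in version control. The sketch below assumes a variable named `EMBEDDING_API_KEY`, which is a hypothetical name chosen for illustration; some SDKs also read their own variable, so check the SDK documentation.

```python
import os


def load_api_key(env_var: str = "EMBEDDING_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Demo only: provide a placeholder so the sketch runs without a real key.
os.environ.setdefault("EMBEDDING_API_KEY", "your-api-key")
api_key = load_api_key()
print(bool(api_key))  # True
```

The returned value can then be passed wherever the examples above use the literal "your-api-key".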

Common image embedding methods

This section describes image embedding methods.

Use offline, local pre-trained embedding models

CLIP

CLIP (Contrastive Language-Image Pre-training) is a model from OpenAI that learns from image and text together. It captures the relationship between images and text and is used for image classification, image search, and text-to-image tasks.

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare the input image and the candidate texts
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]

# Run the model locally to embed the image and the texts
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Obtain the embedding results (image and text share one vector space)
print(outputs.image_embeds.shape)
# torch.Size([1, 512])
print(outputs.text_embeds.shape)
# torch.Size([2, 512])
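
Because CLIP places images and text in one vector space, an image can be matched against candidate captions by comparing embeddings. The sketch below illustrates that idea in plain Python with toy three-dimensional vectors instead of real model output. The function names are hypothetical, and `scale = 100.0` is an assumed stand-in for the learned temperature that CLIP applies before the softmax.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(
    image_vec: Sequence[float],
    text_vecs: Sequence[Sequence[float]],
    scale: float = 100.0,  # Assumed stand-in for CLIP's learned temperature.
) -> List[float]:
    """Probability that each caption matches the image."""
    return softmax([scale * cosine(image_vec, t) for t in text_vecs])


# Toy embeddings (illustrative values, not real CLIP output).
image_embedding = [0.2, 0.9, 0.1]
text_embeddings = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0]]

probs = rank_captions(image_embedding, text_embeddings)
print(probs.index(max(probs)))  # 0
```

The caption whose embedding points in the direction closest to the image embedding receives nearly all of the probability mass.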
