This topic describes vector embedding in vector search.
What is vector embedding?
Vector embedding is a technique that converts unstructured data into numerical vectors. These vectors capture the semantic information of the unstructured data, allowing computers to "understand" and process the meaning of the unstructured data. Specifically:
- Vector embedding maps unstructured data such as text, images, or audio/video to points in a high-dimensional vector space.
- In this vector space, semantically similar unstructured data are mapped to nearby positions.
- Vectors are typically composed of hundreds of numbers (e.g., 512 dimensions, 1024 dimensions, etc.).
- The similarity between vectors can be calculated using mathematical methods such as cosine similarity (a short sketch follows this list).
- Common vector embedding models include Word2Vec, BERT, and BGE. For example, when developing a RAG application, you usually need to embed the text data into vectors and store them in a vector database, while other structured data is stored in a relational database.
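As an illustration of the cosine similarity mentioned in the list above, the following is a minimal sketch using NumPy. The vectors here are made-up examples rather than the output of any real embedding model:
import numpy as np
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Made-up 4-dimensional vectors for illustration only
v1 = np.array([0.12, -0.05, 0.33, 0.48])
v2 = np.array([0.10, -0.02, 0.30, 0.50])
print(cosine_similarity(v1, v2))
# A value close to 1.0 indicates high semantic similarity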
Starting from OceanBase Database V4.3.3, you can store vector data as a data type in a relational table. This allows vectors and traditional scalar data to be stored in OceanBase Database in an organized and efficient manner.
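As a hedged illustration of how vector data and scalar data can live in the same relational table, the following minimal sketch assumes a MySQL-protocol connection through the pymysql package and the VECTOR column type; the connection parameters, table name, column names, and vector dimension are hypothetical, and the exact syntax may vary between OceanBase Database versions:
import pymysql
# Hypothetical connection parameters; adjust them to your OceanBase Database deployment
conn = pymysql.connect(host="127.0.0.1", port=2881, user="root", password="", database="test")
with conn.cursor() as cur:
    # A table that mixes scalar columns with a vector column (assumed to be 3-dimensional here)
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id INT PRIMARY KEY, content VARCHAR(1024), embedding VECTOR(3))")
    # Vector values are written as bracketed string literals
    cur.execute("INSERT INTO docs VALUES (1, 'hello world', '[0.12, -0.05, 0.33]')")
conn.commit()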
Generate vector embeddings in OceanBase Database by using AI Function Service
OceanBase Database supports generating vector embeddings by using AI Function Service. You do not need to install any dependencies. You only need to register the model information. For more information, see AI Function Service syntax and examples.
Common text embedding methods
This section describes text embedding methods.
Prerequisites
Make sure pip is installed in your environment so that you can install the Python packages used in the following examples.
Use an offline, local pre-trained embedding model
Using a pre-trained model for local text embedding is the most flexible approach, but it requires significant computational resources. Commonly used options include the following.
Use Sentence Transformers
Sentence Transformers are models designed for natural language processing (NLP) tasks that convert sentences or paragraphs into vector embeddings. They are based on deep learning techniques, particularly the Transformer architecture, which effectively captures the semantic information of text. If the sentence-transformers package is not installed in your Python environment, install it with pip install sentence-transformers. Because directly accessing the Hugging Face domain from China may time out, set the Hugging Face mirror address before proceeding: export HF_ENDPOINT=https://hf-mirror.com. After setting this, execute the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
Use Hugging Face Transformers
Hugging Face Transformers is an open-source library that provides a wide range of pre-trained deep learning models, especially for natural language processing (NLP) tasks. If needed, install it together with PyTorch by running pip install transformers torch. Because directly accessing the Hugging Face domain may time out in some regions, set the Hugging Face mirror address before proceeding: export HF_ENDPOINT=https://hf-mirror.com. After setting this, execute the following code:
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare input
texts = ["This is an example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0]  # Use the output of the [CLS] token
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
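For similarity search, these [CLS] embeddings are usually L2-normalized so that the dot product of two vectors equals their cosine similarity. A minimal sketch that continues from the code above:
import torch.nn.functional as F
# L2-normalize each embedding along the feature dimension
normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
print(normalized_embeddings.shape)
# torch.Size([1, 1024])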
Ollama
Ollama is an open-source tool that allows users to easily run, manage, and use various large language models locally. In addition to supporting open-source language models like Llama 3 and Mistral, it also supports embedding models like bge-m3.
Deploy Ollama
On macOS and Windows, you can download the installation package from Ollama's official website and install it by following the instructions there. After installation, Ollama runs as a service in the background.
On Linux, install Ollama:
curl -fsSL https://ollama.ai/install.sh | sh
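After the service starts, you can optionally verify that it is reachable. The following minimal sketch assumes the default local endpoint http://localhost:11434 and queries the /api/tags endpoint, which lists the models available locally:
import requests
# List the locally available models to confirm that the Ollama service is running
response = requests.get('http://localhost:11434/api/tags')
print(response.json())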
Pull the embedding model
Ollama supports using the bge-m3 model for text embedding:
ollama pull bge-m3
Use Ollama for text embedding
You can use Ollama's embedding capabilities through its HTTP API or Python SDK:
HTTP API
import requests

def get_embedding(text: str) -> list:
    """Get text embeddings using Ollama's HTTP API"""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    return response.json()['embedding']

# Example usage
text = "This is an example text"
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]
Python SDK
First, install the Ollama Python SDK:
pip install ollama
Then, you can use it as follows:
import ollama

# Example usage
texts = ["First sentence", "Second sentence"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
Advantages and Limitations of Ollama
Advantages:
- Fully local deployment, no need for internet connection
- Open-source and free, no API key required
- Supports multiple models, easy to switch and compare
- Relatively low resource consumption
Limitations:
- Limited selection of embedding models
- Performance may not match commercial services
- Requires self-maintenance and updates
- Lacks enterprise-level support
When deciding whether to use Ollama, consider these factors. If your application requires high privacy or complete offline operation, Ollama is a good choice. However, if you need more stable service quality and better performance, commercial services may be more suitable.
Use online remote embedding services
Running an offline, local embedding model typically requires a higher-spec deployment machine and more effort to manage model loading and unloading, so many users prefer online embedding services. As a result, many AI inference service providers now offer text embedding services. For example, to use the Qwen text embedding service, you can register an account on Alibaba Cloud Model Studio (Bailian), obtain an API key, and then call its public API to get the text embedding results.
HTTP call
After obtaining the API Key, you can use the following code to perform text embedding. If the requests package is not installed in your Python environment, you need to install it using pip install requests to send network requests.
import requests
from typing import List
class RemoteEmbedding():
    """
    OpenAI compatible embedding API. Tongyi, Baichuan, Doubao, etc.
    """

    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.
        Args:
            texts: List of text to embed.
        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.
        Args:
            text: Text to embed.
        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # For more information, see https://bailian.console.aliyun.com/#/model-market/detail/text-embedding-v3?tabKey=sdk
    api_key="your-api-key",  # Fill in your API Key
    model="text-embedding-v3",
)
print("Embedding result:", embedding.embed_query("Today's weather is nice"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
print("Embedding results:", embedding.embed_documents(["Today's weather is nice", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
Use the Qwen SDK
Qwen provides an SDK called dashscope for quick model calls. After installing it with pip install dashscope, you can obtain text embeddings as follows.
import dashscope
from dashscope import TextEmbedding
# Set the API Key
dashscope.api_key = "your-api-key"
# Prepare the input text
texts = ["This is the first sentence", "This is the second sentence"]
# Call the embedding service
response = TextEmbedding.call(
    model="text-embedding-v3",
    input=texts
)
# Get the embedding results
if response.status_code == 200:
    print(response.output['embeddings'])
    # [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
Common image embedding methods
This section introduces image embedding methods.
Using an offline, locally pre-trained embedding model
Using CLIP
CLIP (Contrastive Language-Image Pretraining) is a model proposed by OpenAI that performs multimodal learning by jointly training on images and text. CLIP can understand and process the relationships between images and text, allowing it to excel in tasks such as zero-shot image classification, image search, and image-text matching.
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Prepare the input image
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]
# Run the model to obtain the image and text embeddings
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Obtain the embedding results
image_embeds = outputs.image_embeds  # Image embeddings projected into the shared space
text_embeds = outputs.text_embeds    # Text embeddings projected into the shared space
print(image_embeds.shape, text_embeds.shape)
# torch.Size([1, 512]) torch.Size([2, 512])
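If you only need image embeddings, for example to store them in a vector database, you can call the image encoder directly. A minimal sketch based on the same model and processor as above:
# Encode the image alone with CLIP's image encoder
image_inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**image_inputs)
print(image_features.shape)
# torch.Size([1, 512])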
