This topic describes the concept of vector embeddings in vector search and how to generate vector embeddings.
What is vector embedding?
Vector embedding is a technique that converts unstructured data into numerical vectors. These vectors capture the semantic information of unstructured data, allowing computers to "understand" and process the meaning of unstructured data. Specifically:
Vector embedding maps unstructured data such as text, images, or audio and video to points in a high-dimensional vector space.
In this vector space, semantically similar unstructured data is mapped to nearby locations.
Vectors typically consist of hundreds of numbers (such as 512-dimensional or 1024-dimensional vectors).
You can calculate the similarity between vectors using mathematical methods such as cosine similarity (see the sketch after this list).
Common text embedding models include Word2Vec, BERT, and BGE.
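The following is a minimal sketch of the cosine similarity calculation mentioned above. The vectors are made-up toy values for illustration, and NumPy (pip install numpy) is assumed to be available:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of two vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings usually have hundreds or thousands of dimensions
v1 = np.array([0.1, 0.3, -0.2, 0.8])
v2 = np.array([0.1, 0.2, -0.1, 0.9])
v3 = np.array([-0.7, 0.1, 0.5, -0.3])

print(cosine_similarity(v1, v2))  # Close to 1: the two vectors point in similar directions
print(cosine_similarity(v1, v3))  # Much lower: the two vectors are dissimilar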
For example, when developing a RAG application, you usually need to embed the text data into vectors and store them in a vector database, while storing other structured data in a relational database.
Starting from V4.3.3, OceanBase Database supports vectors as a field type in relational tables, which allows vectors and conventional scalar data to be stored efficiently together in OceanBase Database.
Common vector embedding methods
Preparations
You need to install pip in advance.
Use offline and local pre-trained embedding models
Embedding text with a pre-trained model that runs locally is the most flexible approach, but it requires significant computational resources. Common tools for doing this include Sentence Transformers and Hugging Face Transformers, which are described below.
Use Sentence Transformers
Sentence Transformers is a Python framework for natural language processing (NLP) that converts sentences or paragraphs into vector embeddings. It is built on deep learning, specifically the transformer architecture, and effectively captures the semantic information of text. Because direct access to the Hugging Face domain name may time out in the Chinese mainland, set the Hugging Face mirror address in advance by running export HF_ENDPOINT=https://hf-mirror.com, and then execute the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
Use Hugging Face Transformers
Hugging Face Transformers is an open-source library that provides a wide range of pre-trained deep learning models, particularly for natural language processing (NLP) tasks. Because direct access to the Hugging Face domain name may time out in the Chinese mainland, set the Hugging Face mirror address in advance by running export HF_ENDPOINT=https://hf-mirror.com, and then execute the following code:
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare the input
texts = ["This is sample text."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate the embedding
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0]  # Use the output of the [CLS] token
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
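To compare two texts with the [CLS] embeddings produced above, you can L2-normalize them and take their dot product, which is equivalent to cosine similarity. The following is a minimal sketch that reuses the tokenizer and model loaded above; the example sentences are arbitrary:
import torch.nn.functional as F

texts = ["That is a happy person", "That is a very happy person"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# L2-normalize the [CLS] embeddings so that the dot product equals cosine similarity
embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(float(embeddings[0] @ embeddings[1]))  # A value close to 1 indicates high semantic similarity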
Ollama
Ollama is an open-source tool that allows users to easily run, manage, and use various large language models locally. In addition to supporting open-source language models such as Llama 3 and Mistral, Ollama also supports embedding models such as bge-m3.
Deploy Ollama
On macOS and Windows, you can download the installation package from the official website of Ollama and install it; see the official website for the detailed installation procedure. After the installation is complete, Ollama runs in the background as a service.
To install Ollama on Linux, run the following command:
curl -fsSL https://ollama.ai/install.sh | sh
Pull an embedding model
Ollama supports the use of the bge-m3 model for text embedding:
ollama pull bge-m3
Use Ollama for text embedding
You can use the embedding capability of Ollama through its HTTP API or Python SDK.
HTTP API
import requests

def get_embedding(text: str) -> list:
    """Use the HTTP API of Ollama to obtain the text embedding."""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    return response.json()['embedding']

# Example usage
text = "This is a sample text."
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]
Python SDK
First, install the Python SDK of Ollama:
pip install ollama
Then, use it as follows:
import ollama

# Example usage
texts = ["Sentence 1", "Sentence 2"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
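As a quick sanity check, you can compare the two embeddings returned above with cosine similarity. This is a minimal sketch that assumes NumPy (pip install numpy) is available:
import numpy as np

v1, v2 = np.array(embeddings[0]), np.array(embeddings[1])
# Cosine similarity between the two sentence embeddings returned by Ollama
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))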
Advantages and disadvantages of Ollama
Advantages:
- Fully local deployment, no need for internet connection
- Open-source and free, no API key required
- Supports multiple models, easy to switch and compare
- Relatively low resource usage
Disadvantages:
- Limited embedding model options
- Performance may not match commercial services
- Requires self-maintenance and updates
- No enterprise-level support
When deciding whether to use Ollama, weigh these factors. If your application scenario prioritizes privacy or offline operation, Ollama is a good choice. However, if you need higher service stability and better performance, you may want to opt for commercial services.
Use online and remote embedding services
Deploying a local, offline embedding model places high demands on the specifications of the deployment machine and on the management of tasks such as model loading and unloading. Therefore, many users prefer online embedding services. Currently, many AI inference service providers offer text embedding services. Take the text embedding service of Tongyi Qianwen as an example: first, register an account on Alibaba Cloud BaiLian and obtain an API key. Then, you can call its public interface to obtain text embedding results.
HTTP call
After you obtain the API key, you can use the following code to try embedding text. If the requests package is not installed in your Python environment, run pip install requests to install it first.
import requests
from typing import List
class RemoteEmbedding():
    """
    OpenAI compatible embedding API. Tongyi, Baichuan, Doubao, etc.
    """
    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.

        Args:
            texts: List of text to embed.

        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.

        Args:
            text: Text to embed.

        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # You can refer to https://bailian.console.aliyun.com/#/model-market/detail/text-embedding-v3?tabKey=sdk
    api_key="your-api-key",  # Fill in with your API Key
    model="text-embedding-v3",
)

print("Embedding result:", embedding.embed_query("Today is a nice day"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]
print("Embedding results:", embedding.embed_documents(["Today is a nice day", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
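Online embedding services typically limit how many texts you can send in a single request; the exact limit depends on the provider and model, so check the service documentation. The following is a minimal sketch that reuses the RemoteEmbedding class above and splits a long document list into batches; the batch size of 10 is only an assumed placeholder, not a documented limit:
from typing import List

def embed_in_batches(embedder: RemoteEmbedding, texts: List[str], batch_size: int = 10) -> List[List[float]]:
    """Embed a long list of texts by sending them to the remote service in small batches."""
    all_embeddings: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        all_embeddings.extend(embedder.embed_documents(batch))
    return all_embeddings

# Example usage with the embedding instance created above
docs = [f"Document {i}" for i in range(25)]
vectors = embed_in_batches(embedding, docs)
print(len(vectors))  # 25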
Use the Tongyi Qianwen SDK
Tongyi Qianwen provides an SDK named dashscope for quickly calling model capabilities. After you install it by running pip install dashscope, you can use it to obtain text embeddings.
import dashscope
from dashscope import TextEmbedding
# Set the API key.
dashscope.api_key = "your-api-key"
# Prepare the input text.
texts = ["This is the first sentence.", "This is the second sentence."]
# Call the embedding service.
response = TextEmbedding.call(
model="text-embedding-v3",
input=texts
)
# Obtain the embedding result.
if response.status_code == 200:
    print(response.output['embeddings'])
    # [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
Common image embedding methods
Use an offline and local pre-trained embedding model
Use CLIP
Contrastive Language-Image Pretraining (CLIP) is a model proposed by OpenAI for multimodal learning by combining images and text. CLIP can understand and process the relationships between images and text, enabling it to excel in various tasks such as image classification, image retrieval, and text generation.
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# Prepare the input image and texts
image = Image.open("path_to_your_image.jpg")
texts = ["This is the first sentence", "This is the second sentence"]
# Preprocess the inputs and run the model locally
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Obtain the embedding results: CLIP projects the image and the texts into the same vector space
image_embeddings = outputs.image_embeds  # Shape: [1, 512] for clip-vit-base-patch32
text_embeddings = outputs.text_embeds    # Shape: [2, 512], one embedding per input text
print(image_embeddings)
print(text_embeddings)