This topic describes the concept of text embedding in vector search and provides some examples.
What is text embedding?
Text embedding is a technique for converting text into numerical vectors that capture its semantics, so that computers can "understand" and process the meaning of the text. Specifically:
Text embedding maps words or sentences to points in a high-dimensional vector space.
In this vector space, semantically similar texts are mapped to nearby positions.
Vectors usually consist of hundreds or thousands of dimensions (for example, 512 or 1,024).
You can calculate the similarity between vectors by using methods such as cosine similarity, as shown in the sketch after this list.
Common text embedding models include Word2Vec, BERT, and BGE.
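As an illustration of the similarity computation mentioned above, the following minimal sketch computes cosine similarity with NumPy. The two 4-dimensional vectors are made-up toy values, not real model output:

import numpy as np

# Two toy 4-dimensional vectors standing in for real embeddings,
# which typically have hundreds or thousands of dimensions.
a = np.array([0.1, 0.3, -0.2, 0.7])
b = np.array([0.2, 0.1, -0.1, 0.8])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # Closer to 1 means more semantically similar.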
When you develop a RAG application, you usually need to embed the text data and store the resulting vectors in a vector database, while storing the structured data in a relational database. Starting from OceanBase Database V4.3.3, you can store vectors as a field type in a relational table, which allows you to store both vectors and conventional scalar data in an orderly and efficient manner in OceanBase Database.
Common text embedding methods
Preparation
You need to install pip in advance.
Use an offline or local pre-trained embedding model
Embedding text locally with a pre-trained model provides the most flexibility but requires substantial computing resources. Several commonly used approaches are described below.
Use Sentence Transformers
Direct access to the Hugging Face domain usually times out in China. Therefore, set the environment variable export HF_ENDPOINT=https://hf-mirror.com in advance. If the sentence-transformers package is not installed, run pip install sentence-transformers to install it first. Then, run the following code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-m3")
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)
print(embeddings)
# [[-0.01178016 0.00884024 -0.05844684 ... 0.00750248 -0.04790139
# 0.00330675]
# [-0.03470375 -0.00886354 -0.05242309 ... 0.00899352 -0.02396279
# 0.02985837]
# [-0.01356584 0.01900942 -0.05800966 ... 0.00523864 -0.05689549
# 0.00077098]
# [-0.02149693 0.02998871 -0.05638731 ... 0.01443702 -0.02131325
# -0.00112451]]
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
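To inspect the scores rather than just the shape, print the matrix itself:

print(similarities)
# The matrix is symmetric, with values near 1.0 on the diagonal (each sentence
# compared with itself). You should expect "That is a happy person" and
# "That is a very happy person" to be the closest off-diagonal pair, and
# "Today is a sunny day" to score lower against all the others.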
Use Hugging Face Transformers
Direct access to the Hugging Face domain usually times out in China. Therefore, set the environment variable export HF_ENDPOINT=https://hf-mirror.com in advance. If the transformers and torch packages are not installed, run pip install transformers torch to install them first. Then, run the following code:
from transformers import AutoTokenizer, AutoModel
import torch
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = AutoModel.from_pretrained("BAAI/bge-m3")
# Prepare the input
texts = ["This is a sample text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Generate the embedding
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0]  # Use the output of the [CLS] token
print(embeddings)
# tensor([[-1.4136, 0.7477, -0.9914, ..., 0.0937, -0.0362, -0.1650]])
print(embeddings.shape)
# torch.Size([1, 1024])
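Using the [CLS] token is one common pooling strategy; another is mean pooling over all token embeddings. Here is a minimal sketch that reuses the inputs and outputs variables from the code above; whether it matches [CLS] pooling in quality depends on how the model was trained:

# Alternative: mean pooling over token embeddings. The attention mask ensures
# that padding tokens do not contribute to the average.
mask = inputs["attention_mask"].unsqueeze(-1)  # shape: [batch, seq_len, 1]
mean_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)
# torch.Size([1, 1024])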
Ollama
Ollama is an open-source model runtime that allows you to easily run, manage, and use various large language models on your local machine. In addition to supporting open-source language models such as Llama 3 and Mistral, it also supports the bge-m3 embedding model.
- Deploy Ollama
On macOS and Windows, you can download and run the installer directly. For more information, see the official website of Ollama. After the installation is complete, Ollama runs in the background as a service.
To install Ollama on Linux, run the following command:
curl -fsSL https://ollama.ai/install.sh | sh
- Pull the embedding model
Ollama supports using the bge-m3 model for text embedding:
ollama pull bge-m3
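After the pull completes, you can run ollama list to confirm that the model is available locally.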
- Use Ollama to embed text
You can use Ollama's embedding capability through the HTTP API or Python SDK:
Use the HTTP API
import requests

def get_embedding(text: str) -> list:
    """Use the HTTP API of Ollama to obtain the text embedding"""
    response = requests.post(
        'http://localhost:11434/api/embeddings',
        json={
            'model': 'bge-m3',
            'prompt': text
        }
    )
    return response.json()['embedding']

# Example usage
text = "This is a sample text"
embedding = get_embedding(text)
print(embedding)
# [-1.4269912242889404, 0.9092104434967041, ...]

Use the Python SDK
First, install the Ollama Python SDK:
pip install ollama

Then, use it as follows:
import ollama

# Example usage
texts = ["Sentence 1", "Sentence 2"]
embeddings = ollama.embed(model="bge-m3", input=texts)['embeddings']
print(embeddings)
# [[0.03486196, 0.0625187, ...], [...]]
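Both approaches assume that the Ollama service is running locally and listening on its default port 11434. As the example shows, ollama.embed accepts a list of texts, so you can embed a batch in a single call.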
- Advantages and limitations of Ollama
The advantages are as follows:
- Runs entirely on your local machine, with no network connection required
- Open source and free, with no API key required
- Supports multiple models, making it easy to switch and compare them
- Relatively small resource footprint
The limitations are as follows:
- Limited model selection
- Lower performance compared to commercial services
- Requires manual maintenance and updates
- No enterprise-level support
Weigh these factors when deciding whether to use Ollama. If your scenario requires high privacy or must run offline, Ollama is a good choice; otherwise, a commercial service may offer better service quality and performance.
Call an online or remote embedding service
Deploying an embedding model on a local server requires high-specification hardware, and you must manage the loading and unloading of the model yourself. Because many users prefer to consume embedding as an online service, many AI inference service providers offer hosted embedding APIs. For example, you can register an account on Alibaba Cloud Model Studio, obtain an API Key, and then call its public API to obtain text embeddings.
Call the API through HTTP
After you obtain the API Key, you can call the API to embed texts. If the requests package is not installed in your Python environment, run pip install requests to install it first.
import requests
from typing import List

class RemoteEmbedding:
    """
    OpenAI-compatible embedding API. Works with Tongyi, Baichuan, Doubao, etc.
    """

    def __init__(
        self,
        base_url: str,
        api_key: str,
        model: str,
        dimensions: int = 1024,
        **kwargs,
    ):
        self._base_url = base_url
        self._api_key = api_key
        self._model = model
        self._dimensions = dimensions

    def embed_documents(
        self,
        texts: List[str],
    ) -> List[List[float]]:
        """Embed search docs.

        Args:
            texts: List of texts to embed.

        Returns:
            List of embeddings.
        """
        res = requests.post(
            f"{self._base_url}",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "input": texts,
                "model": self._model,
                "encoding_format": "float",
                "dimensions": self._dimensions,
            },
        )
        data = res.json()
        embeddings = []
        try:
            for d in data["data"]:
                embeddings.append(d["embedding"][: self._dimensions])
            return embeddings
        except Exception as e:
            print(data)
            print("Error", e)
            raise e

    def embed_query(self, text: str, **kwargs) -> List[float]:
        """Embed query text.

        Args:
            text: Text to embed.

        Returns:
            Embedding.
        """
        return self.embed_documents([text])[0]

embedding = RemoteEmbedding(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1/embeddings",  # For more information, see https://bailian.console.aliyun.com/#/model-market/detail/text-embedding-v3?tabKey=sdk
    api_key="your-api-key",  # Enter your API Key.
    model="text-embedding-v3",
)

print("Embedding result:", embedding.embed_query("Today is a nice day"), "\n")
# Embedding result: [-0.03573227673768997, 0.0645645260810852, ...]

print("Embedding results:", embedding.embed_documents(["Today is a nice day", "What about tomorrow?"]), "\n")
# Embedding results: [[-0.03573227673768997, 0.0645645260810852, ...], [-0.05443647876381874, 0.07368793338537216, ...]]
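Because RemoteEmbedding only assumes an OpenAI-compatible /embeddings endpoint, you can point base_url at any provider that exposes this interface, such as the Tongyi, Baichuan, or Doubao services mentioned in its docstring, and adjust model accordingly. The dimensions parameter is both sent to the service and used to truncate each returned vector.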
Call the API using the Tongyi SDK
Tongyi provides the dashscope SDK to quickly call model capabilities. After you install the dashscope package using the pip install dashscope command, you can call its public API to obtain text embeddings.
import dashscope
from dashscope import TextEmbedding
# Set the API Key.
dashscope.api_key = "your-api-key"
# Prepare the input text.
texts = ["This is the first sentence.", "This is the second sentence."]
# Call the embedding service.
response = TextEmbedding.call(
    model="text-embedding-v3",
    input=texts
)

# Obtain the embedding results.
if response.status_code == 200:
    print(response.output['embeddings'])
    # [{"embedding": [-0.03193652629852295, 0.08152323216199875, ...]}, {"embedding": [...]}]
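If the call fails, the status code will not be 200; in that case, inspect the response object for the error details returned by the service before retrying.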