This document provides detailed instructions for using the OlapOnDB API
1. Introduction
Generally speaking, users need to implement the process of subgraph extraction by themselves. Then use the rich interface in TuGraph to implement your own graph analysis algorithm.
This document mainly introduces the Procedure and Embed interface design, and introduces some common interface, specific interface information see include/lgraph/olap_on_db.h file.
2. Schema
Procedure and Embed auxiliary functions are mainly included in the OlapOnDB class, and some more frequently used functions will be introduced one by one
OLAP is associated with the following common data structures in TuGraph:
DB graph analysis class
OlapOnDBVertex array
ParallelVectorVertex set
ParallelBitsetSide data structure
AdjUnit/AdjUnitEdge collection data structure
AdjList
2.1 Snapshot Based Storage Structure
The OlapOnDB class in TuGraph provides a data "snapshot," that is, a fully usable copy of a given data set that includes a mirror image of the corresponding data at a certain point in time (the point at which the copy was started). Because OLAP operations involve only reads and not writes, OlapOnDB arranges data in a more compact manner, saving space while improving data access locality.
2.2 BSP Calculation Model
TuGraph uses the BSP (Bulk Synchronous Parallel) model in the process of calculation, which makes the process can be executed in parallel, and greatly improves the efficiency of the program.
The core idea of BSP calculation model is to propose and use Super Step. After OlapOnDB is created, the computation on this data is divided into multiple supersteps, such as PageRank, which is divided into multiple iterations, and each iteration is a Super Step. There is explicit synchronization between different Super Steps to ensure that all threads proceed to the next Super Step at the same time after completing the same superstep. Within a Super Step, all threads execute asynchronously, using parallelism to improve computational efficiency.
BSP calculation model can effectively avoid deadlock, and achieve coarse-grained global synchronization in hardware mode by means of obstacle synchronization, so that graph computation can be executed in parallel, and programmers do not need to spend much time on synchronization mutual exclusion.
3. Algorithm example
Here is an explanation of the PageRank algorithm, which is mainly divided into the main function Process and the PageRank algorithm process function.
3.1 Main function
The main function has three input parameters, 'TuGraph' database parameter 'db', the request 'request' obtained from the web side, and the return value 'response' given to the web side. The overall process can be divided into the following steps:
Obtain related parameters
Create a snapshot class
Main process of PageRank algorithm
Obtain and send the return value of the web page
extern "C" bool Process(GraphDB & db, const std::string & request, std::string & response) {
// Obtain the number of iterations from web side requests (num_iterations),
int num_iterations = 20;
try {
json input = json::parse(request);
num_iterations = input["num_iterations"].get<int>();
} catch (std::exception & e) {
throw std::runtime_error("json parse error");
return false;
}
// Read transaction creation and snapshot class creation
auto txn = db.CreateReadTxn();
OlapOnDB<Empty> olapondb(
db,
txn,
SNAPSHOT_PARALLEL
);
// Create a pr array to store the pr value for each node
ParallelVector<double> pr = olapondb.AllocVertexArray<double>();
// pagerank algorithm main process, obtain the pagerank value of each node
PageRankCore(olapondb, num_iterations, pr);
auto all_vertices = olapondb.AllocVertexSubset();
all_vertices.Fill();
/*
Function Purpose: Gets the number of the node with the largest pagerank among all nodes
Function flow description: This function executes Func A for node vi (also known as the active vertices) corresponding to all bits of 1 in the vertex set all_vertices. The return value of Func A is then used as the second input parameter of Func B to obtain the local maximum value (because the first input parameter is 0. So the return value is actually the pagerank value of each node). Finally, the return value of all threads is summarized, and Func B is executed again to get the global return value, and stored in the max_pr_vi variable
*/
size_t max_pr_vi = olapondb.ProcessVertexActive<size_t>(
//Func A
[&](size_t vi) {
return vi;
},
all_vertices,
0,
//Func B
[&](size_t a, size_t b) {
return pr[a] > pr[b] ? a : b;
}
);
// Retrieve and send the return value from the web page
json output;
output["max_pr_vid"] = olapondb.OriginalVid(max_pr_vi);
output["max_pr_val"] = pr[max_pr_vi];
response = output.dump();
return true;
}
3.2 PageRank Algorithm process
pagerank main process has two input parameters, snapshot class (subgraph) and iteration times, the overall process can be divided into the following steps:
- Initialization of related data structures
- Initialize the pagerank value of each node
- Calculation of pagerank value of each node, active vertices for all vertices (means that all vertices need to calculate pagerank value)
- Obtain the pagerank value after 'num_iterations' of each node
void PageRankCore(OlapBase<Empty>& graph, int num_iterations, ParallelVector<double>& curr) {
// Initialization of related data structures
auto all_vertices = olapondb.AllocVertexSubset();
all_vertices.Fill();
auto curr = olapondb.AllocVertexArray<double>();
auto next = olapondb.AllocVertexArray<double>();
size_t num_vertices = olapondb.NumVertices();
double one_over_n = (double)1 / num_vertices;
// The initialization of the pagerank value of each node is inversely proportional to the degree of the node
double delta = graph.ProcessVertexActive<double>(
[&](size_t vi) {
curr[vi] = one_over_n;
if (olapondb.OutDegree(vi) > 0) {
curr[vi] /= olapondb.OutDegree(vi);
}
return one_over_n;
},
all_vertices);
// Total iteration process
double d = (double)0.85;
for (int ii = 0;ii < num_iterations;ii ++) {
printf("delta(%d)=%lf\n", ii, delta);
next.Fill((double)0);
/*
Function Purpose: Calculates the pagerank of all nodes
Function flow description: This function is used to calculate the pagerank value of all nodes. Execute Func C on node vi corresponding to all the bits of 1 in all_vertices to obtain the pagerank value of vi in the current iteration and return the pagerank change value of vi node. The total change value of all active nodes is finally summarized and returned through the internal processing of the function, which is stored in the delta variable
*/
delta = graph.ProcessVertexActive<double>(
// Func C
[&](size_t vi) {
double sum = 0;
// Gets the pagerank value of the current node from its neighbor
for (auto & edge : olapondb.InEdges(vi)) {
size_t src = edge.neighbour;
sum += curr[src];
}
next[vi] = sum;
// pagerank value calculation core formula
next[vi] = (1 - d) * one_over_n + d * next[vi];
if (ii == num_iterations - 1) {
return (double)0;
} else {
// Statistics of relevant intermediate variables
if (olapondb.OutDegree(vi) > 0) {
next[vi] /= olapondb.OutDegree(vi);
return fabs(next[vi] - curr[vi]) * olapondb.OutDegree(vi);
} else {
return fabs(next[vi] - curr[vi]);
}
}
},
all_vertices
);
// The pagerank value obtained in this iteration is output as the input of the next iteration
curr.Swap(next);
}
}
4.1 Transaction creation
// Create a read transaction
auto txn = db.CreateReadTxn();
// Write transaction creation
auto txn = db.CreateWriteTxn();
4.2 Parallelization to create a directed graph
OlapOnDB<empty> olapondb(</empty>
db,
txn,
SNAPSHOT_PARALLEL
)
4.3 Create undirected graphs in parallel
OlapOnDB<empty> olapondb(</empty>
db,
txn,
SNAPSHOT_PARALLEL | SNAPSHOT_UNDIRECTED;
)
4.4 Obtain the output degree
size_t OutDegree(size_t vid)
4.5 Obtain the input degree
size_t InDegree(size_t vid)
4.6 Gets the outgoing edge set
/*
Function name: AdjList OutEdges(size_t vid)
Data structure:
An AdjList can be understood as an array of structures of type AdjUnit
AdjUnit has two member variables: 1. size_t neighbour 2. edge_data: neighbour indicates the number of the target node pointed by the outgoing edge. If the value is a licensed graph, the data type of edge_data is the same as the weight value of the edge in the input file
Example Output all outgoing neighbors of node vid
*/
for (auto & edge : olapondb.OutEdges(vid)) {
size_t dst = edge.neighbour;
printf("src = %lu,dst = %lu\n",vid,dst);
}
4.7 Gets the incoming edge set
AdjList<edgedata> InEdges(size_t vid)</edgedata>
// Example: Outputs all inbound neighbors of node vid
for (auto & edge : olapondb.InEdges(vid)) {
size_t dst = edge.neighbour;
printf("src = %lu,dst = %lu\n",vid,dst);
}
4.8 Get the node ID of OlapOnDB corresponding to the node in TuGraph
size_t OriginalVid(size_t vid)
// Note: The node numbers entered in TuGraph can be non-numbers, such as names. When OlapOnDB subgraphs are generated, names and other names will be converted to numbers for subsequent processing. Therefore, this method may not be applicable to some specific scenarios
4.9 Gets the node number of TuGraph corresponding to a node in OlapOnDB
size_t MappedVid(size_t original_vid)
4.10 Description of active vertices
Active vertices refer to the vertices that need to be processed in the batch function. In this example, we simply output the number of active vertices and summarize the number of active vertices:
ParallelBitset temp = 000111; // The current active vertices are vertices 3, 4 and 5
size_t delta = ForEachActiveVertex<double>(</double>
//void c
[&](size_t vi) {
printf("active_vertexId = %lu\n",vi);
return 1;
},
all_vertices
);
The result of this function is obvious. Because of multiple threads, the output order may vary:
active_vertexId = 3
active_vertexId = 4
active_vertexId = 5
The local return value is 1. This function will add all the local return values in a thread-safe manner to the final return value, which is stored in the delta variable, and the final value is 3