OlapOnDB API|V3.3.3|OceanBase TuGraph| docs|Distributed Database

This document provides detailed instructions for using the OlapOnDB API

1. Introduction

Generally speaking, users need to implement the process of subgraph extraction by themselves. Then use the rich interface in TuGraph to implement your own graph analysis algorithm.

This document mainly introduces the Procedure and Embed interface design, and introduces some common interface, specific interface information see include/lgraph/olap_on_db.h file.

2. Schema

Procedure and Embed auxiliary functions are mainly included in the OlapOnDB class, and some more frequently used functions will be introduced one by one

OLAP is associated with the following common data structures in TuGraph:

DB graph analysis class OlapOnDB
Vertex array ParallelVector
Vertex set ParallelBitset
Side data structure AdjUnit/AdjUnit
Edge collection data structure AdjList

2.1 Snapshot Based Storage Structure

The OlapOnDB class in TuGraph provides a data "snapshot," that is, a fully usable copy of a given data set that includes a mirror image of the corresponding data at a certain point in time (the point at which the copy was started). Because OLAP operations involve only reads and not writes, OlapOnDB arranges data in a more compact manner, saving space while improving data access locality.

2.2 BSP Calculation Model

TuGraph uses the BSP (Bulk Synchronous Parallel) model in the process of calculation, which makes the process can be executed in parallel, and greatly improves the efficiency of the program.

The core idea of BSP calculation model is to propose and use Super Step. After OlapOnDB is created, the computation on this data is divided into multiple supersteps, such as PageRank, which is divided into multiple iterations, and each iteration is a Super Step. There is explicit synchronization between different Super Steps to ensure that all threads proceed to the next Super Step at the same time after completing the same superstep. Within a Super Step, all threads execute asynchronously, using parallelism to improve computational efficiency.

BSP calculation model can effectively avoid deadlock, and achieve coarse-grained global synchronization in hardware mode by means of obstacle synchronization, so that graph computation can be executed in parallel, and programmers do not need to spend much time on synchronization mutual exclusion.

3. Algorithm example

Here is an explanation of the PageRank algorithm, which is mainly divided into the main function Process and the PageRank algorithm process function.

3.1 Main function

The main function has three input parameters, 'TuGraph' database parameter 'db', the request 'request' obtained from the web side, and the return value 'response' given to the web side. The overall process can be divided into the following steps:

Obtain related parameters
Create a snapshot class
Main process of PageRank algorithm
Obtain and send the return value of the web page

extern "C" bool Process(GraphDB & db, const std::string & request, std::string & response) {

    // Obtain the number of iterations from web side requests (num_iterations)，
    int num_iterations = 20;
    try {
        json input = json::parse(request);
        num_iterations = input["num_iterations"].get<int>();
    } catch (std::exception & e) {
        throw std::runtime_error("json parse error");
        return false;
    }

    // Read transaction creation and snapshot class creation
    auto txn = db.CreateReadTxn();
    OlapOnDB<Empty> olapondb(
        db,
        txn,
        SNAPSHOT_PARALLEL
    );

    // Create a pr array to store the pr value for each node
    ParallelVector<double> pr = olapondb.AllocVertexArray<double>();
    // pagerank algorithm main process, obtain the pagerank value of each node
    PageRankCore(olapondb, num_iterations, pr);

    auto all_vertices = olapondb.AllocVertexSubset();
    all_vertices.Fill();
    /*
        Function Purpose: Gets the number of the node with the largest pagerank among all nodes

        Function flow description: This function executes Func A for node vi (also known as the active vertices) corresponding to all bits of 1 in the vertex set all_vertices. The return value of Func A is then used as the second input parameter of Func B to obtain the local maximum value (because the first input parameter is 0. So the return value is actually the pagerank value of each node). Finally, the return value of all threads is summarized, and Func B is executed again to get the global return value, and stored in the max_pr_vi variable
    */
    size_t max_pr_vi = olapondb.ProcessVertexActive<size_t>(

        //Func A
        [&](size_t vi) {
            return vi;
        },
        all_vertices,
        0,

        //Func B
        [&](size_t a, size_t b) {
            return pr[a] > pr[b] ? a : b;
        }
    );

    // Retrieve and send the return value from the web page
    json output;
    output["max_pr_vid"] = olapondb.OriginalVid(max_pr_vi);
    output["max_pr_val"] = pr[max_pr_vi];
    response = output.dump();
    return true;
}

3.2 PageRank Algorithm process

pagerank main process has two input parameters, snapshot class (subgraph) and iteration times, the overall process can be divided into the following steps:

Initialization of related data structures
Initialize the pagerank value of each node
Calculation of pagerank value of each node, active vertices for all vertices (means that all vertices need to calculate pagerank value)
Obtain the pagerank value after 'num_iterations' of each node

void PageRankCore(OlapBase<Empty>& graph, int num_iterations, ParallelVector<double>& curr) {

    // Initialization of related data structures
    auto all_vertices = olapondb.AllocVertexSubset();
    all_vertices.Fill();
    auto curr = olapondb.AllocVertexArray<double>();
    auto next = olapondb.AllocVertexArray<double>();
    size_t num_vertices = olapondb.NumVertices();
    double one_over_n = (double)1 / num_vertices;

    // The initialization of the pagerank value of each node is inversely proportional to the degree of the node
    double delta = graph.ProcessVertexActive<double>(
        [&](size_t vi) {
            curr[vi] = one_over_n;
            if (olapondb.OutDegree(vi) > 0) {
                curr[vi] /= olapondb.OutDegree(vi);
            }
            return one_over_n;
        },
        all_vertices);

    // Total iteration process
    double d = (double)0.85;
        for (int ii = 0;ii < num_iterations;ii ++) {
        printf("delta(%d)=%lf\n", ii, delta);
        next.Fill((double)0);

        /*
            Function Purpose: Calculates the pagerank of all nodes

            Function flow description: This function is used to calculate the pagerank value of all nodes. Execute Func C on node vi corresponding to all the bits of 1 in all_vertices to obtain the pagerank value of vi in the current iteration and return the pagerank change value of vi node. The total change value of all active nodes is finally summarized and returned through the internal processing of the function, which is stored in the delta variable
        */
        delta = graph.ProcessVertexActive<double>(
            // Func C
            [&](size_t vi) {
                double sum = 0;

                // Gets the pagerank value of the current node from its neighbor
                for (auto & edge : olapondb.InEdges(vi)) {
                    size_t src = edge.neighbour;
                    sum += curr[src];
                }
                next[vi] = sum;

                // pagerank value calculation core formula
                next[vi] = (1 - d) * one_over_n + d * next[vi];
                if (ii == num_iterations - 1) {
                    return (double)0;
                } else {

                    // Statistics of relevant intermediate variables
                    if (olapondb.OutDegree(vi) > 0) {
                        next[vi] /= olapondb.OutDegree(vi);
                        return fabs(next[vi] - curr[vi]) * olapondb.OutDegree(vi);
                    } else {
                        return fabs(next[vi] - curr[vi]);
                    }
                }
            },
            all_vertices
        );

        // The pagerank value obtained in this iteration is output as the input of the next iteration
        curr.Swap(next);
    }
}

4.1 Transaction creation

// Create a read transaction
auto txn = db.CreateReadTxn();

// Write transaction creation
auto txn = db.CreateWriteTxn();

4.2 Parallelization to create a directed graph

OlapOnDB<empty> olapondb(</empty>
db,
txn,
SNAPSHOT_PARALLEL
)

4.3 Create undirected graphs in parallel

OlapOnDB<empty> olapondb(</empty>
db,
txn,
SNAPSHOT_PARALLEL | SNAPSHOT_UNDIRECTED;
)

4.4 Obtain the output degree

size_t OutDegree(size_t vid)

4.5 Obtain the input degree

size_t InDegree(size_t vid)

4.6 Gets the outgoing edge set

/*
    Function name: AdjList OutEdges(size_t vid)

    Data structure:

    An AdjList can be understood as an array of structures of type AdjUnit

    AdjUnit has two member variables: 1. size_t neighbour 2. edge_data: neighbour indicates the number of the target node pointed by the outgoing edge. If the value is a licensed graph, the data type of edge_data is the same as the weight value of the edge in the input file

    Example Output all outgoing neighbors of node vid
*/
for (auto & edge : olapondb.OutEdges(vid)) {
    size_t dst = edge.neighbour;
    printf("src = %lu,dst = %lu\n",vid,dst);
}

4.7 Gets the incoming edge set

AdjList<edgedata> InEdges(size_t vid)</edgedata>

// Example: Outputs all inbound neighbors of node vid
for (auto & edge : olapondb.InEdges(vid)) {
size_t dst = edge.neighbour;
printf("src = %lu,dst = %lu\n",vid,dst);
}

4.8 Get the node ID of OlapOnDB corresponding to the node in TuGraph

size_t OriginalVid(size_t vid)

// Note: The node numbers entered in TuGraph can be non-numbers, such as names. When OlapOnDB subgraphs are generated, names and other names will be converted to numbers for subsequent processing. Therefore, this method may not be applicable to some specific scenarios

4.9 Gets the node number of TuGraph corresponding to a node in OlapOnDB

size_t MappedVid(size_t original_vid)

4.10 Description of active vertices

Active vertices refer to the vertices that need to be processed in the batch function. In this example, we simply output the number of active vertices and summarize the number of active vertices:

ParallelBitset temp = 000111; // The current active vertices are vertices 3, 4 and 5

size_t delta = ForEachActiveVertex<double>(</double>
//void c
[&](size_t vi) {
printf("active_vertexId = %lu\n",vi);
return 1;
},
all_vertices
);

The result of this function is obvious. Because of multiple threads, the output order may vary:

active_vertexId = 3
active_vertexId = 4
active_vertexId = 5

The local return value is 1. This function will add all the local return values in a thread-safe manner to the final return value, which is stored in the delta variable, and the final value is 3