OlapOnDisk API|V3.3.3|OceanBase TuGraph| docs|Distributed Database

This document mainly introduces the usage instructions of OlapOnDisk API in detail

1. Introduction
[2. Algorithm Example](#2-Algorithm Example)
- [2.1 header files] (#21-header files)
- [2.2 Configuration class MyConfig](#22-Configuration class MyConfig)
- [2.3 Main function](#23-Main function)
- [2.4 bfs algorithm flow] (#24-bfs algorithm flow)
[3. Function description of other commonly used functions](#3-Function description of other commonly used functions)
- [3.1 Image Loading] (#31-Image Loading)
- [3.2 Image writing] (#32-Image writing)
- [3.3 Graph Analysis Function](#33-Graph Analysis Function)

1. Introduction

The Standalone mode of TuGraph can be used to load graph data files, where the sources of graph data files can include text files, BINARY_FILE binary files, and ODPS sources. In this mode, TuGraph can quickly load multiple data sources into a graph, and then run iterative algorithms such as BFS, WCC, SSSP, etc. on the graph, and output the final result to the terminal.

In TuGraph, the export and calculation process can be accelerated through parallel processing in memory, so as to achieve near real-time processing and analysis. Compared with traditional methods, it avoids the overhead of data export and storage, and can use compact Graph data structures achieve desirable performance for computation.

TuGraph has built-in a large number of common graph analysis algorithms and rich auxiliary interfaces, so users hardly need to implement the specific graph calculation process by themselves. They only need to include the header file (.h) of the corresponding algorithm library when implementing their own stored procedures. To your own program, and link your own dynamic library files in the compilation phase.

This document mainly introduces the common interfaces of Standalone, and the auxiliary functions used are mainly contained in the OlapOnDB class. At the same time, in order to help users understand and facilitate, the BFS algorithm is illustrated with examples.

2. Algorithm example

Here, the BFS algorithm is explained in blocks, which are roughly divided into the main function main, the BFS algorithm process BFSCore function and the configuration class MyConfig.

2.1 Head file

#include "olap/olap_on_disk.h"   
#include "tools/json.hpp"      //Header files to include when using TuGraph
#include "./algo.h"   //A header file containing various algorithmic logic functions

When using TuGraph to realize the calculation application of graph data files, generally, the StandaloneGraph class object graph is first created, the graph file data is loaded into the graph, and then the graph calculation process is realized by calling the graph logic function, and finally the result of the graph calculation is printed out.

2.2 Configuration class MyConfig

The MyConfig configuration class function is used to provide the configuration information required for the algorithm logic calculation, inherited from ConfigBase , where EdgeDate can choose Empty (unweighted graph), int (the weight of the weighted graph is an integer) or double (the weight of the weighted graph is double) type.

The MyConfig configuration class generally depends on the algorithm, and additional configuration information is required as follows:

Parameters required by the algorithm
Algorithm name
Configure the Print function in the class Other common members inherit from ConfigBase, please refer to src/olap/olap_config.h for reference.

class MyConfig : public ConfigBase<Empty> {
 public:

    // The parameters required by the algorithm are initialized
    size_t root = 0;
    std::string name = std::string("bfs");
    void AddParameter(fma_common::Configuration & config) {
        ConfigBase<Empty>::AddParameter(config);
        config.Add(root, "root", true)
                .Comment("the root of bfs");
    }
    void Print() {
        ConfigBase<Empty>::Print();
        std::cout << "  name: " << name << std::endl;
        if (root != size_t(-1)) {
            std::cout << "  root: " << root << std::endl;
        } else {
            std::cout << "  root: UNSET" << std::endl;
        }
    }
    // The configuration file accepts command line parameters. This use case will sequentially read the parameters when calling the algorithm from the command line. The value specified by the user is preferred. If the user does not specify it, the default parameter is selected.
    MyConfig(int &argc, char** &argv): ConfigBase<Empty>(argc, argv) {
        fma_common::Configuration config;
        AddParameter(config);
        config.ExitAfterHelp(true);
        config.ParseAndFinalize(argc, argv);
        Print();
    }
};

2.3 main function

int main(int argc, char** argv) {
    double start_time;
    // Statistical memory consumption class MemUsage instantiation
    MemUsage memUsage;
    memUsage.startMemRecord();

    // prepare
    start_time = get_time();
    // Configuration class MyConfig instantiation
    MyConfig config(argc, argv);
    size_t root_vid = config.root;
    // OlapOnDisk class instantiation
    OlapOnDisk<Empty> graph;
    graph.Load(config, DUAL_DIRECTION);
    memUsage.print();
    memUsage.reset();
    // Statistical graph loading time consumption
    auto prepare_cost = get_time() - start_time;
    printf("prepare_cost = %.2lf(s)\n", prepare_cost);

    // core
    start_time = get_time();
    // Create an array to count whether a node has been traversed
    auto parent = graph.AllocVertexArray<size_t>();
    // Breadth-first search algorithm, returns the number of nodes connected to the root_vid root node in the graph
    size_t count = BFSCore(graph, root_vid, parent);
    memUsage.print();
    memUsage.reset();
    auto core_cost = get_time() - start_time;
    printf("core_cost = %.2lf(s)\n", core_cost);

    // output
    start_time = get_time();
    // Print relevant information to the terminal
    printf("found_vertices = %ld\n", count);
    auto output_cost = get_time() - start_time;
    printf("output_cost = %.2lf(s)\n", output_cost);

    printf("total_cost = %.2lf(s)\n", prepare_cost + core_cost + output_cost);
    printf("DONE.");

    return 0;
}

2.4 bfs algorithm process

The main process of bfs has two input parameters, the snapshot class (subgraph) and the number of iterations. The overall process can be divided into the following steps:

Relevant definitions and initialization of data structures
Use the batch function to perform cyclic calculations on each node, find all nodes adjacent to the current node in each round, and exchange them when the round ends.
Until all nodes are found, return the number of nodes discovered_vertices.

size_t BFSCore(Graph<Empty>& graph, size_t root_vid, ParallelVector<size_t>& parent){

  size_t root = root_vid;
  auto active_in = graph.AllocVertexSubset();   //Allocate an array, active_in is used to store the nodes found in the previous cycle stage
  active_in.Add(root);            //Add the root node to the array
  auto active_out = graph.AllocVertexSubset();  //Allocate the array active_out to store the nodes found in the current cycle stage
  parent.Fill((size_t)-1);               //Assign a value of -1 to the node in the parent array, -1 means not found
  parent[root] = root;
  size_t num_activations = 1;       //Indicates the number of nodes found in the current loop phase
  size_t discovered_vertices = 0;    //Indicates the total number of nodes found in the current cycle phase

  for (int ii = 0; num_activations != 0; ii++) {       //num_activations indicates the number of nodes found in the current loop phase
      printf("activates(%d) <= %lu\n", ii, num_activations);
      discovered_vertices += num_activations;         //discovered_vertices indicates the total number of nodes found in the current cycle phase
      active_out.Clear();
      num_activations = graph.ProcessVertexActive<size_t>(
          [&](size_t vi) {
              size_t num_activations = 0;
              for (auto& edge : graph.OutEdges(vi)) {   //Each cycle starts from the root node, finds adjacent adjacent nodes, changes its parent value, and operates num_activations+1
                  size_t dst = edge.neighbour;
                  if (parent[dst] == (size_t)-1) {
                      auto lock = graph.GuardVertexLock(dst);
                      if (parent[dst] == (size_t)-1) {
                          parent[dst] = vi;
                          num_activations += 1;
                          active_out.Add(dst);       //Store the nodes found in the current loop phase
                      }
                  }
              }
              return num_activations;
          },
          active_in);
      active_in.Swap(active_out);
  }
  // return all nodes
  return discovered_vertices;
}

3. Description of other commonly used functions

3.1 Graph load

TuGraph-StandaloneThe loading sources of graph data files are mainly divided into three categories: text files, binary files, and ODPS. The binary file is a file in which the binary representation of the edge data is arranged in order, which can save a lot of storage space. Its loading function is divided into three types, namely:

void Load(ConfigBase<EdgeData> config,EdgeDirectionPolicy edge_direction_policy = DUAL_DIRECTION)：The loading method of the graph data file contains two parameters, and their meanings represent respectively
- config：Configuration parameters to load. This parameter saves the general information of the graph (such as data source, algorithm name, data input and output paths, number of vertices, etc.) and different information parameters configured according to different data sources and different algorithms.
- edge_direction_policy：Specifies whether the graph is directed or undirected, including three modes: DUAL_DIRECTION, MAKE_SYMMETRIC, and INPUT_SYMMETRIC. Among them, DUAL_DIRECTION is the default graph loading method. DUAL_DIRECTION : The input file is an asymmetric graph and the loaded graph is an asymmetric graph. MAKE_SYMMETRIC : The input file is an asymmetric graph and the loaded graph is a symmetric graph. INPUT_SYMMETRIC : The input file is a symmetric graph and the loaded graph is a symmetric graph. For details, see enum EdgeDirectionPolicy in the olap_config.h file under the lgraph folder.
void LoadVertexArrayTxt<V>(V * array, std::string path, std::function<size_t(const char *, const char *, VertexUnit<V> &)> parse_line)：Load the vertices in the file into an array in the order of their ids. The meanings of each parameter are:
- array：array of data to be read
- path：The path to read the file, each line in the file represents a pair of vertex
- parse_line：A user-defined function that tells the system how to parse a line of text data into a vertex pair.

3.2 Graph write

void Write(ConfigBase<EdgeData> & config, ParallelVector<VertexData>& array, size_t array_size, std::string name, std::function<bool(VertexData &)> filter_output = filter_output_default<VertexData&>)：Write the data in the array back to the file, and the meanings of each parameter are:
- config：Configuration parameters to load. This parameter saves the general information of the graph (such as data source, algorithm name, data input and output paths, number of vertices, etc.) and different information parameters configured according to different data sources and different algorithms.
- array：array of data to be written
- array_size：The length of the number of data to be written
- name：algorithm name
- filter_output：Write data rule function, the data to be written needs to meet the requirements of this function.

3.3 graph parse function

std::tuple<size_t, bool> parse_line_unweighted(const char *p, const char *end, EdgeUnit<EdgeData> &e)：Parse the graph data file, and load the graph as an unweighted graph.
std::tuple<size_t, bool> parse_line_weight(const char* p, const char* end, EdgeUnit<EdgeData>& e)：Parse the graph data file, load the graph as a weighted graph, and specify the weight data type by modifying .

This function can be specified through the constructor parse_line when the MyConfig class is defined.

OlapOnDisk API

Table of contents