This document mainly introduces the usage instructions of OlapOnDisk API in detail
Table of contents
- 1. Introduction
- [2. Algorithm Example](#2-Algorithm Example)
- [2.1 header files] (#21-header files)
- [2.2 Configuration class MyConfig](#22-Configuration class MyConfig)
- [2.3 Main function](#23-Main function)
- [2.4 bfs algorithm flow] (#24-bfs algorithm flow)
- [3. Function description of other commonly used functions](#3-Function description of other commonly used functions)
- [3.1 Image Loading] (#31-Image Loading)
- [3.2 Image writing] (#32-Image writing)
- [3.3 Graph Analysis Function](#33-Graph Analysis Function)
1. Introduction
The Standalone mode of TuGraph can be used to load graph data files, where the sources of graph data files can include text files, BINARY_FILE binary files, and ODPS sources. In this mode, TuGraph can quickly load multiple data sources into a graph, and then run iterative algorithms such as BFS, WCC, SSSP, etc. on the graph, and output the final result to the terminal.
In TuGraph, the export and calculation process can be accelerated through parallel processing in memory, so as to achieve near real-time processing and analysis. Compared with traditional methods, it avoids the overhead of data export and storage, and can use compact Graph data structures achieve desirable performance for computation.
TuGraph has built-in a large number of common graph analysis algorithms and rich auxiliary interfaces, so users hardly need to implement the specific graph calculation process by themselves. They only need to include the header file (.h) of the corresponding algorithm library when implementing their own stored procedures. To your own program, and link your own dynamic library files in the compilation phase.
This document mainly introduces the common interfaces of Standalone, and the auxiliary functions used are mainly contained in the OlapOnDB class. At the same time, in order to help users understand and facilitate, the BFS algorithm is illustrated with examples.
2. Algorithm example
Here, the BFS algorithm is explained in blocks, which are roughly divided into the main function main, the BFS algorithm process BFSCore function and the configuration class MyConfig.
2.1 Head file
#include "olap/olap_on_disk.h"
#include "tools/json.hpp" //Header files to include when using TuGraph
#include "./algo.h" //A header file containing various algorithmic logic functions
When using TuGraph to realize the calculation application of graph data files, generally, the StandaloneGraph class object graph is first created, the graph file data is loaded into the graph, and then the graph calculation process is realized by calling the graph logic function, and finally the result of the graph calculation is printed out.
2.2 Configuration class MyConfig
The MyConfig configuration class function is used to provide the configuration information required for the algorithm logic calculation, inherited from ConfigBase
The MyConfig configuration class generally depends on the algorithm, and additional configuration information is required as follows:
- Parameters required by the algorithm
- Algorithm name
- Configure the Print function in the class Other common members inherit from ConfigBase, please refer to src/olap/olap_config.h for reference.
class MyConfig : public ConfigBase<Empty> {
public:
// The parameters required by the algorithm are initialized
size_t root = 0;
std::string name = std::string("bfs");
void AddParameter(fma_common::Configuration & config) {
ConfigBase<Empty>::AddParameter(config);
config.Add(root, "root", true)
.Comment("the root of bfs");
}
void Print() {
ConfigBase<Empty>::Print();
std::cout << " name: " << name << std::endl;
if (root != size_t(-1)) {
std::cout << " root: " << root << std::endl;
} else {
std::cout << " root: UNSET" << std::endl;
}
}
// The configuration file accepts command line parameters. This use case will sequentially read the parameters when calling the algorithm from the command line. The value specified by the user is preferred. If the user does not specify it, the default parameter is selected.
MyConfig(int &argc, char** &argv): ConfigBase<Empty>(argc, argv) {
fma_common::Configuration config;
AddParameter(config);
config.ExitAfterHelp(true);
config.ParseAndFinalize(argc, argv);
Print();
}
};
2.3 main function
int main(int argc, char** argv) {
double start_time;
// Statistical memory consumption class MemUsage instantiation
MemUsage memUsage;
memUsage.startMemRecord();
// prepare
start_time = get_time();
// Configuration class MyConfig instantiation
MyConfig config(argc, argv);
size_t root_vid = config.root;
// OlapOnDisk class instantiation
OlapOnDisk<Empty> graph;
graph.Load(config, DUAL_DIRECTION);
memUsage.print();
memUsage.reset();
// Statistical graph loading time consumption
auto prepare_cost = get_time() - start_time;
printf("prepare_cost = %.2lf(s)\n", prepare_cost);
// core
start_time = get_time();
// Create an array to count whether a node has been traversed
auto parent = graph.AllocVertexArray<size_t>();
// Breadth-first search algorithm, returns the number of nodes connected to the root_vid root node in the graph
size_t count = BFSCore(graph, root_vid, parent);
memUsage.print();
memUsage.reset();
auto core_cost = get_time() - start_time;
printf("core_cost = %.2lf(s)\n", core_cost);
// output
start_time = get_time();
// Print relevant information to the terminal
printf("found_vertices = %ld\n", count);
auto output_cost = get_time() - start_time;
printf("output_cost = %.2lf(s)\n", output_cost);
printf("total_cost = %.2lf(s)\n", prepare_cost + core_cost + output_cost);
printf("DONE.");
return 0;
}
2.4 bfs algorithm process
The main process of bfs has two input parameters, the snapshot class (subgraph) and the number of iterations. The overall process can be divided into the following steps:
- Relevant definitions and initialization of data structures
- Use the batch function to perform cyclic calculations on each node, find all nodes adjacent to the current node in each round, and exchange them when the round ends.
- Until all nodes are found, return the number of nodes discovered_vertices.
size_t BFSCore(Graph<Empty>& graph, size_t root_vid, ParallelVector<size_t>& parent){
size_t root = root_vid;
auto active_in = graph.AllocVertexSubset(); //Allocate an array, active_in is used to store the nodes found in the previous cycle stage
active_in.Add(root); //Add the root node to the array
auto active_out = graph.AllocVertexSubset(); //Allocate the array active_out to store the nodes found in the current cycle stage
parent.Fill((size_t)-1); //Assign a value of -1 to the node in the parent array, -1 means not found
parent[root] = root;
size_t num_activations = 1; //Indicates the number of nodes found in the current loop phase
size_t discovered_vertices = 0; //Indicates the total number of nodes found in the current cycle phase
for (int ii = 0; num_activations != 0; ii++) { //num_activations indicates the number of nodes found in the current loop phase
printf("activates(%d) <= %lu\n", ii, num_activations);
discovered_vertices += num_activations; //discovered_vertices indicates the total number of nodes found in the current cycle phase
active_out.Clear();
num_activations = graph.ProcessVertexActive<size_t>(
[&](size_t vi) {
size_t num_activations = 0;
for (auto& edge : graph.OutEdges(vi)) { //Each cycle starts from the root node, finds adjacent adjacent nodes, changes its parent value, and operates num_activations+1
size_t dst = edge.neighbour;
if (parent[dst] == (size_t)-1) {
auto lock = graph.GuardVertexLock(dst);
if (parent[dst] == (size_t)-1) {
parent[dst] = vi;
num_activations += 1;
active_out.Add(dst); //Store the nodes found in the current loop phase
}
}
}
return num_activations;
},
active_in);
active_in.Swap(active_out);
}
// return all nodes
return discovered_vertices;
}
3. Description of other commonly used functions
3.1 Graph load
TuGraph-StandaloneThe loading sources of graph data files are mainly divided into three categories: text files, binary files, and ODPS. The binary file is a file in which the binary representation of the edge data is arranged in order, which can save a lot of storage space. Its loading function is divided into three types, namely:
void Load(ConfigBase<EdgeData> config,EdgeDirectionPolicy edge_direction_policy = DUAL_DIRECTION):The loading method of the graph data file contains two parameters, and their meanings represent respectivelyconfig:Configuration parameters to load. This parameter saves the general information of the graph (such as data source, algorithm name, data input and output paths, number of vertices, etc.) and different information parameters configured according to different data sources and different algorithms.edge_direction_policy:Specifies whether the graph is directed or undirected, including three modes: DUAL_DIRECTION, MAKE_SYMMETRIC, and INPUT_SYMMETRIC. Among them, DUAL_DIRECTION is the default graph loading method. DUAL_DIRECTION : The input file is an asymmetric graph and the loaded graph is an asymmetric graph. MAKE_SYMMETRIC : The input file is an asymmetric graph and the loaded graph is a symmetric graph. INPUT_SYMMETRIC : The input file is a symmetric graph and the loaded graph is a symmetric graph. For details, seeenum EdgeDirectionPolicyin the olap_config.h file under the lgraph folder.
void LoadVertexArrayTxt<V>(V * array, std::string path, std::function<size_t(const char *, const char *, VertexUnit<V> &)> parse_line):Load the vertices in the file into an array in the order of their ids. The meanings of each parameter are:array:array of data to be readpath:The path to read the file, each line in the file represents a pair of vertexparse_line:A user-defined function that tells the system how to parse a line of text data into a vertex pair.
3.2 Graph write
void Write(ConfigBase<EdgeData> & config, ParallelVector<VertexData>& array, size_t array_size, std::string name, std::function<bool(VertexData &)> filter_output = filter_output_default<VertexData&>):Write the data in the array back to the file, and the meanings of each parameter are:config:Configuration parameters to load. This parameter saves the general information of the graph (such as data source, algorithm name, data input and output paths, number of vertices, etc.) and different information parameters configured according to different data sources and different algorithms.array:array of data to be writtenarray_size:The length of the number of data to be writtenname:algorithm namefilter_output:Write data rule function, the data to be written needs to meet the requirements of this function.
3.3 graph parse function
std::tuple<size_t, bool> parse_line_unweighted(const char *p, const char *end, EdgeUnit<EdgeData> &e):Parse the graph data file, and load the graph as an unweighted graph.std::tuple<size_t, bool> parse_line_weight(const char* p, const char* end, EdgeUnit<EdgeData>& e):Parse the graph data file, load the graph as a weighted graph, and specify the weight data type by modifying.
This function can be specified through the constructor parse_line when the MyConfig class is defined.