Production environment data processing techniques

Last Updated：2023-08-18 09:26:34 Updated

Common tasks of database applications based on data warehouses are queries for or modifications on massive data. The problem facing application developers is how to achieve high performance during processing of massive data sets.

Data processing techniques include two types: iterative operation and set operation.

Iterative data processing techniques

In iterative processing, the application cyclically traverses a set of row data based on specified conditions.

The following procedure is a typical procedure of iterative processing, but not every step is required. In this procedure, the client/server model is used.

The client application reads a set of data from the database.
The client application internally processes this set of data entry by entry.
The client writes the data processing results back to the database.

You can implement iterative processing by using three mainstream technologies: progressive processing, group processing, and manual parallel processing.

Progressive iterative processing

In progressive processing, a single thread traverses the dataset and operates on a single row at a time. In a typical implementation, the application retrieves a row of data from the database, processes the data in the intermediate layer, and sends the data back to the database. Then, the database executes DML statements and commits the transaction.

Assume that you need to process records in the raw_list table, split each row into two records, and insert them into the op_list table. The following sample codes of a PL program use the progressive processing technology to implement this requirement:

CREATE TABLE raw_list(
      id number NOT NULL PRIMARY KEY
    , debit_id number NOT NULL
    , credit_id number NOT NULL
    , op_amount number NOT NULL
    , op_time date NOT NULL 
);
CREATE SEQUENCE seq_op_list START WITH 1 ;
CREATE TABLE op_list(
      op_id number NOT NULL PRIMARY KEY
    , op_code varchar2(20) NOT NULL
    , account_id number NOT NULL
    , amount number NOT NULL
    , gmt_create date DEFAULT sysdate NOT NULL
);

delimiter /
DECLARE
    CURSOR c1 IS SELECT * FROM raw_list ;
    r   raw_list%rowtype ;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 INTO r ;
        EXIT WHEN c1%NOTFOUND ;
    
        INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '-', r.credit_id, r.op_amount);
        INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '+', r.debit_id, r.op_amount);
        COMMIT;
    END LOOP; 
END;
/
delimiter ;

The progressive processing technology has the following advantages:

Meanwhile, the progressive processing technology has the following disadvantages:

Group iterative processing

Group processing is the same as progressive processing, except that it processes multiple rows instead of a single row during each iteration . Example:

delimiter /
DECLARE
    CURSOR c1 IS SELECT * FROM raw_list ;
    TYPE T_RAW  IS TABLE OF raw_list%ROWTYPE INDEX BY binary_integer ;
    t   T_RAW;
    row_cnt binary_integer  := 0;
    array_size  NUMBER := 100;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 BULK COLLECT INTO t LIMIT array_size ;
        EXIT WHEN t.COUNT = 0 ;
    
        FORALL i IN 1..t.COUNT
            INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '-', t(i).credit_id, t(i).op_amount);
        
        FORALL i IN 1..t.COUNT
            INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '+', t(i).debit_id, t(i).op_amount);
        
        COMMIT;
    END LOOP; 
END;
/
delimiter ;

Different from codes for progressive iterative processing, the foregoing codes use the BULK COLLECT operator in the FETCH statement to obtain data in groups and specify array_size to limit the size of each group.

Compared with the progressive iterative processing technology, the group iterative processing technology has the following advantages:

The disadvantages of the group iterative processing technology are the same as those of the progressive iterative processing technology. Specifically, the processing time is also excessively long when massive data needs to be processed. In addition, the application can execute codes only serially, failing to fully utilize the parallel processing capability of the database.

Manual parallel iterative processing

Manual parallel iterative processing use the same iteration algorithm as progressive processing and group processing, but allows multiple server processes to divide and run jobs in parallel.

Assume that you have the same business demand as in progressive iterative processing except that:

delimiter /
DECLARE 
    sqlstmt varchar2(1024)  := q'[
DECLARE
    CURSOR c1 IS SELECT * FROM raw_list_${thr} ;
    TYPE T_RAW  IS TABLE OF raw_list%ROWTYPE INDEX BY binary_integer ;
    t   T_RAW;
    row_cnt binary_integer  := 0;
    array_size  NUMBER := 100;
BEGIN
    OPEN c1;
    LOOP
        FETCH c1 BULK COLLECT INTO t LIMIT array_size ;
        EXIT WHEN t.COUNT = 0 ;
    
        FORALL i IN 1..t.COUNT
            INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '-', t(i).credit_id, t(i).op_amount);
        
        FORALL i IN 1..t.COUNT
            INSERT INTO op_list(op_id, op_code, account_id, amount) values(seq_op_list.nextval, '+', t(i).debit_id, t(i).op_amount);
        
        COMMIT;
    END LOOP; 
END;
]'  ;

BEGIN
    sqlstmt := replace(sqlstmt, '${thr}', 1);
    EXECUTE IMMEDIATE sqlstmt ;
END;
/
delimiter ;

Manual parallel processing has the following advantages:

The manual parallel processing has the following disadvantages:

Set-based data processing

Set-based processing is an SQL technology that processes datasets in the database.

In the set-based model, SQL statements define the results and the database is allowed to determine the most efficient way to obtain the results. On the contrary, an iteration algorithm uses conditional logic to pull each row or group of data rows from the database to the client application, processes the data on the client, and then sends the data back to the database. Set-based processing avoids network roundtrips and API overheads of the database, because the data does not leave the database.

Still take the previous requirement as an example. The following SQL statements use the set-based algorithm to meet the requirement:

INSERT INTO op_list(op_id, op_code, account_id, amount, gmt_create) 
SELECT  seq_op_list.nextval, '+', debit_id, op_amount, sysdate
FROM raw_list;
INSERT INTO op_list(op_id, op_code, account_id, amount, gmt_create) 
SELECT  seq_op_list.nextval, '-', credit_id, op_amount, sysdate
FROM raw_list;
COMMIT;

The SELECT clause that follows the INSERT statement reads all data and writes the data to the target table at a time. Then, the application commits the transactions all at once.

Compared with iteration algorithms, the set-based processing technology has the following advantages:

The performance is higher when a medium volume of data needs to be processed.
The codes are simpler and clearer.
Compared with manual parallel processing, the set-based processing technology can take advantage of the parallel processing capability of the database.
If table join operations are involved, the database may use a more efficient Hash join algorithm, which is much better than the nested loop algorithms in iteration algorithms.

Compared with the iteration algorithms, the set-based processing technology has the following disadvantages:

The logic of SQL statements is relatively simple, and cannot implement complex business logic.
Large transactions may be generated. If the data volume is very large, large transactions may strain the memory resources of the database.