Introduction to TPC-H
TPC-H is a business intelligence (BI) benchmark developed by the Transaction Processing Performance Council (TPC) to simulate the decision-making processes of real applications. It is commonly used in academia and industry to evaluate the performance of decision support systems. The benchmark comprehensively evaluates the overall business computing capabilities of a decision support system and imposes high requirements on its vendors. Because of its broad and practical business value, TPC-H is widely used in scenarios such as bank credit and credit card analysis, telecom operations analysis, tax analysis, and decision analysis in the tobacco industry.
The TPC-H benchmark is the successor of TPC-D, a benchmark for decision support systems developed by the TPC in 1994. TPC-H implements a data warehouse in the third normal form (3NF) that contains eight base tables. The main evaluation metric is the response time of each query, from submission to the return of results. The composite metric, QphH@Size, reports how many complex queries the database executes per hour at a given data scale (Size) and reflects the query processing capability of the system. Because the TPC-H benchmark is modeled on real-world production and operation environments, it can evaluate key performance characteristics that other benchmarks cannot. TPC-H fills a gap in data warehouse testing and encourages database vendors and research institutions to push decision support technology to its limits.
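For reference, the TPC-H specification defines the composite metric as the geometric mean of the single-stream Power metric and the multi-stream Throughput metric:

```latex
\mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}
```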
Overview
This topic describes how to run the TPC-H benchmark on OceanBase Database in the following ways:
- Automatically run the TPC-H benchmark by using OceanBase Deployer (obd).
- Manually run the TPC-H benchmark step by step by using the official TPC-H tools.
Note
OceanBase Database has been extensively optimized in versions later than V4.0.0 to improve user experience, ease of use, and performance. When you run the TPC-H benchmark on these versions, you only need to tune basic parameters.
Prepare the environment
Before testing, prepare the test environment as per the following requirements:
- Java Development Kit (JDK): Use JDK 1.8u131 or later.
- make: Run the `yum install make` command to install make.
- GCC: Run the `yum install gcc` command to install GCC.
- mysql-devel: Run the `yum install mysql-devel` command to install mysql-devel.
- MySQL Connector/Python: Run the `sudo yum install MySQL-python` command to install MySQL Connector/Python.
- PrettyTable: Run the `pip install prettytable` command to install PrettyTable.
- Java Database Connectivity (JDBC): Use MySQL Connector/J 5.1.47.
- TPC-H tools: Click here to download the tools. If you use obd to perform the test, you do not need to obtain the tools.
- OBClient: For more information, see the OBClient documentation.
- OceanBase Database: For more information, see Quick start with OceanBase Database Community Edition.
- IOPS: We recommend that the disk IOPS be at least 10,000.
- Tenant specifications:
```sql
CREATE RESOURCE UNIT tpch_unit max_cpu 26, memory_size '70g';

CREATE RESOURCE POOL tpch_pool unit = 'tpch_unit', unit_num = 1, zone_list=('zone1','zone2','zone3');

CREATE TENANT tpch_mysql resource_pool_list=('tpch_pool'), zone_list('zone1', 'zone2', 'zone3'), primary_zone=RANDOM, locality='F@zone1,F@zone2,F@zone3' set variables ob_compatibility_mode='mysql', ob_tcp_invited_nodes='%', secure_file_priv = '';
```

Notice
- The preceding tenant specifications are based on the hardware configuration used in the TPC-H benchmark report of OceanBase Database. Adjust the specifications to the hardware configuration of your own cluster; see the sizing sketch after this list.
- We recommend that you do not use the `obd cluster autodeploy` command to deploy the cluster. To ensure stability, this command does not maximize resource utilization; for example, it does not use all of the available memory. If you do use it, we recommend that you modify the configuration file separately to maximize resource utilization.
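As a hedged illustration of such an adjustment, the sketch below scales the resource unit down for a smaller server. The numbers are examples rather than recommendations, and the `ALTER RESOURCE UNIT` statement follows the same pattern as the `CREATE RESOURCE UNIT` statement above:

```bash
# Illustrative only: resize tpch_unit for a 16-core, 64 GB OBServer node,
# leaving headroom for the sys tenant and the operating system.
obclient -h$host_ip -P$host_port -uroot@sys -A \
  -e "ALTER RESOURCE UNIT tpch_unit max_cpu 12, memory_size '40g';"
```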
Deploy a cluster
Four servers are required to run this test. Deploy TPC-H and obd on one server for stress testing, and deploy an OceanBase cluster using obd on the other three servers in the 1-1-1 architecture, which means that the OceanBase cluster has three zones, with each containing one OBServer node.
Note
In the TPC-H test, the server for deploying TPC-H and obd requires only 4 CPU cores and 16 GB of memory.
After successful deployment, create the tenant and users required for running the TPC-H benchmark. The `sys` tenant is a built-in system tenant for managing the cluster; do not use the `sys` tenant to run the benchmark. Set the `primary_zone` parameter of the tenant to `RANDOM`. The value `RANDOM` indicates that the leader of each new table partition is randomly assigned to one of the three OBServer nodes.
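The exact user and database names are up to you. The following minimal sketch creates the database and user referenced later in this topic; the names and the `FILE` grant mirror the manual-test sections below:

```bash
# A minimal sketch: create the test database and user inside the tpch_mysql tenant.
# Adapt the names, password, and host/port placeholders to your environment.
obclient -h$host_ip -P$host_port -uroot@tpch_mysql -A -e "
  CREATE DATABASE IF NOT EXISTS tpch_100g_part;
  CREATE USER 'tpch_100g_part' IDENTIFIED BY 'test';
  GRANT ALL PRIVILEGES ON tpch_100g_part.* TO 'tpch_100g_part';
  GRANT FILE ON *.* TO 'tpch_100g_part';
"
```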
Use obd to run the benchmark
```bash
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
sudo yum install obtpch
sudo ln -s /usr/tpc-h-tools/tpc-h-tools/ /usr/local/
obd test tpch obperf --tenant=tpch_mysql -s 100 --remote-tbl-dir=/tmp/tpch100
```
Considerations
- For more information about the parameters for running the TPC-H benchmark by using obd, see obd test tpch.
- In this example, default values are used for most parameters. You can adjust their values as needed. For example, the cluster name in this example is `obperf`, and the tenant name is `tpch_mysql`.
- If you use obd to run the test, you must also use obd to install and deploy the cluster. Otherwise, obd cannot obtain the cluster information, and performance tuning cannot be performed based on the cluster configuration.
- If you want to change the password of the `sys` tenant on a terminal, you must log in to the terminal, change the password of the `sys` tenant back to the default value, and then run the `obd cluster edit-config` command to set a password for the `sys` tenant in the configuration file. The password is specified by the `root_password` parameter. After the `obd cluster edit-config` command is executed, you must run the `obd cluster reload` command for the modification to take effect.
- After you run the `obd test tpch` command, the system lists the test steps and outputs in detail. The larger the data amount, the longer the test takes.
- The remote directory specified by `--remote-tbl-dir` must have sufficient space to store the TPC-H data. We therefore recommend that you store test data on a separate disk.
- The `obd test tpch` command automatically completes all operations, including generating and transmitting the test data, tuning OceanBase Database parameters, loading the data, and running the test. If an error occurs during the process, you can retry by referring to obd test tpch. For example, you can skip data generation and transmission, and directly load data and run the test.
Manually run the TPC-H benchmark
This section describes how to use the official TPC-H tools to run the benchmark step by step. Running the benchmark manually gives you a better understanding of OceanBase Database, especially the effect of key parameter settings.
Optimize the test environment
Environment optimization involves system-level parameters, which you set in the `sys` tenant, and tenant-level variables, which you set as the test user.
Optimize OceanBase Database parameters.
Run the `obclient -h$host_ip -P$host_port -uroot@sys -A` command to log in to the `sys` tenant, and then execute the following statements:

```sql
ALTER SYSTEM flush plan cache GLOBAL;
ALTER SYSTEM SET enable_sql_audit=false;
SELECT sleep(5);
ALTER SYSTEM SET enable_perf_event=false;
ALTER SYSTEM SET syslog_level='PERF';
ALTER SYSTEM SET enable_record_trace_log=false;
```

Set the tenant parameters.
Run the `obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A` command to log in as the test user, and then execute the following statements:

```sql
SET GLOBAL ob_sql_work_area_percentage = 80;
SET GLOBAL ob_query_timeout = 36000000000;
SET GLOBAL ob_trx_timeout = 36000000000;
SET GLOBAL max_allowed_packet = 67108864;
-- parallel_servers_target = max_cpu * server_num * 8
SET GLOBAL parallel_servers_target = 624;
```
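To confirm that the settings took effect, you can read a variable back from a new session. The value 624 here comes from max_cpu (26) multiplied by 3 OBServer nodes and by 8. A minimal check:

```bash
# Read back a tenant variable from the test user's session.
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A \
  -e "SHOW GLOBAL VARIABLES LIKE 'parallel_servers_target';"
```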
Install the TPC-H tools
Download the TPC-H tools. For more information, visit the TPC-H Tools Download page.
Decompress the package and enter the TPC-H directory.
```bash
[wieck@localhost ~] $ unzip 7e965ead-8844-4efa-a275-34e35f8ab89b-tpc-h-tool.zip
[wieck@localhost ~] $ cd TPC-H_Tools_v3.0.0
```

Copy the `makefile.suite` file as `Makefile`.

```bash
[wieck@localhost TPC-H_Tools_v3.0.0] $ cd dbgen/
[wieck@localhost dbgen] $ cp makefile.suite Makefile
```

Redefine the `CC`, `DATABASE`, `MACHINE`, and `WORKLOAD` parameters in the `Makefile`.

```bash
[wieck@localhost dbgen] $ vim Makefile

CC      = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE = MYSQL
MACHINE  = LINUX
WORKLOAD = TPCH
```

Add new macro definitions to the `tpcd.h` file.

```bash
[wieck@localhost dbgen] $ vim tpcd.h

#ifdef MYSQL
#define GEN_QUERY_PLAN ""
#define START_TRAN "START TRANSACTION"
#define END_TRAN "COMMIT"
#define SET_OUTPUT ""
#define SET_ROWCOUNT "limit %d;\n"
#define SET_DBASE "use %s;\n"
#endif
```

Compile the file.

```bash
make
```
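If compilation succeeds, the `dbgen` data generator and `qgen` query generator are built in the current directory; a quick check:

```bash
# Both binaries should be present and executable after make completes.
ls -l dbgen qgen
```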
Generate data
You can generate 10 GB, 100 GB, or 1 TB of data as needed. The following example shows the command to generate 100 GB of data.
```bash
./dbgen -s 100
mkdir tpch100
mv *.tbl tpch100
```
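Generating 100 GB serially can take a while. As a hedged sketch, assuming your dbgen build supports the `-C` (number of chunks) and `-S` (chunk step) options shown by `./dbgen -h`, you can generate chunks in parallel:

```bash
# Generate the 100 GB data set as 8 chunks, one background process each.
for step in $(seq 1 8); do
  ./dbgen -s 100 -C 8 -S "$step" -f &   # -f overwrites existing files
done
wait  # Block until all chunk generators finish.
```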
Generate SQL query statements
Note
You can perform the following steps to generate SQL query statements and then modify the statements. Alternatively, you can use the SQL query statements provided on GitHub. To use the SQL query statements provided on GitHub, you must change the value of the cpu_num field in the statements to the actual degree of parallelism (DOP).
Copy the `qgen` and `dists.dss` files to the `queries` directory.

```bash
cp qgen queries
cp dists.dss queries
```

In the `queries` directory, create a script named `gen.sh` to generate SQL query statements.

```bash
[wieck@localhost queries] $ vim gen.sh

#!/usr/bin/bash
for i in {1..22}
do
./qgen -d $i -s 100 > db"$i".sql
done
```

Run the `gen.sh` script.

```bash
chmod +x gen.sh
./gen.sh
```

Modify the SQL query statements.

```bash
dos2unix *
```

For more information about the modified SQL query statements, visit GitHub. You must change the value of the `cpu_num` field in the SQL statements to the actual DOP. To ensure the best performance, we recommend that you set the value to the total number of available CPU cores.

You can execute the following statement in the `sys` tenant to query the total number of available CPU cores for a tenant:

```sql
SELECT sum(cpu_capacity_max) FROM __all_virtual_server;
```

Here is a sample SQL query statement `q1` after modification:

```sql
SELECT /*+ parallel(96) */   -- Specify the execution DOP.
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```
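Later, once the tables are created and loaded, you can time a single modified query by hand to sanity-check the DOP setting. A minimal sketch, assuming the test user and database from the load step and the `db1.sql` file generated above:

```bash
# Run Q1 once and report wall-clock time; -c preserves the /*+ parallel */ hint.
time obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name -c < db1.sql
```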
Create tables
Create a schema file named `create_tpch_mysql_table_part.ddl`. Note that `cpu_num` in the following statements is a placeholder for the partition count; see the replacement sketch after the DDL.

```sql
DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS partsupp;
DROP TABLE IF EXISTS part;
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS supplier;
DROP TABLE IF EXISTS nation;
DROP TABLE IF EXISTS region;
DROP TABLEGROUP IF EXISTS tpch_tg_lineitem_order_group;
DROP TABLEGROUP IF EXISTS tpch_tg_partsupp_part;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_lineitem_order_group binding true partition by key 1 partitions cpu_num;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_partsupp_part binding true partition by key 1 partitions cpu_num;
DROP TABLE IF EXISTS lineitem;
CREATE TABLE lineitem (
l_orderkey BIGINT NOT NULL,
l_partkey BIGINT NOT NULL,
l_suppkey INTEGER NOT NULL,
l_linenumber INTEGER NOT NULL,
l_quantity DECIMAL(15,2) NOT NULL,
l_extendedprice DECIMAL(15,2) NOT NULL,
l_discount DECIMAL(15,2) NOT NULL,
l_tax DECIMAL(15,2) NOT NULL,
l_returnflag char(1) DEFAULT NULL,
l_linestatus char(1) DEFAULT NULL,
l_shipdate date NOT NULL,
l_commitdate date DEFAULT NULL,
l_receiptdate date DEFAULT NULL,
l_shipinstruct char(25) DEFAULT NULL,
l_shipmode char(10) DEFAULT NULL,
l_comment varchar(44) DEFAULT NULL,
PRIMARY KEY (l_orderkey, l_linenumber)) row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key (l_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
o_orderkey bigint not null,
o_custkey bigint not null,
o_orderstatus char(1) default null,
o_totalprice bigint default null,
o_orderdate date not null,
o_orderpriority char(15) default null,
o_clerk char(15) default null,
o_shippriority bigint default null,
o_comment varchar(79) default null,
PRIMARY KEY (o_orderkey)) row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key(o_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS partsupp;
CREATE TABLE partsupp (
ps_partkey bigint not null,
ps_suppkey bigint not null,
ps_availqty bigint default null,
ps_supplycost bigint default null,
ps_comment varchar(199) default null,
PRIMARY KEY (ps_partkey, ps_suppkey)) row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(ps_partkey) partitions cpu_num;
DROP TABLE IF EXISTS part;
CREATE TABLE part (
p_partkey bigint not null,
p_name varchar(55) default null,
p_mfgr char(25) default null,
p_brand char(10) default null,
p_type varchar(25) default null,
p_size bigint default null,
p_container char(10) default null,
p_retailprice bigint default null,
p_comment varchar(23) default null,
PRIMARY KEY (p_partkey)) row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(p_partkey) partitions cpu_num;
DROP TABLE IF EXISTS customer;
CREATE TABLE customer (
c_custkey bigint not null,
c_name varchar(25) default null,
c_address varchar(40) default null,
c_nationkey bigint default null,
c_phone char(15) default null,
c_acctbal bigint default null,
c_mktsegment char(10) default null,
c_comment varchar(117) default null,
PRIMARY KEY (c_custkey)) row_format = condensed
partition by key(c_custkey) partitions cpu_num;
DROP TABLE IF EXISTS supplier;
CREATE TABLE supplier (
s_suppkey bigint not null,
s_name char(25) default null,
s_address varchar(40) default null,
s_nationkey bigint default null,
s_phone char(15) default null,
s_acctbal bigint default null,
s_comment varchar(101) default null,
PRIMARY KEY (s_suppkey)) row_format = condensed
partition by key(s_suppkey) partitions cpu_num;
DROP TABLE IF EXISTS nation;
CREATE TABLE nation (
n_nationkey bigint not null,
n_name char(25) default null,
n_regionkey bigint default null,
n_comment varchar(152) default null,
PRIMARY KEY (n_nationkey)) row_format = condensed;
DROP TABLE IF EXISTS region;
CREATE TABLE region (
r_regionkey bigint not null,
r_name char(25) default null,
r_comment varchar(152) default null,
PRIMARY KEY (r_regionkey)) row_format = condensed;
```
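The `cpu_num` placeholder in the tablegroup and partition clauses must be replaced with a real partition count before the DDL can execute. A minimal sketch, assuming a DOP of 96 that matches the CPU-core query in the previous section:

```bash
# Substitute the placeholder, then execute the DDL as the test user.
sed -i 's/cpu_num/96/g' create_tpch_mysql_table_part.ddl
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name < create_tpch_mysql_table_part.ddl
```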
Load data
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can perform the following steps to load data:
Create a directory for loading scripts.
```bash
[wieck@localhost dbgen] $ mkdir load
[wieck@localhost dbgen] $ cd load
[wieck@localhost load] $ cp xx/create_tpch_mysql_table_part.ddl ./
```

Create a script named `load.py`.

```bash
[wieck@localhost load] $ vim load.py
```

```python
#!/usr/bin/env python
#-*- encoding:utf-8 -*-
import os
import sys
import time
import commands

hostname='$host_ip'    # Notice!! Specify the IP address of an OBServer node, such as OBServer node A.
port='$host_port'      # The port number of OBServer node A.
tenant='$tenant_name'  # The tenant name.
user='$user'           # The username.
password='$password'   # The password.
data_path='$path'      # Notice!! Specify the directory of the tbl files on OBServer node A.
db_name='$db_name'     # The database name.

# Create the tables.
cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s < create_tpch_mysql_table_part.ddl'%(hostname,port,user,tenant,password,db_name)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s -e "show tables;" '%(hostname,port,user,tenant,password,db_name)
result = commands.getstatusoutput(cmd_str)
print result
# Load data into each table. The -c option preserves the /*+ parallel */ hint.
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/customer.tbl' into table customer fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/lineitem.tbl' into table lineitem fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/nation.tbl' into table nation fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/orders.tbl' into table orders fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/partsupp.tbl' into table partsupp fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/part.tbl' into table part fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/region.tbl' into table region fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/supplier.tbl' into table supplier fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
```

Load the data.
Notice
To load data, you need to install the OBClient.
```bash
$ python load.py
(0,'')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.\nTABLE_NAME\nT1\nLINEITEM\nORDERS\nPARTSUPP\nPART\nCUSTOMER\nSUPPLIER\nNATION\nREGION')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
```

Perform a major compaction.
Note
Log in to the `sys` tenant to perform a major compaction.

```sql
MySQL [(none)]> use oceanbase
Database changed
MySQL [oceanbase]> ALTER SYSTEM major freeze tenant=<your tenant name>;
Query OK, 0 rows affected
```

Check whether the major compaction is completed.
```sql
MySQL [oceanbase]> SELECT FROZEN_SCN, LAST_SCN FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+---------------------+---------------------+
| FROZEN_SCN          | LAST_SCN            |
+---------------------+---------------------+
| 1667239201167716767 | 1667239201167716767 |
| 1667239200111919300 | 1667239200111919300 |
| 1667239201167452168 | 1667239201167452168 |
| 1667239201168053124 | 1667239201168053124 |
| 1667239201167520213 | 1667239201167520213 |
+---------------------+---------------------+
```

If all `FROZEN_SCN` and `LAST_SCN` values are equal, the major compaction is completed.

Manually collect statistics.
Run the `obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A -D$database` command to log in as the test user, and then execute the following statement:

```sql
call dbms_stats.gather_schema_stats('$db_name', degree=>96);
```
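Optionally, spot-check the load and statistics with the MySQL-compatible `information_schema` views (the row counts are estimates). For example, at scale factor 100 the `lineitem` table should hold roughly 600 million rows:

```bash
# Estimated row counts per table after loading and statistics collection.
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name \
  -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = '$db_name';"
```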
Run the test
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can perform the following steps to run the test:
Write a script named `tpch.sh` in the `queries` directory.

```bash
[wieck@localhost queries] $ vim tpch.sh

#!/bin/bash
TPCH_TEST="obclient -h $host_ip -P $host_port -utpch_100g_part@tpch_mysql -D tpch_100g_part -ptest -c"

# Run a warmup.
for i in {1..22}
do
    sql1="source db${i}.sql"
    echo $sql1 | $TPCH_TEST > db${i}.log || ret=1
done

# Run the formal test.
for i in {1..22}
do
    starttime=`date +%s%N`
    echo `date '+[%Y-%m-%d %H:%M:%S]'` "BEGIN Q${i}"
    sql1="source db${i}.sql"
    echo $sql1 | $TPCH_TEST > db${i}.log || ret=1
    stoptime=`date +%s%N`
    costtime=`echo $stoptime $starttime | awk '{printf "%0.2f\n", ($1 - $2) / 1000000000}'`
    echo `date '+[%Y-%m-%d %H:%M:%S]'` "END,COST ${costtime}s"
done
```

Run the test script.

```bash
sh tpch.sh
```

Note
For more information about the test results, see TPC-H benchmark report of OceanBase Database.
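The script prints one `END,COST …s` line per query; a small sketch for capturing the output and totaling the elapsed time:

```bash
# Save the run output, then sum the per-query COST values.
sh tpch.sh | tee tpch_result.log
grep 'END,COST' tpch_result.log | sed 's/.*COST //; s/s$//' \
  | awk '{sum += $1} END {printf "Total: %.2f s over %d queries\n", sum, NR}'
```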
FAQ
What do I do when data import fails with the following error message?

```bash
ERROR 1017 (HY000) at line 1: File not exist
```

Place the tbl files in a directory on the server that hosts OceanBase Database, because only local data files can be imported and loaded.
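For example, a hedged sketch of copying the generated files to an OBServer node (the host name and paths are illustrative):

```bash
# Copy the generated .tbl files from the stress-test server to OBServer node A.
scp tpch100/*.tbl admin@observer-a:/data/tpch100/
```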
What do I do when the following error message is returned while viewing the data?

```bash
ERROR 4624 (HY000): No memory or reach tenant memory limit
```

The tenant is running out of memory. Allocate more memory to the tenant.
What do I do when the following error message is returned during data import?

```bash
ERROR 1227 (42501) at line 1: Access denied
```

Grant the FILE privilege to the login user by running the following command:

```sql
grant file on *.* to tpch_100g_part;
```

What do I do when the following error message is returned after I run the `dos2unix *` command to modify the SQL query statements?

```bash
-bash: dos2unix: command not found
```

Install dos2unix by running the following command:

```bash
yum install -y dos2unix
```