Introduction to TPC-H
TPC-H is a business intelligence (BI) benchmark developed by the Transaction Processing Performance Council (TPC) to simulate the decision-making processes of real applications. It is commonly used in academia and industry to evaluate the performance of decision support systems. The benchmark comprehensively evaluates the overall business computing capabilities of a decision support system and imposes high requirements on its vendors. Because of its broad and practical business value, TPC-H is widely used in scenarios such as bank credit and credit card analysis, telecom operations analysis, tax analysis, and decision analysis in the tobacco industry.
The TPC-H benchmark is the successor of TPC-D, a benchmark for decision support systems developed by the TPC in 1994. TPC-H implements a data warehouse in the third normal form (3NF) that contains eight base tables. The main evaluation metric is the response time of each query, from submission to the return of results. The composite metric, QphH@Size, reports how many complex queries the database executes per hour at a given data scale (Size) and reflects the query processing capability of the system. Because the TPC-H benchmark is modeled on real-world production and operation environments, it can evaluate key performance characteristics that other benchmarks cannot. TPC-H fills a gap in data warehouse testing and encourages database vendors and research institutions to push decision support technology to its limits.
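For reference, the TPC-H specification defines the composite metric as the geometric mean of the single-stream Power metric and the multi-stream Throughput metric:

```latex
\mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}
```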
Overview
This topic describes how to run the TPC-H benchmark on OceanBase Database in the following ways:
- Automatically run the TPC-H benchmark by using OceanBase Deployer (obd).
- Manually run the TPC-H benchmark step by step by using the official TPC-H tools.
Note
OceanBase Database has been extensively optimized in versions later than V4.0.0 to improve user experience, ease of use, and performance. When you run the TPC-H benchmark on these versions, you only need to tune basic parameters.
Prepare the environment
Before testing, prepare the test environment as per the following requirements:
- Java Development Kit (JDK): Use JDK 1.8u131 or later.
- make: Run the `yum install make` command to install make.
- GCC: Run the `yum install gcc` command to install GCC.
- mysql-devel: Run the `yum install mysql-devel` command to install mysql-devel.
- MySQL Connector/Python: Run the `sudo yum install MySQL-python` command to install MySQL Connector/Python.
- PrettyTable: Run the `pip install prettytable` command to install PrettyTable.
- Java Database Connectivity (JDBC): Use MySQL Connector/J 5.1.47.
- TPC-H tools: Click here to download the tools. If you use obd to perform the test, you do not need to obtain the tools.
- OBClient: For more information, see the OBClient documentation.
- OceanBase Database: For more information, see Quick start with OceanBase Database Community Edition.
- IOPS: We recommend that the disk IOPS be at least 10,000.
- Tenant specifications:
```sql
CREATE RESOURCE UNIT tpch_unit max_cpu 26, memory_size '70g';

CREATE RESOURCE POOL tpch_pool unit = 'tpch_unit', unit_num = 1, zone_list=('zone1','zone2','zone3');

CREATE TENANT tpch_mysql resource_pool_list=('tpch_pool'), zone_list('zone1', 'zone2', 'zone3'), primary_zone=RANDOM, locality='F@zone1,F@zone2,F@zone3' set variables ob_compatibility_mode='mysql', ob_tcp_invited_nodes='%', secure_file_priv = '';
```

Notice
- The preceding tenant specifications are based on the hardware configuration used in the TPC-H benchmark report of OceanBase Database. Adjust the specifications to the hardware configuration of your own cluster; see the sizing sketch after this list.
- We recommend that you do not use the `obd cluster autodeploy` command to deploy the cluster. To ensure stability, this command does not maximize resource utilization; for example, it does not use all of the available memory. If you do use it, we recommend that you modify the configuration file separately to maximize resource utilization.
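As a hedged illustration of such an adjustment, the sketch below scales the resource unit down for a smaller server. The numbers are examples rather than recommendations, and the `ALTER RESOURCE UNIT` statement follows the same pattern as the `CREATE RESOURCE UNIT` statement above:

```bash
# Illustrative only: resize tpch_unit for a 16-core, 64 GB OBServer node,
# leaving headroom for the sys tenant and the operating system.
obclient -h$host_ip -P$host_port -uroot@sys -A \
  -e "ALTER RESOURCE UNIT tpch_unit max_cpu 12, memory_size '40g';"
```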
Deploy a cluster
Four servers are required to run this test. Deploy TPC-H and obd on one server for stress testing, and deploy an OceanBase cluster using obd on the other three servers in the 1-1-1 architecture, which means that the OceanBase cluster has three zones, with each containing one OBServer node.
Note
In the TPC-H test, the server for deploying TPC-H and obd requires only 4 CPU cores and 16 GB of memory.
After successful deployment, create the tenant and users required for running the TPC-H benchmark. The `sys` tenant is a built-in system tenant for managing the cluster; do not use the `sys` tenant to run the benchmark. Set the `primary_zone` parameter of the tenant to `RANDOM`. The value `RANDOM` indicates that the leader of each new table partition is randomly assigned to one of the three OBServer nodes.
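The exact user and database names are up to you. The following minimal sketch creates the database and user referenced later in this topic; the names and the `FILE` grant mirror the manual-test sections below:

```bash
# A minimal sketch: create the test database and user inside the tpch_mysql tenant.
# Adapt the names, password, and host/port placeholders to your environment.
obclient -h$host_ip -P$host_port -uroot@tpch_mysql -A -e "
  CREATE DATABASE IF NOT EXISTS tpch_100g_part;
  CREATE USER 'tpch_100g_part' IDENTIFIED BY 'test';
  GRANT ALL PRIVILEGES ON tpch_100g_part.* TO 'tpch_100g_part';
  GRANT FILE ON *.* TO 'tpch_100g_part';
"
```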
Use obd to run the benchmark
```bash
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
sudo yum install obtpch
sudo ln -s /usr/tpc-h-tools/tpc-h-tools/ /usr/local/
obd test tpch obperf --tenant=tpch_mysql -s 100 --remote-tbl-dir=/tmp/tpch100
```
Considerations
- For more information about the parameters for running the TPC-H benchmark by using obd, see obd test tpch.
- In this example, default values are used for most parameters. You can adjust their values as needed. For example, the cluster name in this example is `obperf`, and the tenant name is `tpch_mysql`.
- If you use obd to run the test, you must also use obd to install and deploy the cluster. Otherwise, obd cannot obtain the cluster information, and performance tuning cannot be performed based on the cluster configuration.
- If you want to change the password of the `sys` tenant on a terminal, you must log in to the terminal, change the password of the `sys` tenant back to the default value, and then run the `obd cluster edit-config` command to set a password for the `sys` tenant in the configuration file. The password is specified by the `root_password` parameter. After the `obd cluster edit-config` command is executed, you must run the `obd cluster reload` command for the modification to take effect.
- After you run the `obd test tpch` command, the system lists the test steps and outputs in detail. The larger the data amount, the longer the test takes.
- The remote directory specified by `--remote-tbl-dir` must have sufficient space to store the TPC-H data. We therefore recommend that you store test data on a separate disk.
- The `obd test tpch` command automatically completes all operations, including generating and transmitting the test data, tuning OceanBase Database parameters, loading the data, and running the test. If an error occurs during the process, you can retry by referring to obd test tpch. For example, you can skip data generation and transmission, and directly load data and run the test.
Manually run the TPC-H benchmark
This section describes how to use the official TPC-H tools to run the benchmark step by step. Running the benchmark manually gives you a better understanding of OceanBase Database, especially the effect of key parameter settings.
Optimize the test environment
Environment optimization involves system-level parameters, which you set in the `sys` tenant, and tenant-level variables, which you set as the test user.
Optimize OceanBase Database parameters.
Run the `obclient -h$host_ip -P$host_port -uroot@sys -A` command to log in to the `sys` tenant, and then execute the following statements:

```sql
ALTER SYSTEM flush plan cache GLOBAL;
ALTER SYSTEM SET enable_sql_audit=false;
SELECT sleep(5);
ALTER SYSTEM SET enable_perf_event=false;
ALTER SYSTEM SET syslog_level='PERF';
ALTER SYSTEM SET enable_record_trace_log=false;
```

Set the tenant parameters.
Run the `obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A` command to log in as the test user, and then execute the following statements:

```sql
SET GLOBAL ob_sql_work_area_percentage = 80;
SET GLOBAL ob_query_timeout = 36000000000;
SET GLOBAL ob_trx_timeout = 36000000000;
SET GLOBAL max_allowed_packet = 67108864;
-- parallel_servers_target = max_cpu * server_num * 8
SET GLOBAL parallel_servers_target = 624;
```
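To confirm that the settings took effect, you can read a variable back from a new session. The value 624 here comes from max_cpu (26) multiplied by 3 OBServer nodes and by 8. A minimal check:

```bash
# Read back a tenant variable from the test user's session.
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A \
  -e "SHOW GLOBAL VARIABLES LIKE 'parallel_servers_target';"
```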
Install the TPC-H tools
Download the TPC-H tools. For more information, visit the TPC-H Tools Download page.
Decompress the package and enter the TPC-H directory.
```bash
[wieck@localhost ~] $ unzip 7e965ead-8844-4efa-a275-34e35f8ab89b-tpc-h-tool.zip
[wieck@localhost ~] $ cd TPC-H_Tools_v3.0.0
```

Copy the `makefile.suite` file as `Makefile`.

```bash
[wieck@localhost TPC-H_Tools_v3.0.0] $ cd dbgen/
[wieck@localhost dbgen] $ cp makefile.suite Makefile
```

Redefine the `CC`, `DATABASE`, `MACHINE`, and `WORKLOAD` parameters in the `Makefile`.

```bash
[wieck@localhost dbgen] $ vim Makefile

CC      = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE = MYSQL
MACHINE  = LINUX
WORKLOAD = TPCH
```

Add new macro definitions to the `tpcd.h` file.

```bash
[wieck@localhost dbgen] $ vim tpcd.h

#ifdef MYSQL
#define GEN_QUERY_PLAN ""
#define START_TRAN "START TRANSACTION"
#define END_TRAN "COMMIT"
#define SET_OUTPUT ""
#define SET_ROWCOUNT "limit %d;\n"
#define SET_DBASE "use %s;\n"
#endif
```

Compile the file.

```bash
make
```
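If compilation succeeds, the `dbgen` data generator and `qgen` query generator are built in the current directory; a quick check:

```bash
# Both binaries should be present and executable after make completes.
ls -l dbgen qgen
```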
Generate data
You can generate 10 GB, 100 GB, or 1 TB of data as needed. The following example shows the command to generate 100 GB of data.
```bash
./dbgen -s 100
mkdir tpch100
mv *.tbl tpch100
```
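Generating 100 GB serially can take a while. As a hedged sketch, assuming your dbgen build supports the `-C` (number of chunks) and `-S` (chunk step) options shown by `./dbgen -h`, you can generate chunks in parallel:

```bash
# Generate the 100 GB data set as 8 chunks, one background process each.
for step in $(seq 1 8); do
  ./dbgen -s 100 -C 8 -S "$step" -f &   # -f overwrites existing files
done
wait  # Block until all chunk generators finish.
```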
Generate SQL query statements
Note
You can perform the following steps to generate SQL query statements and then modify the statements. Alternatively, you can use the SQL query statements provided on GitHub. To use the SQL query statements provided on GitHub, you must change the value of the cpu_num field in the statements to the actual degree of parallelism (DOP).
Copy the `qgen` and `dists.dss` files to the `queries` directory.

```bash
cp qgen queries
cp dists.dss queries
```

In the `queries` directory, create a script named `gen.sh` to generate SQL query statements.

```bash
[wieck@localhost queries] $ vim gen.sh

#!/usr/bin/bash
for i in {1..22}
do
./qgen -d $i -s 100 > db"$i".sql
done
```

Run the `gen.sh` script.

```bash
chmod +x gen.sh
./gen.sh
```

Modify the SQL query statements.

```bash
dos2unix *
```

For more information about the modified SQL query statements, visit GitHub. You must change the value of the `cpu_num` field in the SQL statements to the actual DOP. To ensure the best performance, we recommend that you set the value to the total number of available CPU cores.

You can execute the following statement in the `sys` tenant to query the total number of available CPU cores for a tenant:

```sql
SELECT sum(cpu_capacity_max) FROM __all_virtual_server;
```

Here is a sample SQL query statement `q1` after modification:

```sql
SELECT /*+ parallel(96) */   -- Specify the execution DOP.
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```
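Later, once the tables are created and loaded, you can time a single modified query by hand to sanity-check the DOP setting. A minimal sketch, assuming the test user and database from the load step and the `db1.sql` file generated above:

```bash
# Run Q1 once and report wall-clock time; -c preserves the /*+ parallel */ hint.
time obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name -c < db1.sql
```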
Create tables
Create a schema file named `create_tpch_mysql_table_part.ddl`. Note that `cpu_num` in the following statements is a placeholder for the partition count; see the replacement sketch after the DDL.

```sql
DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS partsupp;
DROP TABLE IF EXISTS part;
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS supplier;
DROP TABLE IF EXISTS nation;
DROP TABLE IF EXISTS region;
DROP TABLEGROUP IF EXISTS tpch_tg_lineitem_order_group;
DROP TABLEGROUP IF EXISTS tpch_tg_partsupp_part;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_lineitem_order_group binding true partition by key 1 partitions cpu_num;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_partsupp_part binding true partition by key 1 partitions cpu_num;
DROP TABLE IF EXISTS lineitem;
CREATE TABLE lineitem (
l_orderkey BIGINT NOT NULL,
l_partkey BIGINT NOT NULL,
l_suppkey INTEGER NOT NULL,
l_linenumber INTEGER NOT NULL,
l_quantity DECIMAL(15,2) NOT NULL,
l_extendedprice DECIMAL(15,2) NOT NULL,
l_discount DECIMAL(15,2) NOT NULL,
l_tax DECIMAL(15,2) NOT NULL,
l_returnflag char(1) DEFAULT NULL,
l_linestatus char(1) DEFAULT NULL,
l_shipdate date NOT NULL,
l_commitdate date DEFAULT NULL,
l_receiptdate date DEFAULT NULL,
l_shipinstruct char(25) DEFAULT NULL,
l_shipmode char(10) DEFAULT NULL,
l_comment varchar(44) DEFAULT NULL,
PRIMARY KEY (l_orderkey, l_linenumber)) row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key (l_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
o_orderkey bigint not null,
o_custkey bigint not null,
o_orderstatus char(1) default null,
o_totalprice bigint default null,
o_orderdate date not null,
o_orderpriority char(15) default null,
o_clerk char(15) default null,
o_shippriority bigint default null,
o_comment varchar(79) default null,
PRIMARY KEY (o_orderkey)) row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key(o_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS partsupp;
CREATE TABLE partsupp (
ps_partkey bigint not null,
ps_suppkey bigint not null,
ps_availqty bigint default null,
ps_supplycost bigint default null,
ps_comment varchar(199) default null,
PRIMARY KEY (ps_partkey, ps_suppkey)) row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(ps_partkey) partitions cpu_num;
DROP TABLE IF EXISTS part;
CREATE TABLE part (
p_partkey bigint not null,
p_name varchar(55) default null,
p_mfgr char(25) default null,
p_brand char(10) default null,
p_type varchar(25) default null,
p_size bigint default null,
p_container char(10) default null,
p_retailprice bigint default null,
p_comment varchar(23) default null,
PRIMARY KEY (p_partkey)) row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(p_partkey) partitions cpu_num;
DROP TABLE IF EXISTS customer;
CREATE TABLE customer (
c_custkey bigint not null,
c_name varchar(25) default null,
c_address varchar(40) default null,
c_nationkey bigint default null,
c_phone char(15) default null,
c_acctbal bigint default null,
c_mktsegment char(10) default null,
c_comment varchar(117) default null,
PRIMARY KEY (c_custkey)) row_format = condensed
partition by key(c_custkey) partitions cpu_num;
DROP TABLE IF EXISTS supplier;
CREATE TABLE supplier (
s_suppkey bigint not null,
s_name char(25) default null,
s_address varchar(40) default null,
s_nationkey bigint default null,
s_phone char(15) default null,
s_acctbal bigint default null,
s_comment varchar(101) default null,
PRIMARY KEY (s_suppkey)) row_format = condensed
partition by key(s_suppkey) partitions cpu_num;
DROP TABLE IF EXISTS nation;
CREATE TABLE nation (
n_nationkey bigint not null,
n_name char(25) default null,
n_regionkey bigint default null,
n_comment varchar(152) default null,
PRIMARY KEY (n_nationkey)) row_format = condensed;
DROP TABLE IF EXISTS region;
CREATE TABLE region (
r_regionkey bigint not null,
r_name char(25) default null,
r_comment varchar(152) default null,
PRIMARY KEY (r_regionkey)) row_format = condensed;
```
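The `cpu_num` placeholder in the tablegroup and partition clauses must be replaced with a real partition count before the DDL can execute. A minimal sketch, assuming a DOP of 96 that matches the CPU-core query in the previous section:

```bash
# Substitute the placeholder, then execute the DDL as the test user.
sed -i 's/cpu_num/96/g' create_tpch_mysql_table_part.ddl
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name < create_tpch_mysql_table_part.ddl
```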
Load data
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can perform the following steps to load data:
Create a directory for loading scripts.
```bash
[wieck@localhost dbgen] $ mkdir load
[wieck@localhost dbgen] $ cd load
[wieck@localhost load] $ cp xx/create_tpch_mysql_table_part.ddl ./
```

Create a script named `load.py`.

```bash
[wieck@localhost load] $ vim load.py
```

```python
#!/usr/bin/env python
#-*- encoding:utf-8 -*-
import os
import sys
import time
import commands

hostname='$host_ip'    # Notice!! Specify the IP address of an OBServer node, such as OBServer node A.
port='$host_port'      # The port number of OBServer node A.
tenant='$tenant_name'  # The tenant name.
user='$user'           # The username.
password='$password'   # The password.
data_path='$path'      # Notice!! Specify the directory of the tbl files on OBServer node A.
db_name='$db_name'     # The database name.

# Create the tables.
cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s < create_tpch_mysql_table_part.ddl'%(hostname,port,user,tenant,password,db_name)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s -e "show tables;" '%(hostname,port,user,tenant,password,db_name)
result = commands.getstatusoutput(cmd_str)
print result
# Load data into each table. The -c option preserves the /*+ parallel */ hint.
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/customer.tbl' into table customer fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/lineitem.tbl' into table lineitem fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/nation.tbl' into table nation fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/orders.tbl' into table orders fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/partsupp.tbl' into table partsupp fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/part.tbl' into table part fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/region.tbl' into table region fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/supplier.tbl' into table supplier fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path)
result = commands.getstatusoutput(cmd_str)
print result
```

Load the data.
Notice
To load data, you need to install the OBClient.
```bash
$ python load.py
(0,'')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.\nTABLE_NAME\nT1\nLINEITEM\nORDERS\nPARTSUPP\nPART\nCUSTOMER\nSUPPLIER\nNATION\nREGION')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
(0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')
```

Perform a major compaction.
Note
Log in to the `sys` tenant to perform a major compaction.

```sql
MySQL [(none)]> use oceanbase
Database changed
MySQL [oceanbase]> ALTER SYSTEM major freeze tenant=<your tenant name>;
Query OK, 0 rows affected
```

Check whether the major compaction is completed.
```sql
MySQL [oceanbase]> SELECT FROZEN_SCN, LAST_SCN FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
+---------------------+---------------------+
| FROZEN_SCN          | LAST_SCN            |
+---------------------+---------------------+
| 1667239201167716767 | 1667239201167716767 |
| 1667239200111919300 | 1667239200111919300 |
| 1667239201167452168 | 1667239201167452168 |
| 1667239201168053124 | 1667239201168053124 |
| 1667239201167520213 | 1667239201167520213 |
+---------------------+---------------------+
```

If all `FROZEN_SCN` and `LAST_SCN` values are equal, the major compaction is completed.

Manually collect statistics.
Run the `obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A -D$database` command to log in as the test user, and then execute the following statement:

```sql
call dbms_stats.gather_schema_stats('$db_name', degree=>96);
```
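Optionally, spot-check the load and statistics with the MySQL-compatible `information_schema` views (the row counts are estimates). For example, at scale factor 100 the `lineitem` table should hold roughly 600 million rows:

```bash
# Estimated row counts per table after loading and statistics collection.
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -D$db_name \
  -e "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = '$db_name';"
```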
Run the test
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can perform the following steps to run the test:
Write a script named `tpch.sh` in the `queries` directory.

```bash
[wieck@localhost queries] $ vim tpch.sh

#!/bin/bash
TPCH_TEST="obclient -h $host_ip -P $host_port -utpch_100g_part@tpch_mysql -D tpch_100g_part -ptest -c"

# Run a warmup.
for i in {1..22}
do
    sql1="source db${i}.sql"
    echo $sql1 | $TPCH_TEST > db${i}.log || ret=1
done

# Run the formal test.
for i in {1..22}
do
    starttime=`date +%s%N`
    echo `date '+[%Y-%m-%d %H:%M:%S]'` "BEGIN Q${i}"
    sql1="source db${i}.sql"
    echo $sql1 | $TPCH_TEST > db${i}.log || ret=1
    stoptime=`date +%s%N`
    costtime=`echo $stoptime $starttime | awk '{printf "%0.2f\n", ($1 - $2) / 1000000000}'`
    echo `date '+[%Y-%m-%d %H:%M:%S]'` "END,COST ${costtime}s"
done
```

Run the test script.

```bash
sh tpch.sh
```

Note
For more information about the test results, see TPC-H benchmark report of OceanBase Database.
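The script prints one `END,COST …s` line per query; a small sketch for capturing the output and totaling the elapsed time:

```bash
# Save the run output, then sum the per-query COST values.
sh tpch.sh | tee tpch_result.log
grep 'END,COST' tpch_result.log | sed 's/.*COST //; s/s$//' \
  | awk '{sum += $1} END {printf "Total: %.2f s over %d queries\n", sum, NR}'
```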
FAQ
What do I do when data import fails with the following error message?

```bash
ERROR 1017 (HY000) at line 1: File not exist
```

Place the tbl files in a directory on the server that hosts OceanBase Database, because only local data files can be imported and loaded.
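For example, a hedged sketch of copying the generated files to an OBServer node (the host name and paths are illustrative):

```bash
# Copy the generated .tbl files from the stress-test server to OBServer node A.
scp tpch100/*.tbl admin@observer-a:/data/tpch100/
```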
What do I do when the following error message is returned while viewing the data?

```bash
ERROR 4624 (HY000): No memory or reach tenant memory limit
```

The tenant is running out of memory. Allocate more memory to the tenant.
What do I do when the following error message is returned during data import?

```bash
ERROR 1227 (42501) at line 1: Access denied
```

Grant the FILE privilege to the login user by running the following command:

```sql
grant file on *.* to tpch_100g_part;
```

What do I do when the following error message is returned after I run the `dos2unix *` command to modify the SQL query statements?

```bash
-bash: dos2unix: command not found
```

Install dos2unix by running the following command:

```bash
yum install -y dos2unix
```