Introduction to TPC-H
TPC-H is a business intelligence test set developed by the Transaction Processing Performance Council (TPC) of the United States to simulate the decision-making process of applications. It is commonly used in academia and industry for evaluating the performance of decision support applications. This BI benchmark comprehensively evaluates the overall business computing capabilities of decision support systems and imposes high requirements on vendors of such systems. Due to its universal and practical business value, TPC-H is widely used in credit analysis and credit card analysis of banks, telecom operation analysis, tax analysis, and tobacco industry decision analysis.
The TPC-H benchmark is the successor of TPC-D, a benchmark developed by TPC in 1994 for decision support systems. TPC-H implements a data warehouse in the Third Normal Form (3NF) that contains eight basic relationships. The main evaluation metric is the response time of each query from its submission to the return of results. The TPC-H benchmark measures the number of queries executed per hour (QphH@size) in a database. H indicates the average number of complex queries executed per hour and size indicates the scale of the database. This metric reflects the query processing capacity of a database system. The TPC-H benchmark is modeled based on real-world production and operation environments. This enables it to evaluate some key performance metrics that cannot be evaluated by other benchmarks. The TPC-H benchmark fills the gap in data warehouse testing and encourages database vendors and research institutions to push the decision support technology to its limit.
Overview
This topic describes how to run the TPC-H benchmark on OceanBase Database in the following ways:
Automatically perform the TPC-H benchmark by using OceanBase Deployer (OBD).
Use the official TPC-H tools to manually run TPC-H benchmark step by step.
Note
OceanBase Database is extensively optimized in versions later than V4.0.0 to improve the user experience, ease of use, and database performance. The test method described in this topic tunes only the basic parameters for performance improvement.
Prepare the environment
Before testing, prepare the test environment as per the following requirements:
Java Development Kit (JDK): Use V1.8u131 or later.
Make: Run the
yum install makecommand to install make.GCC: Run the
yum install gcccommand to install GCC.MySQL-devel: Run the
yum install mysql-develcommand to install MySQL-devel.MySQL Connector/Python: Run the
sudo yum install MySQL-pythoncommand to install MySQL Connector/Python.PrettyTable: Run the
pip install prettytablecommand to install PrettyTable.Java Database Connectivity (JDBC): Use mysql-connector-java-5.1.47.
TPC-H tools: Click here to download the tools. If you use OBD to perform the test, you do not need to obtain the tools.
OBClient: For more information, see OBClient Documentation.
OceanBase Database: For more information, see Get started with OceanBase Database.
IOPS: We recommend that the disk IOPS be above 10,000.
Tenant specifications:
CREATE RESOURCE UNIT tpch_unit max_cpu 26, memory_size '70g'; CREATE RESOURCE POOL tpch_pool unit = 'tpch_unit', unit_num = 1, zone_list=('zone1','zone2','zone3'); CREATE TENANT tpch_mysql resource_pool_list=('tpch_pool'), zone_list('zone1', 'zone2', 'zone3'), primary_zone=RANDOM, locality='F@zone1,F@zone2,F@zone3' set variables ob_compatibility_mode='mysql', ob_tcp_invited_nodes='%',secure_file_priv = '';Notice
- In the sample statements above, the tenant specifications are configured based on the hardware configurations in the TPC-H benchmark report of OceanBase Database. You need to adjust the specifications based on the hardware configurations of your database.
- We recommend that you not use the
obd cluster autodeploycommand to deploy the cluster. To ensure stability, this command does not maximize the resource utilization. For example, it does not use all of the memory. We recommend that you modify the configuration file separately to maximize the resource utilization.
Deploy the cluster
Four servers are required to run this test. TPC-H and OBD are deployed on a separate server for stress testing. When you use OBD to deploy an OceanBase cluster, three servers are required, which are evenly distributed in three zones.
Note
In the TPC-H test, each server for deploying TPC-H and OBD requires only four CPU cores and 16 GB of memory.
After successful deployment, create the tenant and users required for running the TPC-H benchmark. The
systenant is a built-in system tenant used for managing the cluster. Do not use thesystenant to run the benchmark. Set the value of theprimary_zoneparameter of the tenant toRANDOM. The valueRANDOMindicates that the leader of the new table partitions is randomly assigned to one of the three servers.
Use OBD to run the test
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
sudo yum install obtpch
sudo ln -s /usr/tpc-h-tools/tpc-h-tools/ /usr/local/
obd test tpch obperf --tenant=tpch_mysql -s 100 --remote-tbl-dir=/tmp/tpch100
Considerations
For more information about the parameters required for running the TPC-H benchmark by using OBD, see obd test tpch.
In this example, default values are used for most parameters. In the user scenario, you can set the parameters based on the actual situation. For example, the cluster name used in this example is
obperfand the tenant name istpch_mysql.If you use OBD to run the test, you must use OBD to install and deploy the cluster. Otherwise, the information about the cluster cannot be obtained. As a result, performance tuning cannot be performed based on the configurations of the cluster.
If you want to change the password of the
systenant on a terminal, you must log on to the terminal, change the password of thesystenant to the default value, and then run the obd cluster edit-config command to set a password for thesystenant in the configuration file. The password is specified by the# root_password: # root user passwordparameter. After theobd cluster edit-configcommand is executed, you must run theobd cluster reloadcommand for the modification to take effect.After you run the
obd test tpchcommand, the system lists the test steps and outputs in detail. A larger data amount requires a longer test time.The remote directory
remote-tbl-dirmust have sufficient space to store the TPC-H data. Therefore, we recommend that you store test data on a separate disk.The
obd test tpchcommand automatically completes all operations, including the generation and transmission of test data, OceanBase Database parameter optimization, data loading, and testing. If an error occurs during the process, you can retry the test by referring to obd test tpch. For example, you can skip data generation and transmission, and directly load data and run the test.
Manually run the TPC-H benchmark
Use the official TPC-H tools to manually run the TPC-H benchmark step by step. By manually performing the TPC-H benchmark, you can better understand OceanBase Database, especially the settings of some parameters.
Optimize the test environment
To perform environment optimization, log on to the sys tenant.
Optimize OceanBase Database parameters.
Run the
obclient -h$host_ip -P$host_port -uroot@sys -Acommand in thesystenant. Then, execute the following statements:ALTER SYSTEM flush plan cache GLOBAL; ALTER system SET enable_sql_audit=false; select sleep(5); ALTER system SET enable_perf_event=false; ALTER system SET syslog_level='PERF'; alter system set enable_record_trace_log=false;Set the tenant parameters.
Run the
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -Acommand as the test user. Then, execute the following statements:SET GLOBAL ob_sql_work_area_percentage = 80; SET GLOBAL ob_query_timeout = 36000000000; SET GLOBAL ob_trx_timeout = 36000000000; SET GLOBAL max_allowed_packet = 67108864; # parallel_servers_target = max_cpu * server_num * 8 SET GLOBAL parallel_servers_target = 624;
Install the TPC-H tools
Download TPC-H tools. For more information, visit the TPC-H Tools Download page.
Unzip the package and enter the TPC-H directory.
[wieck@localhost ~] $ unzip 7e965ead-8844-4efa-a275-34e35f8ab89b-tpc-h-tool.zip [wieck@localhost ~] $ cd TPC-H_Tools_v3.0.0Replicate the
makefile.suitefile.[wieck@localhost TPC-H_Tools_v3.0.0] $ cd dbgen/ [wieck@localhost dbgen] $ cp makefile.suite MakefileRedefine the
CC,DATABASE,MACHINE, andWORKLOADparameters in themakefile.suitefile.[wieck@localhost dbgen] $ vim Makefile CC = gcc # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata) # SQLSERVER, SYBASE, ORACLE, VECTORWISE # Current values for MACHINE are: ATT, DOS, HP, IBM, ICL, MVS, # SGI, SUN, U2200, VMS, LINUX, WIN32 # Current values for WORKLOAD are: TPCH DATABASE= MYSQL MACHINE = LINUX WORKLOAD = TPCHAdd new macro definitions to the
tpcd.hfile.[wieck@localhost dbgen] $ vim tpcd.h #ifdef MYSQL #define GEN_QUERY_PLAN "" #define START_TRAN "START TRANSACTION" #define END_TRAN "COMMIT" #define SET_OUTPUT "" #define SET_ROWCOUNT "limit %d;\n" #define SET_DBASE "use %s;\n" #endifCompile the file.
make
Generate data
You can generate 10 GB, 100 GB, or 1 TB of data as needed. The following example shows the command to generate 100 GB of data.
./dbgen -s 100
mkdir tpch100
mv *.tbl tpch100
Generate SQL query statements
Note
You can perform the following steps to generate SQL query statements and then modify the statements. Alternatively, you can directly use the SQL query statements provided in GitHub. To use the SQL query statements provided in GitHub, you must change the value of the cpu_num field in the statements to the actual concurrency.
Copy the
qgenanddists.dssfiles to thequeriesdirectory.cp qgen queries cp dists.dss queriesIn the
queriesdirectory, create agen.shscript to generate SQL query statements.[wieck@localhost queries] $ vim gen.sh #!/usr/bin/bash for i in {1..22} do ./qgen -d $i -s 100 > db"$i".sql doneRun the
gen.shscript.chmod +x gen.sh ./gen.shModify the SQL query statements.
dos2unix *For more information about the modified SQL query statements, see GitHub. You need to change the value of the
cpu_numfield in the SQL statements to the actual concurrency. To ensure the best performance, we recommend that you set the value to the total number of available CPU cores.You can run the following command under the
systenant to view the total number of available CPU cores for a tenant:SELECT sum(cpu_capacity_max) FROM __all_virtual_server;Here is a sample SQL query statement q1 after modification:
SELECT /*+ parallel(96) */ ---Specify the execution concurrency. l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order FROM lineitem WHERE l_shipdate <= date '1998-12-01' - interval '90' day GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;
Create tables
Create a schema file named create_tpch_mysql_table_part.ddl.
DROP TABLE IF EXISTS lineitem;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS partsupp;
DROP TABLE IF EXISTS part;
DROP TABLE IF EXISTS customer;
DROP TABLE IF EXISTS supplier;
DROP TABLE IF EXISTS nation;
DROP TABLE IF EXISTS region;
DROP TABLEGROUP IF EXISTS tpch_tg_lineitem_order_group;
DROP TABLEGROUP IF EXISTS tpch_tg_partsupp_part;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_lineitem_order_group binding true partition by key 1 partitions cpu_num;
CREATE TABLEGROUP IF NOT EXISTS tpch_tg_partsupp_part binding true partition by key 1 partitions cpu_num;
DROP TABLE IF EXISTS lineitem;
CREATE TABLE lineitem (
l_orderkey BIGINT NOT NULL,
l_partkey BIGINT NOT NULL,
l_suppkey INTEGER NOT NULL,
l_linenumber INTEGER NOT NULL,
l_quantity DECIMAL(15,2) NOT NULL,
l_extendedprice DECIMAL(15,2) NOT NULL,
l_discount DECIMAL(15,2) NOT NULL,
l_tax DECIMAL(15,2) NOT NULL,
l_returnflag char(1) DEFAULT NULL,
l_linestatus char(1) DEFAULT NULL,
l_shipdate date NOT NULL,
l_commitdate date DEFAULT NULL,
l_receiptdate date DEFAULT NULL,
l_shipinstruct char(25) DEFAULT NULL,
l_shipmode char(10) DEFAULT NULL,
l_comment varchar(44) DEFAULT NULL,
PRIMARY KEY(l_orderkey, l_linenumber))row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key (l_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
o_orderkey bigint not null,
o_custkey bigint not null,
o_orderstatus char(1) default null,
o_totalprice bigint default null,
o_orderdate date not null,
o_orderpriority char(15) default null,
o_clerk char(15) default null,
o_shippriority bigint default null,
o_comment varchar(79) default null,
PRIMARY KEY (o_orderkey))row_format = condensed
tablegroup = tpch_tg_lineitem_order_group
partition by key(o_orderkey) partitions cpu_num;
DROP TABLE IF EXISTS partsupp;
CREATE TABLE partsupp (
ps_partkey bigint not null,
ps_suppkey bigint not null,
ps_availqty bigint default null,
ps_supplycost bigint default null,
ps_comment varchar(199) default null,
PRIMARY KEY (ps_partkey, ps_suppkey))row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(ps_partkey) partitions cpu_num;
DROP TABLE IF EXISTS part;
CREATE TABLE part (
p_partkey bigint not null,
p_name varchar(55) default null,
p_mfgr char(25) default null,
p_brand char(10) default null,
p_type varchar(25) default null,
p_size bigint default null,
p_container char(10) default null,
p_retailprice bigint default null,
p_comment varchar(23) default null,
PRIMARY KEY (p_partkey))row_format = condensed
tablegroup tpch_tg_partsupp_part
partition by key(p_partkey) partitions cpu_num;
DROP TABLE IF EXISTS customer;
CREATE TABLE customer (
c_custkey bigint not null,
c_name varchar(25) default null,
c_address varchar(40) default null,
c_nationkey bigint default null,
c_phone char(15) default null,
c_acctbal bigint default null,
c_mktsegment char(10) default null,
c_comment varchar(117) default null,
PRIMARY KEY (c_custkey))row_format = condensed
partition by key(c_custkey) partitions cpu_num;
DROP TABLE IF EXISTS supplier;
CREATE TABLE supplier (
s_suppkey bigint not null,
s_name char(25) default null,
s_address varchar(40) default null,
s_nationkey bigint default null,
s_phone char(15) default null,
s_acctbal bigint default null,
s_comment varchar(101) default null,
PRIMARY KEY (s_suppkey))row_format = condensed
partition by key(s_suppkey) partitions cpu_num;
DROP TABLE IF EXISTS nation;
CREATE TABLE nation (
n_nationkey bigint not null,
n_name char(25) default null,
n_regionkey bigint default null,
n_comment varchar(152) default null,
PRIMARY KEY (n_nationkey))row_format = condensed;
DROP TABLE IF EXISTS region;
CREATE TABLE region (
r_regionkey bigint not null,
r_name char(25) default null,
r_comment varchar(152) default null,
PRIMARY KEY (r_regionkey))row_format = condensed;
Load data
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can take the following steps to load data:
Create a directory for loading scripts.
[wieck@localhost dbgen] $ mkdir load [wieck@localhost dbgen] $ cd load [wieck@localhost load] $ cp xx/create_tpch_mysql_table_part.ddl ./Create the
load.pyscript.[wieck@localhost load] $ vim load.py #!/usr/bin/env python #-*- encoding:utf-8 -*- import os import sys import time import commands hostname='$host_ip' # Notice! ! You need to specify the IP address of an OBServer node, such as OBServer node A. port='$host_port' # The port number of OBServer node A. tenant='$tenant_name' # The tenant name. user='$user' # The username. password='$password' # The password. data_path='$path' # Notice! ! You need to specify the directory of tbl on OBServer node A. db_name='$db_name' # The database name. # Create a table. cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s < create_tpch_mysql_table_part.ddl'%(hostname,port,user,tenant,password,db_name) result = commands.getstatusoutput(cmd_str) print result cmd_str='obclient -h%s -P%s -u%s@%s -p%s -D%s -e "show tables;" '%(hostname,port,user,tenant,password,db_name) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/customer.tbl' into table customer fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/lineitem.tbl' into table lineitem fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/nation.tbl' into table nation fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/orders.tbl' into table orders fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -D%s -e "load data /*+ parallel(80) */ infile '%s/partsupp.tbl' into table partsupp fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/part.tbl' into table part fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/region.tbl' into table region fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print result cmd_str=""" obclient -h%s -P%s -u%s@%s -p%s -c -D%s -e "load data /*+ parallel(80) */ infile '%s/supplier.tbl' into table supplier fields terminated by '|';" """ %(hostname,port,user,tenant,password,db_name,data_path) result = commands.getstatusoutput(cmd_str) print resultLoad the data.
Notice
To load data, you need to install the OBClient.
python load.py (0,'') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.\nTABLE_NAME\nT1\nLINEITEM\nORDERS\nPARTSUPP\nPART\nCUSTOMER\nSUPPLIER\nNATION\nREGION') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.') (0, 'obclient: [Warning] Using a password on the command line interface can be insecure.')Perform a major compaction.
Note
To perform a major compaction, you need to log on as the
systenant.MySQL [(none)]> use oceanbase Database changed MySQL [oceanbase]> ALTER SYSTEM major freeze tenant=<your tenant name>; Query OK, 0 rows affectedCheck whether the major compaction is complete.
MySQL [oceanbase]> SELECT FROZEN_SCN, LAST_SCN FROM oceanbase.CDB_OB_MAJOR_COMPACTION; +---------------------+---------------------+ | FROZEN_SCN | LAST_SCN | +---------------------+---------------------+ | 1667239201167716767 | 1667239201167716767 | | 1667239200111919300 | 1667239200111919300 | | 1667239201167452168 | 1667239201167452168 | | 1667239201168053124 | 1667239201168053124 | | 1667239201167520213 | 1667239201167520213 | +---------------------+---------------------+If all
FROZEN_SCNandLAST_SCNvalues are equal, the major compaction is complete.Manually collect statistics.
Run the
obclient -h$host_ip -P$host_port -u$user@$tenant -p$password -A -D$databasecommand as the test user. Then, run the following command:call dbms_stats.gather_schema_stats('$db_name',degree=>96);
Run the test
You can write scripts based on the data and SQL query statements generated in the preceding steps. As an example, you can take the following steps to perform the test:
Write the
tpch.shscript in thequeriesdirectory.[wieck@localhost queries] $ vim tpch.sh #!/bin/bash TPCH_TEST="obclient -h $host_ip -P $host_port -utpch_100g_part@tpch_mysql -D tpch_100g_part -ptest -c" # Run a warmup. for i in {1..22} do sql1="source db${i}.sql" echo $sql1| $TPCH_TEST >db${i}.log || ret=1 done # Run the formal test. for i in {1..22} do starttime=`date +%s%N` echo `date '+[%Y-%m-%d %H:%M:%S]'` "BEGIN Q${i}" sql1="source db${i}.sql" echo $sql1| $TPCH_TEST >db${i}.log || ret=1 stoptime=`date +%s%N` costtime=`echo $stoptime $starttime | awk '{printf "%0.2f\n", ($1 - $2) / 1000000000}'` echo `date '+[%Y-%m-%d %H:%M:%S]'` "END,COST ${costtime}s" doneRun the test script.
sh tpch.shNote
For more information about the test results, see TPC-H benchmark report of OceanBase Database.
FAQ
Q: What do I do when data import fails? Here is the error message:
ERROR 1017 (HY000) at line 1: File not existA: Place the tbl file in a directory on the server that hosts OceanBase Database, because only local data can be imported and loaded.
Q: What do I do when an error occurred when I was viewing the data? Here is the error message:
ERROR 4624 (HY000): No memory or reach tenant memory limitA: You are running out of memory. Allocate more memory to the tenant.
Q: What do I do when an error occurred during data import? Here is the error message:
ERROR 1227 (42501) at line 1: Access deniedA: You need to grant the access privilege to the logon user. Run the following command:
grant file on *.* to tpch_100g_part;Q: What do I do when an error occurred when I was adjusting dos2unix * for the SQL query? Here is the error message:
-bash: dos2unix: command not foundA: You need to install dos2unix. Run the following command to install it:
yum install -y dos2unix