What does HTAP really mean?

Gartner first introduced Hybrid Transaction/Analytical Processing (HTAP) in 2014 and defined it as a capability that supports both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) tasks at the same time.

This concept has quickly sparked a lot of discussion in the database community:

1. What does HTAP mean to users? Do they really want a system capable of both OLTP and OLAP?

2. To process the same set of data, does HTAP sacrifice the performance of OLTP or OLAP, or can it ensure the performance of both?

3. What technologies are required to build an HTAP-capable system? Which storage format is better, the OLTP-friendly row-based storage or the OLAP-friendly column-based storage? How to prevent OLTP and OLAP tasks from interfering with each other?

This article seeks to answer these questions based on OceanBase’s engineering practices in the past decade.

Prior to HTAP: The Separation of OLTP and OLAP

When talking about a database, users usually mean a database management system (DBMS), which, in its early years of development, did not separate OLTP and OLAP.

In 1970, Edgar F. Codd invented the relational model, an approach closely related to OLTP. Six years later, Jim Gray developed a transaction model. As databases were adopted by more and more applications and the processing capacity of a single database was exhausted ever faster, people separated complex analytical workloads from the database. For example, Barry Devlin and Paul Murphy developed the “business data warehouse” in 1988, and Codd defined OLAP in 1993 with 12 rules to distinguish it from OLTP.

Today, OLAP is implemented by various methods, such as relational OLAP (ROLAP) of a relational database capable of massively parallel processing (MPP), multi-dimensional OLAP (MOLAP) of a data cube, and big data technologies based on the Hadoop or Spark framework.

In classic database architecture, OLTP and OLAP tasks are processed separately. Data is extracted regularly from the OLTP system to the OLAP system through the ETL process. As a result, a data warehouse exports analytical results one or more days after the data is produced (T+1, or T+n).

The Realization of HTAP

The OLTP > ETL > OLAP model is still widely used today. However, this processing mode not only delays data for analytics but also complicates data synchronization, leading to frequent online business faults in production systems.

A natural idea, then, is to simplify the application by analyzing data while it is being processed in transactions, which is exactly what HTAP does.

The preferred way: Adding OLAP capability to OLTP databases

To achieve this goal, one method is to introduce the OLAP capability into an OLTP database, as Oracle and SQL Server do. Unlike MySQL, an OLTP system mainly for simple queries based on key-value access, Oracle and SQL Server not only process complicated queries but also support OLAP, through the In-Memory Column Store (IMC) and columnstore indexes, respectively. This method, where the OLAP capability is developed on top of OLTP, is referred to as “real-time operational analytics”. It pushes OLTP performance to its limits and represents what I think true HTAP is. OceanBase Database adopts this method, except that it is built on a native distributed architecture, which can handle a larger data volume than monolithic RDBMSs.

A more challenging way: Adding OLTP capability to OLAP databases

Another method is to introduce real-time writing capabilities to an OLAP database, such as some MPP databases that are built for real-time OLAP tasks. This method is essentially all about real-time analytics and generally results in a real-time data warehouse. Due to the challenging OLTP requirements of the core business, however, such a data warehouse does not support all transaction processing capabilities. For example, it does not support cross-server strong consistency, large or long-running transactions, triggers, foreign keys, or constraints.

True HTAP, or real-time operational analytics, requires high-performance OLTP on top of real-time analytics. A system with such HTAP capabilities inherently supports real-time analytics, a feature of real-time data warehouses.

Typical HTAP Scenarios

While a true HTAP system supports both OLTP and OLAP, it is not a one-size-fits-all solution due to its engineering complexity and lack of technical maturity. An offline data warehouse, for example, will do a better job with an OLAP architecture rather than with an HTAP architecture.

Below are some typical scenarios that fit HTAP.

Enterprise-level hybrid workloads

An open-source database like MySQL handles only simple queries. Users usually modify the business logic to get around complex ones. Enterprise-level workloads, in most cases, involve HTAP tasks where simple key-value queries are handled along with complex batch processing jobs, or even real-time analysis in order to generate reports. These tasks, such as those of an Enterprise Resource Planning (ERP) system, often use large or long-running transactions, as well as triggers, foreign keys, constraints, and other strict data validation features.

Real-time data mid-ends

Database sharding is a common strategy for MySQL databases. Data from all shards is synchronized to a dedicated aggregation database for real-time analytics. An HTAP system capable of distributed processing can take care of the workloads of both the sharded MySQL transaction databases and the aggregation database.

Unified processing of online and historical data

When designing a database system, a database administrator (DBA) often separates online data from historical data by creating an online database for transaction processing and a historical database for the storage of cold data that is regularly migrated from the online database. An HTAP system allows a DBA to integrate historical data and online data to support more flexible query methods with lower business complexity.

User-oriented real-time analytics

Conventional data warehouses are often built for in-house jobs, which are quite tolerant to latency. As data becomes more valuable, many transactions involve interactions with end users, requiring higher responsiveness and better processing performance of a system. For example, e-commerce platforms analyze the advertising plans and keywords of each advertiser or agent in real time, and payment platforms perform real-time risk control analysis on each transaction.

HTAP Myths

What’s the definition of the “same set of data”?

A true HTAP system supports both transaction and analytical operations on the “same set of data”. So, what exactly is the “same set of data”? To answer this question, keep in mind that the point is to figure out the most user-friendly and cost-effective way of data processing. I believe that different replicas of the data, or multiple forms of the data, such as column indexes, B-tree indexes, and bitmap indexes, can be considered as the “same set of data”, provided that the HTAP requirements are satisfied with the minimum data redundancy. For example, OceanBase Database often stores three or five replicas of the same set of data. One is the leader, and the others are followers. You can select a follower for OLAP queries. In addition, Oracle IMC allows you to create column indexes in memory for tables, and SQL Server supports column indexes on disk or memory for tables. From the perspective of a user, you are using “the same set of data of the same system”.

Is resource isolation for the “same set of data” possible?

I have been asked by many whether OceanBase Database supports resource isolation for the “same set of data”. Given the preceding explanation of the “same set of data”, this question is easy to answer. In an HTAP database dominated by OLTP workloads, we can perform both OLTP and OLAP tasks on the leader replica and configure logical resource isolation in the database. Otherwise, we can perform OLTP tasks on the leader and OLAP tasks on a dedicated or read-only follower, which means that resources are physically isolated.
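The physical-isolation option above can be pictured as a simple routing layer in front of the replicas. The following Python sketch is purely illustrative: the classifier heuristic, endpoint names, and `route` function are hypothetical and are not OceanBase APIs; a real system would classify queries in the optimizer and route connections via its proxy.

```python
# Hypothetical sketch: point reads/writes go to the leader replica,
# heavy analytical queries go to a read-only follower. All names here
# are illustrative, not OceanBase interfaces.

ANALYTIC_KEYWORDS = ("GROUP BY", "JOIN", "SUM(", "AVG(", "COUNT(")

def classify(sql: str) -> str:
    """Crude classifier: treat aggregation/join queries as OLAP."""
    upper = sql.upper()
    if upper.startswith(("INSERT", "UPDATE", "DELETE")):
        return "oltp"
    if any(kw in upper for kw in ANALYTIC_KEYWORDS):
        return "olap"
    return "oltp"

def route(sql: str, leader: str, follower: str) -> str:
    """Physically isolate workloads: OLTP on the leader, OLAP on a follower."""
    return leader if classify(sql) == "oltp" else follower

print(route("UPDATE orders SET status = 'paid' WHERE id = 42",
            "leader:2881", "follower:2881"))                      # leader:2881
print(route("SELECT region, SUM(amount) FROM orders GROUP BY region",
            "leader:2881", "follower:2881"))                      # follower:2881
```

The point of the sketch is only the shape of the decision: the same logical data is served by multiple replicas, and workload class decides which replica does the work.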

Some service providers in the industry have devised solutions that combine two different databases. For example, they may deploy a RocksDB or MySQL database for OLTP workloads and a Spark or ClickHouse database for OLAP workloads, and route different types of SQL queries within the system. Such solutions do not operate on the “same set of data” and do not provide the true HTAP capability. Why? The deployment of two databases inevitably leads to a significant increase in costs: to guarantee high availability, you must build a multi-replica disaster recovery architecture separately for the OLTP database and the OLAP database. Moreover, it is technically infeasible to ensure data consistency between the two databases.

Does HTAP require a company to use only one database system?

Just because HTAP deals with both OLTP and OLAP workloads doesn’t mean that HTAP is a panacea for all challenges or that a company can deploy only one HTAP database. That depends on a number of factors that are not necessarily technical. Normally, a company has multiple business departments, which means that the decision-makers for the OLTP and OLAP databases come from different departments, and the OLTP database is split by business. For most companies, it is impractical to deploy only one database system. A workable solution is to build an HTAP database for each business line. For example, a company can build an HTAP database for its trading business to support both real-time processing of online transactions and real-time analytics of historical orders.

Compared with assembling a system from multiple databases such as MySQL, ClickHouse, Elasticsearch, Presto, and Kylin, like building a mansion out of blocks, HTAP does simplify applications significantly. Nevertheless, even with a powerful HTAP system, the data warehouse is often deployed separately. On the one hand, real-time business and the data warehouse are managed by different departments. On the other hand, real-time business often deals with a single data source, while the data warehouse deals with multiple data sources across multiple business lines.

An HTAP system is not a synonym for a native distributed architecture. As I mentioned earlier, Oracle and SQL Server adopt a centralized architecture but can provide the HTAP capability. Oracle Exadata Cloud at Customer (ExaCC) is also a typical HTAP solution. That said, a native distributed architecture can arm a distributed HTAP database, such as OceanBase Database, with the capability to crunch massive data, greatly broadening the applications of an HTAP database.

Is there a trade-off between OLTP and OLAP?

A true HTAP system supports real-time analytics on the basis of high-performance OLTP. Theoretically, it is possible to provide HTAP without compromising analytical performance, because the “same set of data” in an HTAP system can be one of many replicas or forms of the data. For example, the leader can use row-based storage while the followers use column-based storage. However, as the true HTAP capability is developed on top of OLTP and involves more complicated engineering, the OLAP performance of HTAP systems is generally not as great as that of dedicated OLAP systems.

That said, HTAP is suitable for an OLTP system, or a system dealing with both OLTP and real-time OLAP tasks, rather than an offline data warehouse or the processing of massive unstructured data.

Core Technologies of HTAP

True HTAP is achievable by developing the OLAP capability based on an OLTP database. Traditional monolithic databases use disk-oriented B+ tree storage engines. A B+ tree solution is tailored for the real-time processing of small transactions. It shows excellent read and write performance but costs more in storage than newer data structures such as the log-structured merge-tree (LSM-tree). OceanBase Database, for example, uses an optimized LSM-tree-based storage engine. After adopting OceanBase Database for all business lines, Alipay cut its storage cost to about one third of that of the original B+ tree solution.
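To make the LSM-tree idea concrete, here is a deliberately tiny Python sketch of the structure, not OceanBase’s actual engine: writes land in an in-memory memtable, a full memtable is flushed as an immutable sorted run (sequential I/O, which is what makes LSM writes cheap), and reads consult the memtable first, then runs from newest to oldest.

```python
# Minimal LSM-tree sketch (illustrative only; class and field names
# are invented for this example).

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # mutable in-memory buffer
        self.runs = []            # immutable sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: write the memtable out as a sorted immutable run.
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then newest run to oldest
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                    # newer version lives in the memtable
print(db.get("a"))  # 3
print(db.get("b"))  # 2
```

A real engine adds write-ahead logging, background compaction of runs, and bloom filters to avoid probing every run on reads; the sketch only shows why writes become cheap sequential operations.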

To introduce the OLAP capability into an OLTP database, especially OLAP of massive data, I think the following jobs are necessary:

First, it is necessary to build the database in a native distributed architecture and use a low-cost storage engine to enhance its capability for big data processing.

This way, the database can handle not only real-time transactions of the latest data but also full analytics of massive historical data. OceanBase Database adopts an LSM-tree-based storage solution, which uses both row-based and column-based storage. It significantly reduces storage costs and achieves a great balance between OLTP and OLAP performance.

Second, the resources for OLTP and OLAP workloads must be isolated, including the physical isolation of replicas and the logical isolation of CPU, network, disk, and memory resources for the same replica.

Incorporating control groups (cgroups) into the database engine for logical resource isolation is a good solution.
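As a rough illustration of what cgroup-based isolation looks like at the OS level, the sketch below builds the cgroup v2 writes that would weight CPU between two workload classes. The group names and weights are hypothetical; a real engine would create these groups under the cgroup filesystem and move its OLTP and OLAP worker threads into them.

```python
# Illustrative plan for cgroup v2 CPU-weight isolation between OLTP
# and OLAP workloads sharing one server. The paths follow the cgroup
# v2 file convention; the "htap/oltp" and "htap/olap" group names are
# assumptions for this example.

def cgroup_cpu_plan(root="/sys/fs/cgroup/htap", oltp_weight=800, olap_weight=200):
    """Return the (file, value) writes that would cap the OLAP CPU share."""
    return [
        (f"{root}/oltp/cpu.weight", str(oltp_weight)),  # OLTP gets ~80% under contention
        (f"{root}/olap/cpu.weight", str(olap_weight)),  # OLAP throttled to ~20%
    ]

for path, value in cgroup_cpu_plan():
    print(path, value)
```

`cpu.weight` only matters under contention: when the OLAP side is idle, OLTP can use the whole machine, which is exactly the behavior you want from logical isolation on a shared replica.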

Third, the database must deliver great performance in processing complex queries and massive data queries, which involves optimizers, parallel execution, vectorized execution engines, and other core technologies.

In a distributed database, the optimizer must take into account the costs on a single server and across multiple servers, because its search space has increased dramatically. Other challenges include how to perform parallel execution, handle server faults, implement an efficient vectorized engine, and design the format of in-memory data.
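The vectorized-engine idea can be shown with a toy contrast between tuple-at-a-time and batch-at-a-time evaluation of `SUM(price * qty)`. This is only a control-flow illustration in plain Python; real vectorized engines operate on columnar batches with SIMD instructions and cache-friendly memory layouts.

```python
# Toy contrast: tuple-at-a-time vs vectorized (batch-at-a-time)
# evaluation of SUM(price * qty) over a small table.

rows = [(10.0, 2), (5.0, 4), (3.0, 1)]

def sum_row_at_a_time(rows):
    total = 0.0
    for price, qty in rows:          # one interpreter step per tuple
        total += price * qty
    return total

def sum_vectorized(rows, batch_size=2):
    prices = [r[0] for r in rows]    # columnar layout
    qtys = [r[1] for r in rows]
    total = 0.0
    for i in range(0, len(rows), batch_size):
        batch_p = prices[i:i + batch_size]   # operate on a whole batch
        batch_q = qtys[i:i + batch_size]
        total += sum(p * q for p, q in zip(batch_p, batch_q))
    return total

print(sum_row_at_a_time(rows))  # 43.0
print(sum_vectorized(rows))     # 43.0
```

Both produce the same answer; the win in a real engine comes from amortizing per-tuple interpretation overhead across each batch and letting tight per-column loops exploit SIMD.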

Last but not least, the database must better support OLAP-oriented data development and modeling capabilities.

Generally, a data warehouse has multiple layers to manage different types of data, such as detailed data, service data, and application data. To support real-time analytics, an HTAP database must support efficient and easy-to-use materialized views, external tables, and fast data import, and be compatible with various data development tools and business intelligence tools.
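The materialized-view requirement above boils down to keeping a precomputed aggregate in sync with base-table writes, so analytical reads never rescan the base table. The following is a minimal sketch with an invented schema (a per-region `SUM(amount)` view), not any particular database’s view-maintenance machinery.

```python
# Sketch of incremental materialized-view maintenance: apply the delta
# of each write to the aggregate instead of recomputing it. Table and
# view names are hypothetical.

from collections import defaultdict

base_table = []                         # (region, amount) facts
mv_sales_by_region = defaultdict(float) # materialized SUM(amount) GROUP BY region

def insert_order(region, amount):
    base_table.append((region, amount))
    mv_sales_by_region[region] += amount   # delta-apply, O(1) per write

insert_order("EU", 100.0)
insert_order("EU", 50.0)
insert_order("US", 70.0)
print(mv_sales_by_region["EU"])  # 150.0
print(mv_sales_by_region["US"])  # 70.0
```

Synchronous maintenance like this keeps the view always fresh at a small cost per write; real systems also offer asynchronous refresh when write throughput matters more than view freshness.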

I have been engaged in the research and development of database technologies for more than ten years. What we want to do is to provide HTAP in a distributed architecture, thereby further expanding the application boundaries of HTAP.

HTAP is not a one-size-fits-all solution. It is more suitable for a business line that needs to deal with both OLTP and real-time OLAP tasks. I believe that an ultimate HTAP solution must be a cost-effective one that uses “the same set of data of the same system” for both transaction processing and real-time analytics. Replicas of the same set of data can be stored in different forms, in row-based and column-based storage, and used for different workloads.

About the author

Charlie Yang, OceanBase CTO

Charlie joined OceanBase in 2010 as one of its founding members. He led all major tasks of OceanBase architecture design and technology upgrades and was instrumental in the adoption of OceanBase Database across businesses of Ant Group. Under his leadership, the OceanBase team took the TPC-C benchmark test and broke the world record twice. He is also the author of Large-scale Distributed Storage Systems: Theory Analysis and Practical Framework. Charlie aims to lead the OceanBase team in making the next-generation enterprise-level distributed database more open, flexible, and efficient, and easy to use.
