The Story Behind OceanBase's Integrated Architecture: What Has Been Done and Why?

When OceanBase was founded 14 years ago, there were many popular open-source databases in the industry, such as MySQL. As an open-source relational database, MySQL has been widely adopted by individual developers and businesses because of its fast performance, user-friendly interfaces, active community, and quick support, making it one of the most popular databases for web applications and services. However, MySQL doesn't perform well when processing complex queries, and has difficulty in scaling and handling large data volumes.

Therefore, OceanBase decided to use an integrated approach to solve the challenges of stand-alone and distributed databases. It can handle not only small queries, but also large queries, and has been continuously innovating and upgrading for more scenarios, from Key-Value to TP/AP integration, and from simple queries to complex queries.

Since 2010, OceanBase has focused on OLTP (Online Transaction Processing) scenarios, gradually realizing several integrations: TP and AP integration, cloud and on-premises integration, and stand-alone and distributed integration. Focusing on mission-critical workloads of the core systems, from 1.x to 4.x, OceanBase has been enhancing its capabilities in terms of stability, high performance, high compatibility, and cost effectiveness.


What do we mean when we talk about integration?

Guided by the idea of integration, OceanBase has made significant technological innovations, gradually incorporating key distributed technologies into the database.

For example, by introducing Paxos, OceanBase achieves an RPO of 0 and an RTO of less than 8 seconds. By enabling the "five IDCs across three regions" deployment architecture, OceanBase achieves IDC-level and region-level disaster recovery. By proposing the stand-alone and distributed integrated architecture and introducing the LSM-Tree storage engine into the database, OceanBase significantly reduces storage costs.
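As a rough illustration of why a Paxos group spread over five IDCs in three regions can tolerate the loss of one IDC or one whole region, the sketch below checks whether a majority of replicas survives each failure. The 2-2-1 placement, the IDC and region names, and the helper functions are assumptions made for this example, not OceanBase configuration or APIs.

```python
# Hypothetical placement: one replica per IDC, five IDCs over three regions (2-2-1).
PLACEMENT = {
    "idc1": "region_a",
    "idc2": "region_a",
    "idc3": "region_b",
    "idc4": "region_b",
    "idc5": "region_c",
}


def majority(total: int) -> int:
    """Smallest number of replicas that forms a Paxos majority."""
    return total // 2 + 1


def region_idcs(region: str) -> set:
    """All IDCs located in the given region."""
    return {idc for idc, r in PLACEMENT.items() if r == region}


def survives(failed_idcs: set) -> bool:
    """True if the replicas outside failed_idcs still form a majority."""
    alive = len(PLACEMENT) - len(failed_idcs & set(PLACEMENT))
    return alive >= majority(len(PLACEMENT))


if __name__ == "__main__":
    for idc in PLACEMENT:
        print(f"lose {idc}: majority intact = {survives({idc})}")
    for region in sorted(set(PLACEMENT.values())):
        print(f"lose {region}: majority intact = {survives(region_idcs(region))}")
```

With this placement, no single region holds more than two of the five replicas, so at least three replicas (a majority) remain after any single-IDC or single-region failure. Losing two regions at once would break the majority.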

OceanBase's architecture can be summarized as three levels of "integration" design:

Firstly, the integration of stand-alone and distributed architectures. The stand-alone and distributed integrated architecture, released in 2022, is the foundation of OceanBase and addresses the issue of data scalability. With this architecture, data of any size can be processed with a single system.

Secondly, the integration of engines. On top of the integrated architecture, OceanBase further solves the issues of data storage and computation with integrated engines, including integrated storage engines, integrated transactions, integrated SQL engines, and, in the future, storage-compute-separated engines.

Finally, the integration of products. A single database product provides multiple data services to customers and covers 80% of their demands, supporting multiple workloads, multiple data models, and multiple data interfaces.

In contrast to relational data, semi-structured data such as JSON and XML can be more flexible, which is essential for some complex applications. The demand for handling diverse data types such as GIS and KV is increasingly evident. Meanwhile, databases have evolved from OLTP to OLAP, and now HTAP. Companies typically use different databases for different problems, leading to an increasing complexity in data management.

OceanBase has always focused on OLTP scenarios, gradually enhancing core capabilities that meet modern data architecture needs, such as multi-model, multi-tenant, multi-workload, and multi-infrastructure. With all these integrated features, OceanBase simplifies the data processing and management for the users.

●  Multi-tenant: Unified technology stack and simplified database infrastructure. OceanBase provides multi-tenant and resource isolation capabilities, ensuring that multiple database instances can be integrated into one cluster. From the user's perspective, they no longer need to worry about resource isolation issues. Moreover, the unified technology stack greatly simplifies the database infrastructure and significantly improves system utilization.

●  Multi-workload: OceanBase delivers strong performance and cost-effectiveness for all workloads on one set of data. It supports real-time analysis on top of a high-performance OLTP foundation, so users avoid the complexity of ETL while data stays consistent.

●  Stand-alone and distributed integrated architecture: A single database that meets the demands of businesses of all sizes. OceanBase 4.0 allows users to smoothly scale databases from a small-scale stand-alone deployment to a large-scale distributed cluster, or scale down from a distributed cluster to stand-alone deployment.

●  Multi-model: A single database that supports multiple data types. Starting from 4.0, OceanBase provides support for multiple data types through OBKV. Relational data, JSON, Key-Value, GIS, and other non-relational data can all be processed effectively within the same database, reducing application development difficulty and operational complexity.

●  Multi-infrastructure: From on-premises to cloud, choose the infrastructure as needed. Starting from 4.0, OceanBase supports multiple infrastructures such as local data centers, Alibaba Cloud, AWS, and cross-infrastructure deployment, significantly reducing complexity through consistent architecture and management.


From integrated architecture to integrated engines

With the release of OceanBase 4.0, the industry's first stand-alone and distributed integrated database, it is possible to achieve scalability for both large and small enterprises, as well as startups. Based on this architecture, we can further introduce integrated storage engines, integrated SQL engines, and storage-compute-separated engines.


Integrated storage engines

HTAP is a hot topic in the industry. It is widely known that OLTP requires row storage, while OLAP requires columnar storage. Generally, there are two approaches to designing a storage engine that can integrate row storage and columnar storage into a single system.

The first approach follows from OceanBase's shared-nothing architecture: every replica uses the same storage format, either row storage or hybrid row-column storage, and all OLAP and OLTP requests are served directly by the primary replica. This avoids consistency and data latency issues, but without purely columnar storage it is better suited to OLTP plus light OLAP scenarios.

The second approach is to apply different storage formats to different OceanBase replicas. For example, use a row-stored primary replica for OLTP and a column-stored backup replica for OLAP. This approach significantly improves OLAP processing capability, but introduces additional millisecond-level latency and transient data inconsistency between the primary and backup replicas.
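To make the row-versus-column trade-off concrete, here is a toy sketch of the same table held in both layouts. The table contents, column names, and helper functions are invented for illustration and say nothing about OceanBase's actual on-disk format.

```python
# The same three-row table in two layouts (illustrative data only).
ROWS = [
    {"id": 1, "customer": "alice", "amount": 30},
    {"id": 2, "customer": "bob", "amount": 45},
    {"id": 3, "customer": "carol", "amount": 25},
]

# Columnar layout: one contiguous list per column.
COLUMNS = {name: [row[name] for row in ROWS] for name in ROWS[0]}


def point_lookup(order_id: int) -> dict:
    """OLTP-style access: fetch every field of one row.

    In the row layout the whole record sits together, so one read returns it.
    """
    for row in ROWS:
        if row["id"] == order_id:
            return row
    raise KeyError(order_id)


def total_amount() -> int:
    """OLAP-style access: aggregate one column over all rows.

    In the columnar layout only the 'amount' values are touched.
    """
    return sum(COLUMNS["amount"])


if __name__ == "__main__":
    print(point_lookup(2))
    print(total_amount())
```

The point lookup benefits from having all fields of a record adjacent, while the aggregate benefits from scanning a single column without reading the others. That is the tension both approaches above try to resolve.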


Integrated SQL engines

When it comes to mixed workloads, it's essential to consider both simple and complex queries. For simple queries, users are mostly concerned with query latency, so the best approach is serial execution, where the SQL layer pulls data from the storage layer. For complex queries involving larger datasets, users focus on whether the database can leverage its parallel processing capabilities. Therefore, the optimal approach for complex queries is parallel execution, with the SQL layer pushing the execution plan down to the storage layer to reduce data transfer overhead.

Today, OceanBase's integrated SQL engine employs a push-pull hybrid approach: pulling data for simple queries and pushing execution plans for complex ones. This effectively integrates both simple and complex queries into a unified system. Additionally, OceanBase supports a very useful feature called Auto DOP (Degree of Parallelism), where the optimizer can automatically determine whether to use serial execution or parallel execution, and the specific level of concurrency based on statistical information.
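The decision described above can be sketched roughly as follows. This is a simplified illustration, not OceanBase's optimizer: the thresholds, the `Plan` type, and `choose_plan` are hypothetical, and a real optimizer would weigh far more statistics than a single row estimate.

```python
from dataclasses import dataclass

# Hypothetical tuning knobs, not OceanBase parameters.
SERIAL_ROW_LIMIT = 10_000      # at or below this, run serially (pull model)
ROWS_PER_WORKER = 1_000_000    # target rows per parallel worker
MAX_DOP = 16                   # cap on the degree of parallelism


@dataclass
class Plan:
    mode: str   # "pull" for serial execution, "push" for parallel execution
    dop: int    # degree of parallelism


def choose_plan(estimated_rows: int) -> Plan:
    """Pick serial or parallel execution from a row-count estimate."""
    if estimated_rows <= SERIAL_ROW_LIMIT:
        return Plan(mode="pull", dop=1)
    # Ceiling division, then clamp the result to the range [2, MAX_DOP].
    workers = -(-estimated_rows // ROWS_PER_WORKER)
    return Plan(mode="push", dop=max(2, min(MAX_DOP, workers)))


if __name__ == "__main__":
    for rows in (100, 50_000, 5_000_000, 10**9):
        print(rows, choose_plan(rows))
```

Small queries stay on the low-latency serial path, while larger ones get a degree of parallelism that grows with the estimated data volume up to a fixed cap.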

Moreover, OceanBase's resource isolation feature further ensures that OLTP and OLAP workloads won't be affected by each other.


Storage-compute-separated engines

Deploying OceanBase to multi-cloud infrastructure presents a technical challenge, because OceanBase is based on a shared-nothing architecture while cloud infrastructure generally adopts a shared-storage architecture.

Fortunately, OceanBase's LSM-Tree storage engine divides data into baseline and incremental data and ensures that the baseline data is identical across multiple replicas. When deploying multiple replicas to multi-cloud infrastructure, it is possible for the baseline data of multiple replicas to share the same storage, thus reducing storage costs and computational expenses. Moreover, OceanBase can further reduce computational costs through log replicas or arbitration replicas, achieving near-zero RPO and elastic scalability in the cloud, and integrating shared-nothing and shared-storage architectures seamlessly.
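A back-of-the-envelope sketch shows where the storage saving comes from. The functions and the sizes used below are made-up assumptions for illustration, and the model ignores any redundancy that the shared storage service applies internally.

```python
def shared_nothing_storage(replicas: int, baseline_gb: float, incremental_gb: float) -> float:
    """Every replica keeps its own copy of baseline and incremental data."""
    return replicas * (baseline_gb + incremental_gb)


def shared_baseline_storage(replicas: int, baseline_gb: float, incremental_gb: float) -> float:
    """Baseline data is stored once; each replica keeps only its incremental data."""
    return baseline_gb + replicas * incremental_gb


if __name__ == "__main__":
    # Illustrative sizes: 3 replicas, 900 GB baseline, 100 GB incremental.
    print(shared_nothing_storage(3, 900, 100))
    print(shared_baseline_storage(3, 900, 100))
```

Because baseline data usually dominates the total volume in an LSM-Tree engine, storing it once rather than once per replica removes most of the duplicated footprint in this simple model.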

With the integrated engine in place, it is possible to further build integrated products.


What has been done to build the integrated products?

Over the past few years, OceanBase has done a lot in terms of integrated products. Here are some examples:

Integrated SQL Functionality

The goal of the integrated database is to achieve SQL functionality in a distributed architecture that is on par with stand-alone databases. However, certain SQL functionalities are very difficult to implement in a distributed architecture.

For example, large transactions with a large number of participants and partitions are almost impossible to complete in a distributed architecture. Another example is table locks, where locking a table in a distributed system with a large number of partitions is also almost impossible to achieve. The fundamental problem is that distributed databases generally have an independent log stream for each partition, so the complexity of operations like large transactions and table locks will be proportional to the number of partitions.

With OceanBase's stand-alone and distributed integrated architecture and dynamic log stream technology, all partitions on a server are dynamically integrated into a single log stream, so the complexity of operations like large transactions and table locks is only proportional to the number of servers, not the number of partitions. In 4.2, OceanBase achieved no restrictions on the size of transactions and full DDL functionality, including table locks.
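The difference in scaling can be sketched by counting how many log streams a transaction touches under each design. The partition layout and function names below are hypothetical, and "one log stream per server" is a simplification of the dynamic log stream behavior described above.

```python
# Hypothetical layout: 12 partitions spread round-robin over 3 servers.
PARTITION_TO_SERVER = {f"p{i}": f"server{i % 3}" for i in range(12)}


def participants_per_partition_stream(partitions) -> int:
    """One log stream per partition: each touched partition is a participant."""
    return len(set(partitions))


def participants_per_server_stream(partitions) -> int:
    """One log stream per server: partitions on the same server share a participant."""
    return len({PARTITION_TO_SERVER[p] for p in partitions})


if __name__ == "__main__":
    touched = list(PARTITION_TO_SERVER)  # a transaction touching every partition
    print(participants_per_partition_stream(touched))
    print(participants_per_server_stream(touched))
```

Adding more partitions to the table increases the first count but leaves the second unchanged as long as the number of servers stays the same, which is why large transactions and table locks become tractable.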


Multi-model Integration

OceanBase supports multiple models within a single database product and achieves mutual operations between different models.

For example, many HBase users are migrating their business to OceanBase. They can use HBase-compatible interfaces for writing and SQL for reading.

OceanBase 4.0 provides multi-model capabilities through OBKV, simplifying application development and reducing operational complexity.


Integrated Product Family

In terms of tools, we aim to upgrade the core capabilities of tool products in mission-critical workload scenarios to be compatible with more key ecosystems. Whether it's ODC, OMS, Binlog, OCP, or OAS, there have been significant upgrades for key business scenarios.

●  ODC is dedicated to creating an enterprise-level collaborative development platform that builds security and compliance processes into the workflow of database developers, so that every change operation is traceable and can be rolled back.

●  OMS provides bi-directional synchronization and one-click rollback, allowing the old and new systems to run in parallel over the long term and ensuring a one-click rollback to the old system if any problems arise with the new one.

●  OCP has further enhanced the diagnostic monitoring capabilities of OceanBase and supports comprehensive control in various scenarios. Additionally, OAS provides self-service to customers, evolving the product based on years of operational experience from Ant Group and Alipay.

●  In addition, OceanBase continues to improve compatibility, providing support for MySQL Binlog to help users integrate their databases into the MySQL ecosystem more conveniently.


Prospects for the Integrated Database

OceanBase has always focused on two topics: better performance and lower costs. With the integrated products, OceanBase is pursuing the best performance in a distributed architecture. Currently, OceanBase's stand-alone performance has reached or even exceeded that of MySQL, and has significant advantages in deployment costs, hardware costs, and migration and learning costs:

●  Stand-alone deployment costs: Under the same hardware conditions, stand-alone OceanBase's SQL and transaction processing performance is comparable to MySQL's, at one third of the storage cost in some scenarios.

●  Hardware costs brought by vertical/linear expansion: As the business scales up, whether in the cloud or on-premises, upgrading hardware is necessary to improve database performance. In this process, compared to the non-linear cost growth of traditional databases, OceanBase's horizontal expansion can achieve true linear growth in performance and costs, reducing hardware costs for users.

●  Migration and learning costs: OceanBase has achieved the integration of stand-alone and distributed architecture. This allows users to smoothly scale from stand-alone to distributed deployment without additional migration and learning costs.

