This article was compiled by OceanBase based on an interview with a senior database expert at AXA SPDB.
The data center of the AXA SPDB Investment R&D Department stores 1 TB to 2 TB of data, and the volume of last year's financial securities data from some marketing models that must be reported to regulators reaches around 5 TB. In other words, our data fits in a small data warehouse. As for our data business, calculation tasks involving small and medium amounts of data account for more than 80% of the workload, while big data calculation tasks rarely happen and can largely be avoided through technical means.
For years, our data center ran on a transactional processing (TP) database, a widely used commercial monolithic database, alongside Cloudera Distribution of Hadoop (CDH), a big data platform. CDH excels at data storage and handles large calculations fast, but falls short in most small calculations, ad hoc queries based on online analytical processing (OLAP), and real-time data processing. For example, CDH does not guarantee the atomicity, consistency, isolation, and durability (ACID) of transactions and supports only limited SQL DML statements, which prevents effective data writes. In addition, CDH must push its result tables back to the TP database for data applications to consume. We synchronized the manually maintained data and the data of the source systems to CDH at T+1 frequency; CDH integrated the data for calculations and then pushed the results back to the TP database, where the result data could be used by DataAPI, report, and other data query applications. Resources were wasted as the data was pushed back and forth between the TP database and CDH.
However, the TP database is not scalable and cannot handle large OLAP tasks, while CDH does not support fast report queries or real-time data services unless real-time analytical components such as Impala or Presto are added, which was not an option in our plan. So, we started researching distributed databases, hoping to support both online transaction processing (OLTP) and OLAP in one system.
We researched OceanBase Database and TiDB, two distributed databases capable of hybrid transactional and analytical processing (HTAP).
To achieve both OLTP and OLAP in TiDB, we would have to build TiFlash replicas. In comparison, the integrated architecture of OceanBase Database is easier to operate and maintain and more user-friendly. So, we did not test or verify TiDB in a production environment.
Finally, we comprehensively compared the features of Hive, the TP database, and OceanBase Database in our production environment.
Table 1: Comparison between OceanBase Database and the previous tech stack
Based on the comparison, we took a deeper look at OceanBase Database, with particular interest in the following capabilities.
OceanBase Migration Service (OMS)
We have asked for OMS to support data synchronization from our previous TP database. Once that is supported, we can replace DataX and other extract-transform-load (ETL) tools with OMS for data migration.
OceanBase Developer Center (ODC)
ODC is a great alternative to tools such as CDH Hue and DBeaver for data access, and it delivers a user experience on par with that of the previous vendor. ODC also supports web clients and will soon support Nginx-based request forwarding, so that we can try removing the front-end machine from our system.
OceanBase Control Platform (OCP)
OCP provides user-friendly permission control features and makes resource and tenant management easy to pick up. It allows us to push forward the long-suspended lab project.
After the technical solution was finalized, we conducted production-level tests on the features, performance, and compatibility of OceanBase Database.
We first verified its features, mainly by examining the completeness of its SQL syntax and functions.
Compared with Hive, OceanBase Database is fully compatible with MySQL syntax and supports a large number of window functions, which is great for our business. Most scripts migrated from Hive can run directly in MySQL tenants of OceanBase Database, so only a small part of the code needed modification. Over 90% of the code was migrated from the data center of our Investment R&D Department to OceanBase Database with zero modifications. The remaining code, less than 10%, was modified mainly because of differences in functions. For example, OceanBase Database does not support covariance functions, so we rewrote that logic ourselves.
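As an illustration, covariance can be expanded into plain aggregates that the MySQL tenant does support. The following is a minimal sketch of one such rewrite, assuming a hypothetical table t with numeric columns x and y; it is not a built-in feature of OceanBase Database.

```sql
-- Hypothetical table t(x, y). COVAR_POP(x, y) = E[xy] - E[x] * E[y],
-- rewritten with AVG, which the MySQL tenant supports:
SELECT AVG(x * y) - AVG(x) * AVG(y) AS covar_pop
FROM t;

-- Sample covariance divides by (n - 1) instead of n:
SELECT (SUM(x * y) - SUM(x) * SUM(y) / COUNT(*)) / (COUNT(*) - 1) AS covar_samp
FROM t;
```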
In the performance test, after we migrated a large number of small scripts to OceanBase Database, the overall runtime was about 30% shorter than on CDH. For small scripts, OceanBase Database holds little advantage over the previous TP database, but its distributed HTAP capabilities allow us to run a large number of concurrent OLAP tasks. In addition, data occupies significantly less storage space in OceanBase Database than in the previous TP database: about 8 TB of data in the previous TP database takes up roughly 2.4 TB in OceanBase Database.
As for the compatibility test, we connected to OceanBase Database using MySQL drivers, and it worked seamlessly with our ETL and scheduling tools, data synchronization tools, report applications, and the DataAPI platform, which run in different environments. We ran into a minor issue in the process, which OceanBase technical support resolved by switching the MySQL driver involved to another version.
Our previous system architecture consisted of a TP database and CDH, so we had to migrate data from both to OceanBase Database. We learned some lessons during the migration.
First, when migrating data from CDH to OceanBase Database, we batch-generated the CREATE TABLE statements from Hive metadata.
We also converted the STRING data type to VARCHAR(200) by default; some special or extra-long fields were handled manually.
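The sketch below shows the shape of that conversion, with a hypothetical table name and columns; the Hive source DDL appears in comments above the generated statement.

```sql
-- Hive source DDL (hypothetical example):
--   CREATE TABLE trade_detail (
--     trade_id   STRING,
--     trade_date STRING,
--     amount     DOUBLE
--   ) STORED AS ORC;

-- Generated DDL for the OceanBase MySQL tenant: STRING maps to VARCHAR(200)
-- by default; special or extra-long fields are adjusted by hand afterwards.
CREATE TABLE trade_detail (
  trade_id   VARCHAR(200),
  trade_date VARCHAR(200),
  amount     DOUBLE
);
```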
Because OceanBase Database limits the number of partitions on each node, we do not recommend creating as many partitions as you would in a Hive data warehouse. Instead, we partitioned only particularly large tables: we created indexes on tables storing no more than 10 million rows of account records to accelerate queries, and partitioned larger tables by date and data source.
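As an illustration, here is a minimal sketch with hypothetical table and column names: a secondary index for a small account-record table, and list-range composite partitioning by data source and date for a large table. The partition names and bounds are placeholders.

```sql
-- Small table (under 10 million rows): a secondary index is enough.
CREATE INDEX idx_account_id ON account_record (account_id);

-- Large table: partition by data source (LIST) and subpartition by date (RANGE),
-- using a subpartition template. Names and bounds are illustrative.
CREATE TABLE position_detail (
  src_system VARCHAR(32),
  biz_date   DATE,
  account_id VARCHAR(200),
  balance    DECIMAL(20, 4)
)
PARTITION BY LIST COLUMNS (src_system)
SUBPARTITION BY RANGE COLUMNS (biz_date)
SUBPARTITION TEMPLATE (
  SUBPARTITION sp2022 VALUES LESS THAN ('2023-01-01'),
  SUBPARTITION spmax  VALUES LESS THAN (MAXVALUE)
)
(
  PARTITION p_core VALUES IN ('core'),
  PARTITION p_risk VALUES IN ('risk')
);
```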
OceanBase Database does not support the LOAD DATA LOCAL mode, which complicated the migration of our business data out of the Hadoop Distributed File System (HDFS). So we batch-exported the ORC-format tables to TXT files by running the hive -e command and then used OBLOADER to import the data into OceanBase Database.
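In practice, each export is a HiveQL statement run through hive -e, roughly like the sketch below; the path, delimiter, and table name are illustrative. OBLOADER then imports the resulting text files into the corresponding OceanBase tables.

```sql
-- Run via: hive -e "<the statement below>"
-- Dumps the ORC-backed Hive table to tab-delimited text files that
-- OBLOADER can import into OceanBase Database.
INSERT OVERWRITE LOCAL DIRECTORY '/data/export/trade_detail'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM dw.trade_detail;
```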
Second, the data in our previous TP database was migrated to a MySQL tenant of OceanBase Database in offline mode. The main challenge lay in the differences in syntax and SQL functions, so we adapted the previous query syntax to the MySQL tenant where necessary.
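For example, if the previous TP database used Oracle-style syntax (an assumption for illustration; the vendor is not named here, and typical rewrites vary by vendor), an adaptation might look like this:

```sql
-- Oracle-style original (assumed):
--   SELECT NVL(balance, 0), SYSDATE FROM account_record;
-- MySQL-tenant rewrite:
SELECT IFNULL(balance, 0) AS balance,
       NOW()              AS load_time
FROM account_record;
```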
Third, our initial deployment did not designate a primary replica, or leader. As a result, intensive cross-node data interactions during batch processing tasks led to excessive remote procedure calls (RPCs), which caused errors. This was the biggest challenge in our migration project. We then specified a leader, reduced RPCs in INSERT operations, and made full use of hints, a handy feature for OLAP tenants. Without hints on OLAP tasks, we could not fully exploit the HTAP capabilities of OceanBase Database; after we added hints, the execution time of our SQL scripts in OceanBase Database dropped from 40 minutes to 6 minutes.
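For instance, OceanBase Database accepts optimizer hints such as PARALLEL for parallel queries and ENABLE_PARALLEL_DML for parallel writes. The sketch below shows the pattern on hypothetical tables; the degree of parallelism is illustrative and should be sized to the tenant's CPU resources.

```sql
-- Parallel scan and aggregation for a heavy OLAP query:
SELECT /*+ PARALLEL(8) */ src_system, SUM(balance)
FROM position_detail
GROUP BY src_system;

-- Parallel DML to cut down RPCs in a large INSERT ... SELECT:
INSERT /*+ ENABLE_PARALLEL_DML PARALLEL(8) */ INTO position_summary
SELECT src_system, biz_date, SUM(balance)
FROM position_detail
GROUP BY src_system, biz_date;
```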
Last but not least, OceanBase Database works in OLTP mode by default, and we had to adjust many parameters for it to handle OLAP tasks efficiently. We hope OceanBase can improve the OCP features for tenants in different modes so that users can quickly configure their OLAP tenants.
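As an illustration, a few tenant-level variables commonly adjusted for analytical workloads are shown below. The specific parameters we changed are not listed in this article, and the values here are placeholders that must be sized to the actual hardware and workload.

```sql
-- Allow more parallel execution workers in the tenant:
SET GLOBAL parallel_servers_target = 64;
-- Raise the query timeout (in microseconds) so long analytical scans are not killed:
SET GLOBAL ob_query_timeout = 3600000000;
-- Give sorts and hash joins a larger share of tenant memory:
SET GLOBAL ob_sql_work_area_percentage = 30;
```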
Now we own a database with an integrated architecture, a well-defined O&M solution, and a wealth of ecosystem tools. OceanBase Database ticks almost all the boxes we look for in a database. With a much lighter O&M workload, we can pay more attention to the value of the data itself, and with the help of the OceanBase ecosystem, we have built more valuable data applications.
Thanks to the HTAP capabilities of OceanBase Database, we can manage our manually recorded data, warehoused data, and some application data in its integrated architecture, which not only saves the resources previously spent on real-time synchronization but also allows parameters manually modified in the business system to take effect across business modules in real time. Also, the browser/server architecture and clear permission management mechanism of Web ODC allow us to share model data, the latest information, and part of OceanBase Database's computing resources directly with members of various business lines, which greatly improves the efficiency of data usage.
OceanBase Database is still new and needs to provide more scenario-based solutions. For example, our data service scenario requires many capabilities, such as data synchronization, data processing, data scheduling, metadata management, data storage, and data desensitization. To meet the requirements of this scenario, OceanBase could offer an exemplary technical framework that integrates all of the above features.
In addition, we look forward to more OLAP features in OceanBase Database, such as external tables, materialized views, and cross-tenant data access. We also hope that ODC will support connections to data sources such as Hive, TP databases, and other data warehouse tools or databases.
Feel free to drop a message below to join the discussion about this case study!