Author: Technical development department of AXA SPDB Investment Managers Co., Ltd. (AXA SPDB)
AXA SPDB Investment Managers Co., Ltd. (AXA SPDB), a Sino-French joint venture fund company established in 2007, requires easy-to-use and efficient business data management tools that deliver high storage performance, ensure quick response, and support real-time data processing. As the original Oracle + Cloudera Distributed Hadoop (CDH) solution could not meet its expectations, AXA SPDB sought to optimize its database architecture and evaluated OceanBase Database, TiDB, Hive, and Oracle. After comprehensive comparisons and tests, the company decided on OceanBase Database. In this article, the technical development department of AXA SPDB shared their experience in database selection.
The data center of AXA SPDB Investment R&D Department stores 1 TB to 2 TB of data, and the amount of last year's financial security data of some marketing models that need to be reported to supervision departments reaches around 5 TB. In other words, our data can be accommodated in a small data warehouse. As for our data business, calculation tasks involving small and medium amounts of data account for more than 80%, and big data calculation tasks, which rarely happen, can be technically prevented to some extent.
For years, we have managed our Oracle database with CDH, a data center management tool. CDH excels in data storage and processes large calculations fast, but has disadvantages in most of small calculations, instant queries based on online analytical processing (OLAP), and real-time data processing. For example, CDH does not support the atomicity, consistency, isolation, or durability (ACID) of transactions and supports only limited DML SQL statements, which prevents effective data writes. In addition, CDH must push its result tables back to Oracle for data application maintenance. Then, we synchronize the manually maintained data and the data of the source system to CDH at T+1 frequency. CDH integrates the data for calculations and then pushes the results back to Oracle. The result data can then be used by DataApi, report, and other data query applications. Resources are wasted as the data is pushed back and forth between Oracle and CDH.
However, the Oracle database is not scalable and cannot handle large OLAP tasks, and CDH does not support fast report query or real-time data service unless Impala, Presto, or other real-time analytical components are provided, which is not an option in our plan. So, we started researching distributed databases, hoping to support both online transaction processing (OLTP) and OLAP in one system.
We did our research on OceanBase Database and TiDB, two distributed databases that are capable of hybrid transaction and analytical processing (HTAP).
To achieve both OLTP and OLAP in a TiDB database, we have to build TiFlash replicas. In comparison, the integrated architecture of OceanBase Database is easier to operate and maintain and more user-friendly. So, we didn't test or verify TiDB in a production environment.
We finally comprehensively compared features of Hive, Oracle, and OceanBase Database based on our production environment.
Based on the comparison, we dived deeper into OceanBase Database and looked forward to its following powerful capabilities:
After the technical solution was finalized, we conducted production-level tests on the features, performance, and compatibility of OceanBase Database.
We first verified its features mainly by examining the completeness of its SQL syntax and functions.
Compared to Hive, OceanBase Database is fully compatible with MySQL syntax and supports a large number of window functions, which is great to our business. Most scripts migrated from Hive can run directly in MySQL tenants of OceanBase Database. That is, we need to modify only a little part of the code. Over 90% of the code was migrated from the data center of our Investment R&D Department to OceanBase Database with zero modifications. The remaining less than 10% of the code was modified mainly due to the difference in functions. For example, OceanBase Database does not support covariance functions and we wrote new code by ourselves.
Compared to Oracle, MySQL tenants of OceanBase Database show more differences. In this case, you need to use OceanBase Database Enterprise Edition and create Oracle tenants to take care of the Oracle-specific functions and special dialects.
When it comes to the performance test, as we migrated a large number of small scripts to OceanBase Database, the overall runtime was about 30% less than that of CDH. Small scripts are not much of an advantage of OceanBase Database over Oracle. However, the distributed HTAP capabilities of OceanBase Database allow us to run a large number of concurrent OLAP tasks. In addition, the data storage space occupied in OceanBase Database is significantly lower than that in Oracle. Our data of about 8 TB in the Oracle database occupies roughly 2.4 TB in OceanBase Database.
As for the compatibility test, we connected to OceanBase Database by using MySQL drivers and it worked seamlessly with our ETL and scheduling tools, data synchronization tools, report applications, and the DataAPI platform, which operated in different environments. We ran into a little problem in that process and OceanBase Technical Support solved it by replacing the involved MySQL driver with another version.
Our original system architecture consisted of Oracle and CDH. We must migrate data from Oracle and CDH instances to OceanBase Database. Now, let's share the lessons we learned in the migration process.
First, when migrating data from CDH to OceanBase Database, we batch processed the CREATE TABLE statements by using Hive metadata.
OceanBase Database does not support the LOAD DATA LOCAL mode, which brought inconvenience to our business migration from a Hadoop distributed file system (HDFS). So, we had to batch export data from tables in the ORC format to TXT files by running the hive -e command, and then used obloader to import data to OceanBase Database.
Second, the data in our Oracle database was migrated to a MySQL tenant of OceanBase Database in offline mode. The challenge mainly lies in the differences between the Oracle and MySQL syntaxes and SQL functions. So, we adapted the Oracle-based query syntax to the new MySQL tenant where necessary.
Third, our original solution did not involve a primary replica, or leader. As a result, intensive cross-node data interactions during the batch processing task led to excessive remote procedure calls (RPCs), which caused errors. This was the biggest challenge in our migration project. Then, we specified a leader, reduced RPCs in INSERT operations, and made full use of hints, a handy feature for OLAP tenants. If we didn't use hints to handle OLAP tasks, we couldn't experience the powerful HTAP capabilities of OceanBase Database. To be specific, after we added hints, the execution time of SQL scripts in OceanBase Database was stunningly reduced from 40 minutes to 6 minutes.
Last but bot least, OceanBase Database works in OLTP mode by default and we must adjust many parameters for it to handle OLAP tasks more efficiently. So, we hope OceanBase can improve the features of OCP for tenants in different modes, so that users can quickly configure their OLAP tenants.
Now, we own a database with an integrated architecture, a well-defined O&M solution, and a wealth of support tools. OceanBase Database almost ticks all the boxes that we are looking for in a database. With much less O&M workload, we are able to pay more attention to the value of data itself. And with the help of the OceanBase ecosystem, we have created more valuable data applications.
Thanks to the HTAP capabilities of OceanBase Database, we can manage our manually recorded data, warehoused data, and some application data in the integrated architecture of OceanBase Database, which not only saves the resources for real-time synchronization, but also allows parameters that are manually modified in the business system to take effect on various business modules in real time. Also, the Browser/Server architecture and the clear permission management mechanism of Web ODC allow us to share model data, the latest information, and part of OceanBase Database's computing resources directly to members of various business lines, which greatly improves the efficiency of data usage.
We would like to extend our thanks to OceanBase community members for their help in our database selection, data migration, and business switchover to the new system. A few humble suggestions:
OceanBase Database is new and needs to provide more scenario-based solutions. For example, our data service scenario requires many database features, such as data synchronization, data processing, data scheduling, metadata management, data storage, and data desensitization. To meet requirements in this scenario, OceanBase Database can offer an exemplar technical framework solution that integrates all the features mentioned above.
In addition, we are looking forward to more OLAP features of OceanBase Database, such as external tables, materialized views, and cross-tenant data access. We also hope that ODC supports connections from data sources such as Hive, Oracle, MySQL, and other data warehouse tools or databases.