Blog编组 28
6 Questions to Ask about Distributed SQL Database Operation & Maintenance

6 Questions to Ask about Distributed SQL Database Operation & Maintenance

右侧logo

oceanbase database

In December 2022, an OceanBase expert Bao Lei and an Oracle ACE Huangfu Xiaofei talked online about Operation & Maintenance (O&M) in distributed SQL databases.


They discussed the following six concerns that database O&M engineers are concerned with:


How do O&M in distributed SQL databases differ from that in centralized databases?

What are the challenges in O&M in distributed SQL databases?

What are the skills that O&M teams must own?

What are the key points to effectively managing a large database cluster?

How to guarantee and enhance the overall data monitoring from the database kernel to middleware, deployment architecture, and business?

How to fast scale a distributed database to handle extended business links and changing business scenarios?

This article is a summary based on their conversation.


1. What are the differences in O&M between a conventional centralized architecture and a distributed architecture?

Generally speaking, there are three differences:


(1) Hardware environment.


The load capacity of a conventional centralized architecture depends largely on the hardware capacity or the capacity of its technical environment. So, many users are moving their business from midrange servers to x86 ones, which, to be honest, also require quite high hardware configurations. A distributed architecture, on the contrary, can be deployed in an average PC hardware environment and guarantees data reliability by using the distributed consistency protocol. Such an architectural difference leads to less-demanding requirements for the technical environment. Users are therefore able to make effective use of their existing servers to reduce costs.


(2) Database scalability.


Many users are migrating their business from conventional Oracle databases to others or upgrading to distributed platforms. How come? Because when the data volume grows to a certain level, the memory, CPU, or storage of a midrange server cannot be scaled up to handle business peaks. Yes, you can add more nodes, but the number is limited for most centralized architectures. In a scaled Oracle database, all the nodes use the same set of data. In comparison, a distributed database does not have strict restrictions on the number of compute and storage nodes. When resources of a distributed database become insufficient to support your business, you can scale it out as needed.


(3) O&M costs.


Users, especially decision-makers, will consider the costs for sure. Distributed databases are more affordable than centralized ones based on a comprehensive comparison of their basic software costs and software service fees.


2. What are the new challenges in the O&M management of distributed databases?

Compared to conventional databases, distributed SQL databases face the following new challenges in terms of O&M:


(1) Data security.


This still concerns users the most and always comes first regardless of what database you use. Like a conventional Oracle database, the key goals of data security for a distributed database remain unchanged: keep the sensitive data confidential; protect the core data from breaches; and make sure that O&M operations are auditable. To be specific, any paths to access a database, whether from applications or clients and certain operations on the database, especially privileged operations, must be auditable to ensure security. Of course, this topic is broad and involves other issues that users are concerned about, such as user management and security reinforcement.


(2) Availability.


When a database fails, you must recover it quickly to meet industry standards and regulatory requirements. Customers want to reduce the risk of a server crash and minimize the impact on their business. For O&M engineers and database administrators (DBAs), they want O&M to be easy so that they can put their hands on the job just by referring to the documentation, or after one trial or two. In addition, replica switchover and recovery should be easy to get started for DBAs.


(3) Database performance.


This challenge is about how to quickly locate and analyze the database performance issues in the production environment. On the one hand, we respond to and deal with any performance issues instantly when they occur. On the other hand, we should be actively prepared, which means that we should have a set of methods in place before launching a system to check and predict possible issues in the system's performance. For example, if we can optimize database development specifications, DBAs can put more effort into development and testing work; if we can provide well-rounded fault handling processes and methodologies, O&M engineers are able to analyze and solve issues with the help of the knowledge base, user manuals, and database O&M tools. The aim of all the above is to increase efficiency.


(4) Performance and capacity assessment.


As the business keeps growing, the business traffic and business type may change. For example, bank transaction process changes or marketing campaigns are scheduled during peak hours. How much business can be held by the database? Can the database withstand traffic pressure with guaranteed business continuity? To answer those questions, O&M engineers must run an assessment of the database capacity. In this case, proper tools in the hands of experienced experts would be the best.


(5) Regulatory compliance and adaptability to business changes, which are especially crucial for the financial sector. As a matter of fact, the business of financial institutions changes every year or even every quarter. The institutions must comply with the requirements of regulatory authorities while responding to the complex and changing financial market.


In addition, customers have defined their O&M duties and regulations based on their previous conventional databases. DBAs, architects, developers, as well as staff members doing jobs with middleware, servers, storage, and network hosts may all have participated in O&M. As distributed databases are used in more scenarios, some customers or industries have to adjust their outdated division of duties. As mentioned before, a distributed database requires the DBA to put more effort into development and testing work. As a result, the performance issues and failure rate will decrease significantly in the production environment.


3. What are the skills that the O&M team of a distributed database should own?

Initially, the O&M team of an enterprise that uses a distributed database should master the principle, architecture, and characteristics of distributed databases. While laying a solid theoretical foundation, the O&M team should also study best practices and success stories as much as reasonably possible.


In addition, the O&M team should be skilled in daily O&M operations such as inspection, backup, performance optimization, troubleshooting, switchover, disaster recovery environment management, as well as data synchronization.


Moreover, the O&M team should have a good command of distributed database O&M tools, which play an important role in enhancing the O&M skills and efficiency, such as OceanBase Control Platform (OCP), the O&M management tool of OceanBase Database.


Note: We will post an article about the features of OCP. Stay tuned!


4. What are the key points for effectively operating and maintaining a large database cluster?

(1) Proper division of O&M duties. This is especially true for financial institutions, whose database O&M involves multiple departments and vendors. Enterprises must properly divide O&M duties based on their respective needs. Some enterprises may assign the O&M duties of databases, middleware, and hosts to one person or group, while others finely divide the duties and assign them to multiple groups based on the product type, operating system, and vendor of databases and middleware. Each division method has its advantages and disadvantages. Enterprises can figure out solutions that best fit their business, and should make changes accordingly if the existing division holds back the O&M of their distributed databases.


(2) Well-defined O&M regulations. The O&M of a database, regardless of its type, should be guided by well-defined regulations, mainly specifications, and workflows, such as O&M specifications, development specifications, and routine O&M workflows. So far, financial institutions in general have done a great job in this regard.


(3) A professional O&M team. An enterprise should build up an O&M team of professionals with expert skills and competencies, that is, someone who can take full control of its distributed database.


5. How to guarantee and enhance the overall data monitoring from the database kernel to middleware, deployment architecture, and business?

Traditionally, financial institutions, especially banks, tend to build their data system in an IOE architecture, which consists of IBM servers, Oracle databases, and EMC storage. Generally, banks use Enterprise Service Bus (ESB) to interconnect their core business and channel systems in a service-oriented architecture (SOA) for the purpose of data exchange, format conversion, and message routing between various business modules. Shocked by the surge of Internet finance (ITFIN) and faced with the popularization of emerging technologies such as cloud computing and cloud-native, traditional financial institutions are moving to transfer their data systems to the cloud-native and distributed architecture. The efficient use of this new architecture requires O&M engineers to not only upgrade their skill sets but also refresh their way of thinking.


Unlike conventional centralized databases, the access paths to a distributed database in a cloud-native architecture are complex and interwoven. For example, if the business needs to be reconstructed into a unit-based architecture, data middleware products will be introduced, which means that the business traffic may flow across multiple virtualized nodes and VPC networks. In that case, any jitter in the traffic link will have an impact on the response time (RT) and throughput performance (transactions per second, TPS) of a sensitive transaction system. However, it is pretty tricky to analyze the issue and find the root cause.


DBAs must update their troubleshooting routine. In the past, when receiving feedback on slow business speed or glitches, the DBA of a centralized database might solve the problem just by logging onto the database and checking for unusual wait events and sessions, without knowing the exact causes or the fault location. As we usher in the cloud-native era, a term is often mentioned: business observability. The reason is that, as we increase the capacity, scalability, and availability of databases in the cloud by using microservices and the unit-based architecture, the overall system complexity is raised, making monitoring work more difficult.


To ensure business observability, you should first design an observing mechanism top-down from business monitoring (core transactions) to application monitoring (service calls, QPS, RT, and garbage collection in the Java virtual machine) and resource monitoring (CPU, memory, and network at the virtual and physical layers). The design covers metrics, tracing, and logging. Based on the observed information, DBAs can finally fast detect, analyze, and solve problems by using troubleshooting methodology and monitoring tools, and then figure out best practices.


Based on the design idea of metrics, tracing, and logging, it is important to pay attention to business monitoring metrics that are measured by second or minute, and those top metrics. For example, you should design such metrics as the total number of transactions, number of successful transactions, number of failed transactions, success rate, and average time consumption. These metrics can then be collectively presented in a dashboard's trend or comparison charts. You can also monitor and collect application logs and monitor application resources, and associate business/error logs with the context trace ID. When a problem occurs in the application, the DBA is able to quickly track down the specific business logs by using the trace ID and analyze and solve the problem in time.


The above-mentioned observability points are about top-down monitoring from applications. Therefore, given the great number of nodes and extended links of a distributed database, DBAs should perform troubleshooting in a perspective of full-link diagnosis, in addition to making use of their conventional O&M skills. An important feature is to associate the business trace ID with the data access layer (data middleware layer). OceanBase has been working to keep optimizing this feature of OBSharding, our middleware product. In a unit-based architecture, if the business trace ID and middleware trace ID are associated, the system can calculate the time consumed at each access stage of a business transaction. Without this feature, we usually need to use tools such as Arthas and TcpDump to help with the analysis based on the conventional stage-by-stage method. In OceanBase Database, the monitoring of database performance relies on the metrics provided by OCP, such as Suspicious SQL. OCP also allows users to monitor tenant exceptions based on metrics such as tenant MemStore, tenant CPU, tenant sessions, tenant thread queue, and locks.


6. How to fast scale a distributed database to handle extended business links and changing business scenarios?

Fast scaling means strong scalability, a feature that is much expected from both centralized and distributed databases. A conventional database is vertically scaled to cope with business growth by such means as increasing the disk capacity and computing resources of a server. For distributed databases, horizontal scaling is a common feature, and unlike their centralized counterparts, we can scale the capacity of a distributed database by adding more nodes.


Horizontal scaling allows a shared-nothing system like OceanBase Database to achieve a favorable linear capacity increase. On top of that, we need to perform modifications to implement a distributed architecture, so as to avoid the negative impact of some distributed transactions. A big advantage of horizontal scaling is that it allows you to scale a distributed system in and out as needed. When the computing resources are running out or a server fails, or when you need to relocate an Internet data center (IDC), you can have the job done with the proper tools. Horizontal scaling improves the efficiency of major promotion campaigns at lower costs.


OceanBase Database achieves system scaling based on load balancing at the partition level, and it supports the upgrade of server hardware configurations without interrupting system operation. For example, you can upgrade the hardware configurations of each OBSever in each zone in rolling mode. For the cluster as a whole, this upgrade mode increases its capacity seamlessly and causes zero interruptions to the business.


During a major promotion campaign, the volume of a bank’s business will be multiplied. The bank must scale its business system in advance. It can evaluate the after-scaling cluster size based on capacity metrics and monitoring metrics from OCP. The scaling can be performed at the zone or tenant level. For each zone, you can perform scaling on the same number of nodes at a time or in rolling mode. For example, you can scale a three-replica cluster to a five-replica cluster, or add servers to each zone. As OceanBase Database supports read/write splitting, you can then route some time-consuming read requests to the new read-only replicas. After the business traffic goes back to normal, you can scale the database online. To sum up, horizontal scaling effectively addresses the vulnerabilities of a conventional database in terms of storage capacity and scaling efficiency.


DBAs can perform these O&M operations on the GUI of OCP. Furthermore, OCP also provides APIs for customers to integrate its features with orchestration tools and orchestrate fast scaling-related sub-tasks into a task stream.

ICON_SHARE
ICON_SHARE