Data subscription is a technology for continuously capturing incremental changes from a data source. At its core, it uses Change Data Capture (CDC) to transmit data changes (inserts, updates, and deletes) to a target system in real time or near real time. In the OceanBase ecosystem, data subscription is primarily used in incremental database migration scenarios to ensure data consistency between the source database and the target OceanBase database.
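The basic idea can be sketched in a few lines: a subscriber receives a stream of change events and replays them against a target store. The event shape below is hypothetical; real CDC tools define their own formats, but they typically carry the operation type, the row key, and the new row image.

```python
# A minimal sketch of how a subscriber applies CDC events to a target store.
# The event format here is hypothetical, for illustration only.

def apply_event(target: dict, event: dict) -> None:
    """Apply one change event (insert/update/delete) to an in-memory target."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]          # upsert the new row image
    elif op == "delete":
        target.pop(key, None)               # remove the row if present
    else:
        raise ValueError(f"unknown operation: {op}")

# Replaying the change stream keeps the target consistent with the source.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "alice"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "delete", "key": 2},
]
target: dict = {}
for e in events:
    apply_event(target, e)
# target now holds only key 1, with the updated row image
```

Note that only the differential events travel over the wire; the full dataset never has to be re-copied, which is what makes incremental migration cheap relative to repeated full loads.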
Scenarios
Data subscription has the following characteristics:
- Low latency synchronization: ensures data consistency between the source and target systems, with latency typically in milliseconds to seconds
- Resource efficiency: only transmits differential data, significantly reducing network bandwidth and storage resource consumption
- Flexible scalability: supports parallel subscriptions to multiple target systems (such as data warehouses, analytics platforms, and caching systems)
- Strong fault tolerance: includes fault recovery and resume-from-breakpoint capabilities, ensuring data is not lost
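The resume-from-breakpoint capability above can be sketched as offset-based checkpointing: the subscriber durably records the position of the last event it applied, so after a failure it resumes from that checkpoint instead of replaying the whole stream. The code below is an illustrative simplification, not any particular tool's implementation.

```python
# Hypothetical sketch of resume-from-breakpoint via an offset checkpoint.

def consume(stream, start_offset, checkpoint):
    """Apply events from start_offset onward, advancing the checkpoint."""
    applied = []
    for offset, event in enumerate(stream):
        if offset < start_offset:
            continue                         # already applied before the crash
        applied.append(event)                # stand-in for writing to the target
        checkpoint["offset"] = offset + 1    # persist after each successful apply
    return applied

stream = ["e0", "e1", "e2", "e3"]
ckpt = {"offset": 0}
first_run = consume(stream, ckpt["offset"], ckpt)      # applies all four events
after_restart = consume(stream, ckpt["offset"], ckpt)  # nothing left to redo
```

In a real deployment the checkpoint would be persisted atomically with (or after) the target write; persisting it before the write risks losing events, persisting it without coordination risks re-applying them.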
Typical scenarios include:
- Real-time data synchronization
  - Business system integration: real-time data synchronization from order systems to BI platforms, supporting real-time reporting and decision analysis
  - Cache updates: real-time synchronization of database changes to caching systems such as Redis and Memcached
  - Search engine synchronization: real-time synchronization of database changes to search engines such as Elasticsearch and Solr
- Data architecture upgrades
  - Database migration: migration from traditional databases to OceanBase with zero-downtime switching
  - Architecture modernization: splitting monolithic databases into distributed architectures for read/write separation
  - Cloud-native transformation: migration of on-premises databases to OceanBase instances in the cloud
- Data governance and analysis
  - Data lake construction: real-time synchronization of business data to data lakes for offline analysis
  - Real-time data warehouse: construction of real-time data warehouses for streaming analysis and machine learning
  - Multi-active architecture: implementation of multi-active databases across regions to improve system availability
Core tools and capabilities comparison
Data subscription in the OceanBase ecosystem primarily involves three core tools: self-developed migration tools (such as OMS), external migration tools (such as Flink CDC and DataX), and message middleware (such as Kafka). These tools are suitable for different data subscription scenarios. Below, we will select representative tools and introduce their characteristics, applicable scenarios, and technical advantages.
Self-developed migration tool: OMS
OMS (OceanBase Migration Service) is an enterprise-level data migration and subscription service provided by OceanBase, specifically designed for the OceanBase database ecosystem. It offers a one-stop migration solution from traditional databases to OceanBase.
Core capabilities
- High-performance full and incremental migration: supports rapid migration of TB-level data, with full migration based on logical or physical backups and incremental migration through log parsing (such as MySQL Binlog and Oracle Archive Log)
- Zero-downtime migration: supports smooth switching without interrupting business operations, ensuring business continuity
- Multi-database compatibility: natively supports migration from mainstream databases such as MySQL, Oracle, PostgreSQL, and DB2 without additional adaptation work
- Visual monitoring: provides real-time monitoring of migration progress, latency, and error alerts, supporting end-to-end visual management of migration tasks
Technical advantages
- Native optimization for OceanBase: deeply optimized for OceanBase's distributed architecture and storage engine, resulting in significantly higher migration efficiency compared to general tools
- Low intrusion: only requires reading database logs, without modifying source database configurations, minimizing impact on the source system
- High availability: supports multi-node deployment with automatic failover capabilities, ensuring high availability of migration services
- Data consistency: supports distributed transactions, ensuring data consistency and integrity, meeting enterprise-level data quality requirements
Applicable scenarios
- Enterprise-level database migration projects, especially from traditional databases to OceanBase
- Business scenarios with strict requirements for data consistency and availability
- Large-scale data migration and real-time synchronization needs, such as upgrading core business systems
External migration tool: Flink CDC
Flink CDC is a set of change data capture source connectors for Apache Flink, a distributed stream processing engine, focused on real-time data subscription and streaming computation. It reads database logs directly through these connectors to achieve end-to-end real-time data processing.
Core capabilities
- End-to-end Exactly-Once consistency: ensures data accuracy during transmission and computation, avoiding data duplication or loss
- Flexible data transformation: supports complex business logic, including field mapping, data cleaning, and aggregation calculations
- Multi-source support: supports various data sources such as MySQL, Oracle, PostgreSQL, and MongoDB through CDC connectors
- Unified stream and batch processing: processes real-time stream data and batch data together, simplifying the data processing architecture
Technical advantages
- High-performance computing: supports large-scale parallel processing with throughput reaching millions of TPS, meeting high-concurrency data processing needs
- State management: built-in state storage for complex state calculations and window operations, such as session analysis and real-time aggregation
- Fault tolerance mechanism: based on Checkpoint for fault recovery, ensuring data consistency after failures
- Rich ecosystem: seamlessly integrates with big data components such as Kafka, Hive, and Elasticsearch, building a complete data processing ecosystem
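The window operations mentioned above (such as real-time aggregation) can be illustrated in plain Python: events are grouped into fixed time windows and aggregated per key, the kind of stateful computation Flink runs in parallel at scale. Timestamps and window size here are illustrative, not tied to any Flink API.

```python
# A plain-Python sketch of a tumbling-window aggregation, the kind of
# stateful computation a stream engine like Flink performs continuously.
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count (timestamp, key) events per key within fixed-size windows."""
    counts = defaultdict(int)                # state: (window_start, key) -> count
    for ts, key in events:
        window_start = ts - ts % window_size # window the event falls into
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "a"), (3, "b"), (7, "a"), (12, "a")]
result = tumbling_window_counts(events, window_size=5)
# {(0, 'a'): 1, (0, 'b'): 1, (5, 'a'): 1, (10, 'a'): 1}
```

Flink's contribution is doing this with fault-tolerant state: the window counters are checkpointed so a recovered job resumes aggregation without double-counting.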
Applicable scenarios
- Real-time data analysis and streaming computation, such as real-time reporting and real-time risk control
- Complex data transformation and cleaning needs, such as multi-source data integration and data standardization
- Multi-source data integration and real-time data warehouse construction, such as real-time data lakes and streaming data warehouses
Message middleware: Kafka
Kafka is a distributed event streaming platform that serves as an intermediary and buffer layer in data subscription architectures, connecting data producers and consumers.
Core capabilities
- High-throughput message transmission: sustains throughput on the order of a million messages per second, meeting large-scale data stream processing needs
- Persistent storage: data is persisted to disk with configurable retention; delivery is at-least-once by default, with exactly-once semantics available through idempotent producers and transactions
- Multi-consumer subscription: supports multiple consumers subscribing to the same topic in parallel, enabling data reuse
- Partitioning and replication: supports horizontal scaling and high availability deployments, meeting large-scale cluster requirements
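Partitioning is also what gives Kafka per-key ordering: all changes for the same key hash to the same partition, so one consumer sees them in publish order. The sketch below illustrates the idea; `zlib.crc32` stands in for the partitioner a real Kafka client would use.

```python
# Conceptual sketch of key-based partitioning: events with the same key
# always land in the same partition, preserving their relative order.
# zlib.crc32 is a stand-in, not the actual Kafka client partitioner.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode()) % num_partitions

partitions = {p: [] for p in range(3)}
for key, change in [("order-1", "created"), ("order-2", "created"),
                    ("order-1", "paid"), ("order-1", "shipped")]:
    partitions[partition_for(key, 3)].append((key, change))

# All three "order-1" events are in one partition, in publish order:
# created -> paid -> shipped. Ordering across different keys is not guaranteed.
```

This is why CDC pipelines typically partition by primary key: changes to the same row are consumed in order, while different rows can be processed in parallel.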
Technical advantages
- Decoupled architecture: decouples data producers and consumers, enhancing system flexibility and enabling independent scaling and maintenance
- Buffering capability: buffers data to absorb peak traffic and smooth load fluctuations, protecting downstream systems from being overwhelmed
- Multi-target distribution: a single piece of data can be subscribed to by multiple consumers, enabling data reuse and reducing data acquisition costs
- Horizontal scaling: supports cluster horizontal scaling, meeting large-scale data processing needs with good scalability
Applicable scenarios
- As an intermediate layer for CDC tools, temporarily storing change data, such as data buffering from OMS to the target system
- Building real-time data pipelines to connect different data processing components, such as data streams from databases to analytics systems
- Data buffering and traffic smoothing to handle business peak hours and system fluctuations
Tool selection recommendations
Select based on data source type
Relational databases to OceanBase
Recommended tool: OMS
Applicable scenarios: Migration from databases such as MySQL, Oracle, and PostgreSQL to OceanBase
Core advantages:
- Native optimization: deeply optimized for OceanBase's distributed architecture and storage engine, resulting in significantly higher migration efficiency compared to general tools
- Visual interface: provides real-time monitoring of migration progress, latency, and error alerts, reducing O&M complexity
- Enterprise-level guarantees: supports automated disaster recovery and rollback, providing enterprise-level SLA guarantees
- Low intrusion: only requires reading database logs, without modifying source database configurations
Select based on business scenarios
Real-time data synchronization scenarios
Recommended tool combination: OMS + Kafka
Typical applications:
- Real-time data synchronization between business systems
- Real-time updates to caching systems (Redis, Memcached)
- Real-time index updates in search engines (Elasticsearch, Solr)
- Real-time data pushing to message queues
Applicable scenarios:
- Financial and e-commerce enterprises with high requirements for data consistency
- Online business systems requiring real-time data synchronization
- Small and medium-sized enterprises sensitive to O&M costs
Real-time analysis scenarios
Recommended tool combination: Flink CDC + Kafka + analytics system
Typical applications:
- Real-time data analysis and report generation
- Streaming machine learning and AI applications
- Real-time risk control and monitoring systems
- Real-time recommendations and personalized services
Applicable scenarios:
- Internet enterprises requiring real-time data analysis
- Scenarios with extremely high requirements for data processing performance
- Enterprises with big data technology teams
Data lake construction scenarios
Recommended tool combination: Flink CDC + Kafka + data lake
Typical applications:
- Real-time data lake construction
- Multi-source data integration and unified management
- Real-time data warehouse construction
- Data governance and analysis platforms
Applicable scenarios:
- Large enterprises needing to build a data middle platform
- Multi-line-of-business data integration needs
- Enterprises with high requirements for data governance