Performance troubleshooting steps
Use the following methods to determine whether the Store component has performance bottlenecks.
Query the number of records processed per second from the CDC log.
grep NEXT_RECORD_RPS libobcdc.log
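To observe the trend rather than a single sample, you can tail the most recent matches (a minimal sketch; the fields surrounding the NEXT_RECORD_RPS keyword may vary across libobcdc versions):

# Show the 20 most recent RPS samples to check whether the rate drops over time
grep NEXT_RECORD_RPS libobcdc.log | tail -n 20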
If the CDC processing speed is slower than the rate at which business data is generated on the source, run the following command to check whether the CDC process has triggered traffic control on the OMS Community Edition server.
grep "NEED_SLOW_DOWN=1 PAUSED=1" libobcdc.logNEED_SLOW_DOWN=1indicates that the traffic control is triggered because the memory usage is high, which limits the log pulling efficiency. CDC is paused to avoid further increasing the system pressure when the traffic control is triggered due to issues such as I/O or server load.You can modify the
memory_limitparameter to adjust the throttling threshold. View the current value in the/home/ds/store/store{port}/etc/libobcdc.conffile and increase the parameter value if necessary. Here is an example:liboblog.memory_limit=20G liboblog.part_trans_task_active_count_upper_bound=500000If the traffic control is not triggered, query the logs for CLOG pulling to check the RPC latency.
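Before editing the file, you can check the values currently in effect (a minimal sketch; replace {port} with the actual Store port as in the path above):

# Inspect the current throttling-related settings in the Store configuration
grep -E "memory_limit|part_trans_task_active_count_upper_bound" /home/ds/store/store{port}/etc/libobcdc.conf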
If traffic control is not triggered, query the CLOG pulling logs to check the RPC latency.

grep do_stat libobcdc.log

[2025-04-21 16:05:13.905681] INFO [TLOG.FETCHER] do_stat (ob_log_ls_fetch_stream.cpp:309) [20155][][T0][Y0-0000000000000000-0-0] [lt=9] [STAT] [FETCH_STREAM] stream="xxx.xxx.xxx.1:2882"(0x7fa62d4131f0:HOT)({tenant_id:1028, ls_id:{id:1002}})(FETCHED_LOG:153.11GB) traffic=41.85MB/sec log_size=438879806 size/rpc=13.50MB log_cnt/rpc=946 rpc_cnt=31(3/sec) single_rpc=0(0/sec)(upper_limit=0(0/sec),max_log=0(0/sec),no_log=0(0/sec),max_result=0(0/sec)) rpc_time=312357 svr_time=(queue=41,process=224677) net_time=(l2s=1146,s2l=83859) cb_time=2632 handle_rpc_time=13739 flush_time=860 read_log_time=12870(log_entry=2600,trans=0) trans_count=0 trans_size=0.00B

In this log, rpc_time=312357 svr_time=(queue=41,process=224677) indicates that the RPC latency is 312 ms and that the server spent about 224 ms processing the RPC (the values are in microseconds). The RPC latency is normally only several tens of milliseconds, so this value is excessively high. In this case, query the OBServer logs and adjust the relevant parameters.

Keywords in the OBServer log:
fetch_log done. This log line prints the statistics of log pulling. If the value of fetch_archive_time in this line is not 0, some logs are being fetched from the archive rather than from the local CLOG disk; increase the value of log_disk_size to expand the storage space for CLOG.
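To locate the relevant entries quickly, you can filter the OBServer log for statistics lines whose fetch_archive_time is not 0 (a hedged sketch; the log file name and field layout may differ across OceanBase versions):

# List fetch_log statistics, skipping entries where fetch_archive_time=0
grep "fetch_log done" observer.log | grep -v "fetch_archive_time=0"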
After the Store component is ruled out as the cause, check the performance-related parameters of the Full-Import/Incr-Sync component.
Performance-related configurations for Full-Import/Incr-Sync components
Setting useSchemaCache to true in the Source is sufficient for most scenarios. If the required records per second (RPS) is still not met, you can also set buildRecordConcurrent to true.
Source
| Parameter | Description |
|---|---|
| useSchemaCache | Specifies whether to cache the schema. Valid values: true and false. Default value: false. If you set this parameter to true, the Store component caches the schema when reading data, which accelerates the message conversion of the Store. |
| buildRecordConcurrent | Specifies whether to convert Store messages asynchronously. Valid values: true and false. If you set this parameter to true, pulling data from the Store and converting messages are performed in parallel. The number of parallel threads equals workerNum. |
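To verify which of these switches the incremental component is actually running with, you can check the effective configuration (a minimal sketch; conf/runningConf.json is the effective-configuration file shown later in this section):

# Check the effective Source tuning parameters
grep -E "useSchemaCache|buildRecordConcurrent|workerNum" conf/runningConf.json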
Sink
The first two parameters in the following table configure the properties of the Kafka producer client.
| OMS Community Edition parameter | Corresponding Kafka client parameter | Description |
|---|---|---|
| lingerMs | ProducerConfig.LINGER_MS_CONFIG | The time the Kafka producer waits before sending a batch of data. To increase throughput, increase this value so that more data accumulates in each batch. Default value: 10. Unit: milliseconds. |
| batchSize | ProducerConfig.BATCH_SIZE_CONFIG | The maximum size of each batch sent by the Kafka client. Default value: 1048576 bytes (1 MB). |
| workerNum | N/A | The number of concurrent worker threads of the Sink. Default value: 16. |
If enablePreprocessConfig is set to true in the coordinator, lingerMs and batchSize are automatically configured based on the JVM memory. If you configure these two parameters manually, your configuration takes precedence.
# View the automatically configured parameters
grep "auto set " connector.log
# View the configurations finally used by the system
cat conf/runningConf.json
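For reference, here is how the two OMS parameters map onto standard Kafka producer client properties, using the default values from the table above (an illustrative sketch; OMS sets these on the producer for you, so you normally tune lingerMs and batchSize rather than Kafka properties directly):

# Kafka producer properties corresponding to the OMS parameters
linger.ms=10         # ProducerConfig.LINGER_MS_CONFIG, set via lingerMs
batch.size=1048576   # ProducerConfig.BATCH_SIZE_CONFIG, set via batchSize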
Coordinator
Shuffle-related configurations of OMS Community Edition
| Parameter | Description |
|---|---|
| shuffleBucketSize | The number of buckets. OMS Community Edition reads and sends one batch of data from a bucket, then reads and sends the next batch. The number of buckets determines how many records can be sent concurrently. Default value: 128. |
| shuffleFlushIntervalMs | The interval at which bucketed data is read. The smaller the interval, the lower the latency. Unit: milliseconds. Default value: 100. |
| shuffleMinBatchSize | A bucket is read and sent only after the number of records in it reaches the value of this parameter. If a bucket holds fewer records, the system waits for shuffleFlushIntervalMs and then reads and sends them anyway. Default value: 20. |
| shuffleMaxBatchSize | The maximum number of records read and sent at a time. Default value: 64. |
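With the default values, you can estimate a rough throughput ceiling: each flush cycle reads at most shuffleBucketSize × shuffleMaxBatchSize = 128 × 64 = 8,192 records, and shuffleFlushIntervalMs = 100 yields about 10 cycles per second, or roughly 82,000 records per second. This is only an estimate that assumes each bucket is read once per interval; actual throughput also depends on Sink concurrency and Kafka batching.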
Use Arthas for performance analysis
# Log in to the OMS Community Edition container
# Copy the Arthas package to the home directory of the ds user
cp /root/arthas-bin.zip /home/ds
# Switch to the ds user and unpack Arthas
su - ds
unzip arthas-bin.zip
# Replace <pid> with the process ID of the incremental component
/opt/alibaba/java/bin/java -jar arthas-boot.jar <pid>
# Start sampling (CPU profiling by default)
profiler start
# Check the number of samples collected so far
profiler getSamples
# Check the profiler status and how long it has been running
profiler status
# Enter the stop command after waiting for 1 minute. This will generate an HTML file containing flame graphs.
profiler stop --format html
# Exit Arthas.
exit
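After profiler stop completes, Arthas typically writes the flame graph HTML file to the arthas-output directory under its working directory. Open the file in a browser; the widest frames indicate the methods that consume the most CPU time.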