Background
OceanBase Cloud Platform (OCP) V3.3.0 provides a data mid-end platform to support the full-link tracing feature of OceanBase Database V4.0. The platform integrates features such as full-link tracing and log retrieval, which rely on an OpenSearch cluster to store and query data. The OpenSearch cluster can be deployed together with the OCP cluster, which makes it easier to export data from a private cloud instance and reduces deployment and learning costs.
Prerequisites
You have prepared the resources for an OpenSearch cluster based on the Estimated specifications of an OpenSearch cluster section at the end of this topic.
Deploy an OpenSearch cluster
Obtain the OpenSearch Docker image and store it on the OCP host. You can run the following command to obtain the image:
Note
The latest version of the image is V3.3.2.
docker pull oceanbase/opensearch:3.3.2
After you store the image on the OCP host, run the following command to start the image:
docker run -d --net host --env OPENSEARCH_USERNAME=xxx --env OPENSEARCH_PASSWORD=xxx --env OPENSEARCH_NODE_URLS=ip1,ip2,ip3 --env OPENSEARCH_JVM_HEAP=4g --name ocp-opensearch reg.docker.alibaba-inc.com/oceanbase-platform/ocp-opensearch:3.3.2-x64
Notice
- You must start the OpenSearch Docker image in host network mode.
- To mount a host directory to the OpenSearch directory, you must first specify that only the admin user is allowed to mount the host directory. The UID and GID of the admin user in the container are both 500. If they differ from the UID and GID of the admin user on the host, the mount fails. To resolve this issue, start the container, log on to it, run the chown admin:admin /data/1/opensearch command, and then restart the container.
The following table describes environment variables related to the start of the image:
| Environment variable | Description |
|---|---|
| OPENSEARCH_USERNAME | Required. The username used to log on to OpenSearch and cerebro. |
| OPENSEARCH_PASSWORD | Required. The password used to log on to OpenSearch and cerebro. |
| OPENSEARCH_NODE_URLS | Required. The node list of OpenSearch. |
| OPENSEARCH_JVM_HEAP | Optional. The Java virtual machine (JVM) memory of OpenSearch. We recommend that you set this parameter to half of the node memory and not greater than 32 GB. Default value: 4 GB. |
| OPENSEARCH_HTTP_PORT | Optional. The port to access OpenSearch from the Internet. Default value: 9200. |
| OPENSEARCH_TCP_PORT | Optional. The port for communication between OpenSearch nodes. Default value: 9300. |
| CEREBRO_PORT | Optional. The port used by cerebro. Default value: 9100. |
| ELASTICSEARCH_EXPORTER_PORT | Optional. The port used by elasticsearch_exporter. Default value: 9114. |

Connect OCP to OpenSearch.
After you start OCP, you need to add the endpoints and credentials of OpenSearch on the System Parameters page. For more information, see Configure trace query parameters.
OpenSearch services
The OCP-OpenSearch Docker image integrates three services: OpenSearch, elasticsearch_exporter, and cerebro.
OpenSearch
OpenSearch is a community-driven distributed search and analysis engine based on Apache Lucene. It is derived from Elasticsearch 7.10.2 and is licensed under the Apache License 2.0. After you add data to OpenSearch, you can perform full-text search, search by field, search across multiple indexes, boost field weights, rank results by score, sort results by field, and aggregate results.
elasticsearch_exporter
elasticsearch_exporter exports various OpenSearch metrics in the Prometheus format and is already integrated with OCP Prometheus. It is open source under the Apache License 2.0. elasticsearch_exporter allows you to obtain the monitoring data of OpenSearch, including index performance, JVM memory usage and garbage collection, cluster health and node availability, and system and network metrics.
cerebro
cerebro is an OpenSearch management tool that allows you to perform O&M operations on OpenSearch from a web interface. You can use cerebro to create and delete indexes and modify cluster parameters. cerebro is also open-source software and is licensed under the MIT License.
Log directory
Log files are stored in the /home/admin/logs directory.
Estimated specifications of an OpenSearch cluster
Disk space
When you determine the disk space, consider the following factors:
- The number of replicas.
- The index overhead, which is usually 10% of the source data volume.
- The space reserved for the operating system. By default, 5% of the disk space is reserved.
- The OpenSearch internal overhead. In general, 20% of the disk space is reserved for internal operations, such as segment merging and log storage.
- Security threshold. In general, at least 20% of the disk space is reserved.
Therefore, the minimum disk space required can be calculated by using the following formula: Minimum disk space = Source data volume × (1 + Number of replicas) × 1.7. You can use this formula to estimate the disk space that you need to prepare.
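As a worked example of the formula above, the following sketch computes the minimum disk space in GB. The function name min_disk_gb is hypothetical and is not part of OCP or OpenSearch. The sketch assumes whole-GB inputs, applies the factor 1.7 as 17/10 in integer arithmetic, and rounds the result up.

```shell
# Hypothetical helper, not part of OCP or OpenSearch.
# Minimum disk space = source data volume * (1 + number of replicas) * 1.7
# Inputs are whole GB. The factor 1.7 is applied as 17/10, rounded up.
min_disk_gb() {
  local source_gb=$1
  local replicas=$2
  echo $(( (source_gb * (1 + replicas) * 17 + 9) / 10 ))
}

# 100 GB of source data with one replica
min_disk_gb 100 1   # prints 340
```

With 100 GB of source data and one replica, the estimate is 100 × 2 × 1.7 = 340 GB.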
CPU and memory resources
OpenSearch spends most of its computing resources on data writes and queries and does not place high demands on the CPU. We recommend that you configure a memory size of 16 GB, 32 GB, or 64 GB for each node, and allocate to OpenSearch no more than half of the total memory size, up to a maximum of 32 GB.
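The memory recommendation can be sketched as a small helper that derives a value for OPENSEARCH_JVM_HEAP from the node memory. The function name recommended_heap is hypothetical. The sketch assumes that the node memory is given in whole GB and that the variable accepts values in the form used by the start command in this topic, such as 4g.

```shell
# Hypothetical helper, not part of OCP or OpenSearch.
# Heap = half of the node memory, capped at 32 GB.
recommended_heap() {
  local node_mem_gb=$1
  local heap_gb=$(( node_mem_gb / 2 ))
  if [ "$heap_gb" -gt 32 ]; then
    heap_gb=32
  fi
  echo "${heap_gb}g"
}

recommended_heap 16    # prints 8g
recommended_heap 64    # prints 32g
recommended_heap 128   # prints 32g (capped)
```

The result can be passed to the container, for example as --env OPENSEARCH_JVM_HEAP=$(recommended_heap 64).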
Shard evaluation
The size and number of shards are important factors that affect the stability and performance of an OpenSearch cluster. We recommend that you divide indexes into shards based on the following strategies:
- Perform data tests on OpenSearch before index sharding. If a large amount of data is involved, you must reduce data writes to relieve the OpenSearch workload. In this case, we recommend that you configure multiple primary shards with one replica for each. If a small amount of data is involved with moderate data writes, you can configure one primary shard with multiple replicas or one primary shard with one replica.
- We recommend that you limit the size of a shard to 30 GB or less to ensure optimal performance, and increase it to no more than 50 GB only in special circumstances. A large shard slows down the recovery of OpenSearch, whereas small shards lead to a large number of shards. Each shard consumes a certain amount of CPU and memory resources, so a great number of small shards can cause problems such as poor read and write performance and insufficient memory. If the evaluation indicates that a shard would be larger than 50 GB, we recommend that you properly divide shards before you create an index, and then perform reindexing later. This method is time-consuming but ensures zero downtime.
- We recommend that the total number of primary and replica shards be equal to or an integer multiple of the number of data nodes. This way, shards are evenly distributed across all nodes, which facilitates load balancing.
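The shard size and node count strategies above can be combined into a rough starting point. The following sketch is one possible interpretation, not an official sizing tool: the function name suggest_primary_shards is hypothetical, the index size is an estimate in whole GB, and rounding the number of primary shards up to a multiple of the node count is an assumption that also keeps the total number of primary and replica shards a multiple of the node count.

```shell
# Hypothetical helper, not part of OCP or OpenSearch.
# Arguments: estimated index size in GB, number of data nodes,
# and an optional maximum shard size in GB (default: 30).
suggest_primary_shards() {
  local size_gb=$1
  local nodes=$2
  local max_shard_gb=${3:-30}
  # Smallest number of shards that keeps each shard within the size limit
  local by_size=$(( (size_gb + max_shard_gb - 1) / max_shard_gb ))
  if [ "$by_size" -lt 1 ]; then
    by_size=1
  fi
  # Round up to the next multiple of the node count
  echo $(( (by_size + nodes - 1) / nodes * nodes ))
}

suggest_primary_shards 200 3      # prints 9
suggest_primary_shards 200 3 50   # prints 6
```

For a small index with moderate writes, a single primary shard may still be the better choice, as described in the first strategy. Validate any result with data tests before you create the index.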