The Data Center Edition provides a cluster-based solution for users who want to benefit from High Availability.
Also, starting with the Enterprise Edition, SonarQube provides a way to handle a greater throughput of analyses by increasing the number of Compute Engine workers.
But even with a cluster, SonarQube scalability remains limited: on one side by the memory that can be allocated to the Compute Engine on the nodes, and on the other by the current "official" DCE architecture, which allows a maximum of 2 application nodes.
We want to unleash the potential of the DCE so that it becomes an option for scalability.
There's no real need to add additional search nodes in a cluster. The DCE architecture already recommends setting up 3 search nodes for HA, and this setup is enough to handle a heavy load.
But, we want ops to be able to add application nodes to increase computing capabilities.
Of course, the DCE should keep ensuring High Availability.
We are not aiming to automatically adjust the number of nodes to the load. Still, allowing scalability implies the following improvements:
Ops should have an indicator they can monitor to know if there’s a need for scaling.
It can be:
- The number of tasks in the Compute engine queue
- The maximum pending time for the tasks in the queue
Since portfolio computation can add many tasks to the queue, the maximum pending time seems to be the better indicator. In other words, this time corresponds to the waiting time of the oldest task in the queue.
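As a minimal sketch, assuming each queued task carries its submission timestamp in epoch milliseconds (the function and parameter names are illustrative assumptions, not SonarQube internals), the pending time could be computed as:

```python
import time

def max_pending_time_ms(queued_submitted_at_ms, now_ms=None):
    """Pending time (ms) of the oldest task still waiting in the queue.

    queued_submitted_at_ms: submission timestamps (epoch ms) of queued tasks.
    Returns 0 when the queue is empty.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not queued_submitted_at_ms:
        return 0
    # The oldest task has the smallest submission timestamp,
    # hence the largest waiting time.
    return now_ms - min(queued_submitted_at_ms)
```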
SonarQube will expose the pending time:
- As a new ComputeEngineTasks JMX attribute (see https://docs.sonarqube.org/latest/instance-administration/monitoring/):
PendingTime: Pending time (in ms) of the oldest Background Task waiting to be processed. This measure, together with PendingCount, helps you know whether analyses are stacking up and taking too long to start being processed. It can help you evaluate whether it would be worthwhile to increase SonarQube performance by configuring additional Compute Engine workers (Enterprise Edition) or additional application nodes (Data Center Edition)
- In the response of the web service api/ce/activity_status:
Ex: "pendingTime": 489156
This web service will become public.
- In the Background Tasks page (project and global pages):
Ex: "25 pending (for 8 minutes)"
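For example, a monitoring script could poll api/ce/activity_status and alert when the pending time exceeds a threshold. This is a hypothetical sketch: the decision function and the threshold are assumptions; only the pendingTime and pending fields come from the web service response described above.

```python
def needs_more_nodes(activity_status, max_pending_ms=300_000):
    """Decide whether the queue is lagging, based on the activity_status payload.

    activity_status: parsed JSON from api/ce/activity_status; 'pendingTime'
    is the field introduced above, 'pending' the number of queued tasks.
    max_pending_ms: illustrative threshold (an assumption, not a product default).
    """
    pending_time = activity_status.get("pendingTime", 0)
    pending_count = activity_status.get("pending", 0)
    # Only alert when tasks are actually waiting longer than the threshold.
    return pending_count > 0 and pending_time > max_pending_ms

# Example payload, matching the documented sample value:
status = {"pending": 25, "inProgress": 4, "pendingTime": 489156}
```

An operator could wire such a check into any monitoring system; the same pending time is also exposed through the JMX attribute above.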
Ops should have the possibility to dynamically start a new node, without having to restart all the other nodes.
Without replicating the configuration on the different nodes, a node that restarts doesn't know about the other nodes.
Some ideas to solve this problem:
- Use by default broadcasting with Hazelcast - dangerous when several prod / pre-prod environments co-exist
- A companion service that handles the global configuration
- A shared configuration folder / file is passed at startup
The simplest idea remains to document, as part of the process, that ops should always keep the configuration up to date on all the nodes.
An application node that starts with a configuration that is inconsistent with the other nodes should log a warning so that ops know the configuration needs to be updated.
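For reference, keeping the cluster section of sonar.properties identical on every node would mean replicating entries such as the following (a hedged sketch based on the documented sonar.cluster.* properties; host and node names are placeholders):

```properties
# Identical on all nodes
sonar.cluster.enabled=true
sonar.cluster.hosts=app-node-1,app-node-2,app-node-3
sonar.cluster.search.hosts=search-node-1,search-node-2,search-node-3

# Per-node settings (the only lines that differ between nodes)
sonar.cluster.node.type=application
sonar.cluster.node.name=app-node-3
```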
Ops should be able to stop a node without having to shutdown the whole cluster, in order to downscale or simply replace a node by another.
To avoid generating errors and risking data inconsistency, a graceful shutdown is required at node level. It should make the Compute Engine stop processing new tasks and stop the server only when all the ongoing tasks are completed.
For that purpose:
- Stopping an application node or a single-node instance will, by default, perform a graceful shutdown of the node
- A new force-stop option will be offered to give the possibility to abort the in-progress tasks and immediately shut down an application node
- On a search node, stop and force-stop will behave the same way
To benefit from this graceful stop while shutting down the server SonarQube is installed on, we recommend shutting down SonarQube first. Indeed, some operating systems may give the processes only a couple of seconds to complete before forcing the stop.
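The drain step of such a graceful stop can be sketched as a loop that waits for in-progress tasks to finish before stopping the server (a hypothetical helper; the fetch_in_progress callback and the timing parameters are assumptions, not product behavior):

```python
import time

def wait_for_ce_drain(fetch_in_progress, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Block until the Compute Engine has no task in progress.

    fetch_in_progress: callable returning the current number of in-progress
    tasks (e.g. read from api/ce/activity_status).
    Returns True if the queue drained within timeout_s, False otherwise
    (the caller could then fall back to a force-stop).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_in_progress() == 0:
            return True
        sleep(poll_s)
    return False
```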
The graceful shutdown will also be available in other editions with the same effect.
In CE, DE, and EE editions:
- The confirmation window in the System info page will explain that the server restart can take a bit of time:
Are you sure you want to restart the server? The restart will first wait for ongoing background tasks to complete.
- We could then display a message to administrators in the System page explaining that the server is preparing to shut down (nice to have)
Configuring multiple workers (up to 10) should still be possible to leverage the capabilities of each node.
Documentation should now reflect that:
- A new indicator shows the load of the instance
- Additional application nodes can be dynamically started
- There's a proper way to individually shutdown an application node
- Plugins need to be installed on the application nodes only
- Some tasks may not be properly handled by the cluster if the leader crashes. This can be the case for:
- License expiration notification
- License LOC threshold notification
- Some improvements can be done to be more robust (see SONARCLOUD-310):
- Abort analysis report tasks in the Compute Engine after a configurable maximum amount of time
- Clean up orphan tasks out of Hazelcast, to improve our resilience when Hazelcast is in trouble
- The DCE should better help ops who want to closely monitor the nodes.
It currently provides data in 2 ways:
- A high level operational monitoring via the health WS
- JMX data
- Changing the number of Compute Engine workers currently requires restarting the cluster for the value to be applied on all the nodes.
With the DCE, this property should rather be considered a system parameter and be configured via sonar.ce.workerCount in sonar.properties. It should no longer appear in the Background Tasks page.
The maximum value should remain 10. If a value greater than 10 is configured, the maximum value should apply and a message in ce.log should mention this adjustment.
- Removing Hazelcast from the architecture could simplify the configuration and thus the operability.
For the record, Hazelcast is currently required to:
- Change the log level on the application logs
- Display the status in system information
- Elect a web leader at startup to avoid every application node performing an upgrade
- Clean up orphan Compute engine tasks
- Lock recurrent tasks
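The worker-count clamping described for sonar.ce.workerCount could look like the following sketch (the function name and message wording are assumptions; only the maximum of 10 and the ce.log mention come from the text above):

```python
MAX_WORKER_COUNT = 10  # documented maximum number of Compute Engine workers

def effective_worker_count(configured, log=print):
    """Clamp the configured sonar.ce.workerCount to the supported maximum.

    When the configured value exceeds the maximum, the maximum applies and a
    message is emitted (in SonarQube this would go to ce.log).
    """
    if configured > MAX_WORKER_COUNT:
        log(f"sonar.ce.workerCount={configured} exceeds the maximum; "
            f"using {MAX_WORKER_COUNT} workers instead")
        return MAX_WORKER_COUNT
    return configured
```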