The purpose of clustering is to guarantee the High Availability of SonarQube for our customers.
If a cluster node crashes or an application failure occurs on a node, the service should not be impacted (apart from a possible impact on performance). As part of that, the application should guarantee the consistency of the data that is stored and displayed after the node has disappeared.
For the time being, however, we can face two different kinds of inconsistency:
- In Database
When an operation that writes to the Database is interrupted, the data may remain inconsistent, leaving the operation half done from a functional point of view. Ex: during the deletion of a project.
- In Elasticsearch
If an indexation is ongoing in Elasticsearch and is abruptly stopped, this can lead to discrepancies between the content of Elasticsearch and the Database. Ex: while indexing the results of a report.
Currently, the only way to recover from this situation is to stop the whole cluster, drop the Elasticsearch indexes and restart the cluster. We now want to avoid this option because it requires interrupting the service.
Those inconsistencies can happen in different situations:
- When an action is triggered from the UI (or from a Web Service) but can't complete.
- When the application is executing a background task which is interrupted, such as the integration of an analysis report.
- When a node starts but crashes.
We need to provide mechanisms and/or apply patterns in the code to guarantee the resilience of the application, in order to avoid / eliminate:
- inconsistencies in the Database
- discrepancies between Elasticsearch and the Database.
Whenever possible, SonarQube should prevent those inconsistencies from happening.
In some cases, it can prevent the Database and Elasticsearch from being updated and reject the action in the UI with an error. The user then has to retry the operation.
If inconsistencies do happen, SonarQube should automatically recover from them, possibly through:
- a background job that periodically performs the clean-up
- the next operation that is performed (ex: the integration of a new analysis report, even if it's not on the same project)
We can accept that the remediation takes several minutes to happen (up to 15min?) and that the data therefore remains inconsistent during that time.
The following patterns can be applied to avoid inconsistencies in the Database:
- Pattern 1: for a short operation that affects only a few rows, the application can rely on a single transaction in Database.
- Pattern 2: data can be stored in a temporary B-column and then copied to the final column in a single transaction.
- Pattern 3: for a long operation that updates a lot of data, a flag can be stored in Database to know whether the operation has been completed or has to be replayed. The operation has to be reentrant.
- Pattern 4: the operation is reentrant and the consistency is restored when the operation is played again. Ex: when a new analysis report is integrated.
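The following patterns can be applied to avoid discrepancies between Elasticsearch and the Database:
- Pattern 1: The application tracks in the Database the changes that need to be indexed in Elasticsearch: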
- When an entry is inserted/updated/deleted in the Database, the operation to be done in Elasticsearch is referenced in an event log.
- The log contains only updates of whole documents i.e. index and delete actions. In other words, the indexation mechanism doesn't perform partial updates to documents.
- As soon as an action is completed, it is removed from the event log.
- To prevent operations on the same document from being executed in an incorrect order, the document, or rather the actions on the document, are versioned with an automatically incremented id. An operation that would be executed "too late" is silently discarded by Elasticsearch.
- Indexation operations on Elasticsearch have to be idempotent. Ex: if we let Elasticsearch assign the ID for a new document, retrying the operation can lead to duplicates.
- A version of a document should include all the changes previously done on this document. Ex: update 2 on an issue must contain the changes done by update 1 on this issue.
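The versioning and idempotency rules above can be sketched as follows. `VersionedIndex` is a hypothetical in-memory stand-in for an index that compares an externally supplied version before applying an action; it is not the Elasticsearch client API. The names `IndexAction` and `apply` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IndexAction:
    doc_id: str              # ID chosen by the application, never by the index
    version: int             # taken from an automatically incremented sequence
    source: Optional[dict]   # the whole document; None means "delete"


class VersionedIndex:
    """In-memory stand-in for an index that checks an external version."""

    def __init__(self):
        self._docs = {}      # doc_id -> document source
        self._versions = {}  # doc_id -> last applied version (kept after delete)

    def apply(self, action):
        current = self._versions.get(action.doc_id, 0)
        if action.version <= current:
            # Late or replayed action: discard it without touching the document.
            return False
        self._versions[action.doc_id] = action.version
        if action.source is None:
            self._docs.pop(action.doc_id, None)
        else:
            self._docs[action.doc_id] = dict(action.source)
        return True

    def get(self, doc_id):
        return self._docs.get(doc_id)
```

Because every action carries the whole document and an application-assigned ID, applying the same action twice, or applying an older action after a newer one, leaves the index unchanged. The version is kept after a deletion so that a late index action cannot resurrect the document.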
- Pattern 1.a:
Web services and background tasks continue to trigger the indexation in Elasticsearch. In addition, on a periodic basis, the application checks the log to see whether some documents, older than 10min for example, are still to be indexed / deleted. If so, the application retries the corresponding operations in Elasticsearch.
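The periodic check of pattern 1.a could look like the sketch below. The log is modeled as a plain Python list and `index_fn` stands for whatever call performs the indexation; both are assumptions made for illustration, as are the names `LogEntry` and `recover_stale_entries`. The 600-second threshold mirrors the 10min example above.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    doc_id: str
    created_at: float  # seconds since the epoch


def recover_stale_entries(log, index_fn, now, max_age_seconds=600):
    """Retry entries older than max_age_seconds and remove those that succeed."""
    recovered = 0
    for entry in list(log):  # iterate over a copy, since the log is mutated
        if now - entry.created_at < max_age_seconds:
            continue  # recent entry: the originating request may still handle it
        try:
            index_fn(entry.doc_id)
        except Exception:
            continue  # the entry stays in the log and is retried on the next run
        log.remove(entry)
        recovered += 1
    return recovered
```

A scheduler would call `recover_stale_entries` at a fixed interval. Since the retried operations are idempotent, it does not matter if the original request eventually completes as well.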
- Pattern 1.b:
The indexation is performed by external workers. All the workers are notified when operations are added to the log. When a worker is available, it takes charge of an operation or a group of operations.
A recovery mechanism handles the case where the worker disappears:
- In the log, an ongoing operation is associated with the id of the worker that performs the action.
- When a worker disappears, the worker id is removed from the log and the other workers are notified so that they can take over the corresponding operations.
- Important considerations:
- Operations don't need to be reentrant.
- Because the indexation is done asynchronously, Elasticsearch is not immediately consistent with the Database. The different operations and the UI have to support this temporary inconsistency.
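The claim and recovery steps of pattern 1.b can be sketched with an in-memory log. The class `EventLog` and its methods are hypothetical names; in a real deployment the log would live in the Database and a claim would need to be an atomic update so that two workers cannot take the same operation.

```python
class EventLog:
    """In-memory stand-in for the event log shared by the workers."""

    def __init__(self):
        self._owners = {}   # operation id -> worker id, or None if unclaimed
        self._doc_ids = {}  # operation id -> document id
        self._next_id = 1

    def add(self, doc_id):
        op_id = self._next_id
        self._next_id += 1
        self._owners[op_id] = None
        self._doc_ids[op_id] = doc_id
        return op_id

    def claim(self, worker_id, batch_size=10):
        # An available worker takes a group of unclaimed operations.
        free = [op for op, owner in self._owners.items() if owner is None]
        claimed = free[:batch_size]
        for op_id in claimed:
            self._owners[op_id] = worker_id
        return claimed

    def complete(self, op_id):
        # A completed operation is removed from the log.
        self._owners.pop(op_id, None)
        self._doc_ids.pop(op_id, None)

    def release_worker(self, worker_id):
        # Called when a worker has disappeared: its operations become
        # claimable again by the remaining workers.
        released = 0
        for op_id, owner in list(self._owners.items()):
            if owner == worker_id:
                self._owners[op_id] = None
                released += 1
        return released
```

How the disappearance of a worker is detected (heartbeat, cluster membership event, timeout) is left open here, as it is in the description above.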
- Pattern 1.c: Variant of pattern 1.b to avoid temporary inconsistency in the nominal case
As in pattern 1.b, the indexation is performed by external workers. But, this time, the application waits for the indexation to really happen before it answers. Since Elasticsearch can take time to refresh on its own (up to 1s by default), a worker is notified when some indexations are completed and forces the refresh so that it doesn't take long (for example < 50ms) for an index to be up to date. The indexation is considered done, and the application answers, only when the refresh has been completed.
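The effect of the forced refresh can be shown with a toy model. `RefreshingIndex` is a hypothetical in-memory stand-in that, like the behavior described above, only makes indexed documents searchable after a refresh; it does not use the Elasticsearch client, and `index_and_wait` is an assumed name for the "answer only after refresh" step.

```python
class RefreshingIndex:
    """Toy index: documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet searchable
        self._searchable = {}  # visible to searches

    def index(self, doc_id, source):
        self._buffer[doc_id] = source

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search(self, doc_id):
        return self._searchable.get(doc_id)


def index_and_wait(index, doc_id, source):
    # The caller gets control back only once the document is searchable,
    # so the response never precedes the visibility of the change.
    index.index(doc_id, source)
    index.refresh()
```

In this model, a document written with `index` alone is not returned by `search` until some later refresh, whereas `index_and_wait` makes it visible before returning.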
- Nice to have:
As an optimization, the indexation mechanism could discard intermediate operations that are scheduled for a document to execute only the latest indexation/deletion of this document.
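A minimal sketch of this optimization, assuming each scheduled action is a `(doc_id, version, source)` tuple where `source` is the whole document or `None` for a deletion (this tuple layout and the name `coalesce` are assumptions made for the example):

```python
def coalesce(actions):
    """Keep only the highest-versioned action for each document."""
    latest = {}
    for doc_id, version, source in actions:
        kept = latest.get(doc_id)
        if kept is None or version > kept[1]:
            latest[doc_id] = (doc_id, version, source)
    # Return the surviving actions in version order.
    return sorted(latest.values(), key=lambda action: action[1])
```

Dropping the intermediate actions is safe only because the log holds whole-document index and delete actions, and each version includes all earlier changes to the document.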
- Pattern 2: The operation is reentrant and is technically wholly executed by an asynchronous background task. The consistency is restored when the background task plays (again) the operation. Ex: re-injection of an analysis report, asynchronous deletion of a project.
- Pattern 3:
Consistency of displayed information is guaranteed by a check in the application. Ex: when looking for a project which is being deleted, the application checks in the Database whether the project has effectively been deleted or not. This is not really a sustainable pattern since it hides the problem rather than solving it, and it is risky in terms of maintainability.