The DCE aims to be robust and to sustain the service when a server has a technical problem. It relies on a 3-node ES cluster to ensure that indexing, an essential part of the product, remains available when a server crashes.
Elasticsearch handles replication on its own. By default, it spreads data across the 3 nodes so that the cluster is resilient to the loss of one node and keeps operating without losing data.
Still, this behavior has limits. Even when users spread their ES nodes over 2 different locations, the copies of the same data can be hosted on the same physical server, in the same rack, or in the same datacenter. With the default SonarQube configuration (number of replicas set to 1), if one of these locations fails, the remaining location may not hold all the ES data and the service can become unavailable. This typically happens when the ES nodes are spread over 2 locations and the location hosting 2 of the 3 nodes is lost.
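The failure case above can be checked with a small enumeration. This is a minimal sketch, not SonarQube or Elasticsearch code: the node names (`es1`, `es2`, `es3`), the location labels (`A`, `B`) and the helper functions are all hypothetical. It only relies on the fact that Elasticsearch never places a shard's primary and its replica on the same node.

```python
from itertools import combinations

# Hypothetical layout: 3 search nodes spread over 2 locations.
NODE_LOCATION = {"es1": "A", "es2": "A", "es3": "B"}


def surviving_copies(placement, lost_location):
    """Return the nodes that still hold a copy once a location is lost."""
    return [node for node in placement if NODE_LOCATION[node] != lost_location]


def placements_losing_data(lost_location):
    """List the placements (primary + 1 replica) left without any copy."""
    # Primary and replica of a shard live on two distinct nodes,
    # so every possible placement is a pair of different nodes.
    return [
        pair
        for pair in combinations(sorted(NODE_LOCATION), 2)
        if not surviving_copies(pair, lost_location)
    ]


if __name__ == "__main__":
    for location in ("A", "B"):
        print(location, placements_losing_data(location))
```

Under these assumptions, losing location `B` never removes both copies of a shard, while losing location `A` does for any shard whose two copies sit on `es1` and `es2`.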
We want to offer a higher level of availability to DCE users who are ready to spread their ES nodes over several locations, provided these locations are in the same region, so that the service remains fully operational if one of the locations is lost.
Because of the split-brain problem, 3 availability zones are required to support the expected level of availability.
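The arithmetic behind the 3-zone requirement can be sketched as follows. This is a simplified model that assumes all 3 search nodes are master-eligible and that the cluster needs a strict majority of them to keep operating. The function names and zone labels are hypothetical.

```python
def has_quorum(nodes_per_zone, lost_zone):
    """Check whether a strict majority of nodes remains after losing a zone."""
    total = sum(nodes_per_zone.values())
    remaining = total - nodes_per_zone[lost_zone]
    # Majority is computed against the original cluster size.
    return remaining >= total // 2 + 1


def survives_any_zone_loss(nodes_per_zone):
    """Return True if the cluster keeps a majority whichever zone is lost."""
    return all(has_quorum(nodes_per_zone, zone) for zone in nodes_per_zone)


if __name__ == "__main__":
    two_zones = {"zone-a": 2, "zone-b": 1}
    three_zones = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
    print("2 zones:", survives_any_zone_loss(two_zones))
    print("3 zones:", survives_any_zone_loss(three_zones))
```

With 3 nodes the majority is 2. Any 2-zone layout puts 2 nodes in one zone, so losing that zone leaves a single node, which is below the majority. With one node per zone, any single zone loss leaves 2 nodes.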
- This configuration should remain optional, so that DCE configuration does not become more complex than it currently is.
- And obviously, we want this to be properly documented.
To support the loss of a zone in the Elasticsearch cluster, Data Center Edition users need to spread the 3 search nodes across 3 availability zones.
We'll document this recommendation.