Details

    • Type: MMF
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Labels:

      Description

      WHY

      In the context of continuous delivery, SonarCloud is upgraded to a new version on a weekly basis. Upgrading requires a restart, which takes a minimum of 2 minutes and may last up to 20 minutes.
      SonarCloud now has ~30k users, and weekly downtime is not acceptable.

      WHAT

      As a first step, we should be able to upgrade the SonarCloud platform without any downtime when:

      • there are no DB migrations
      • there are no ElasticSearch changes (upgrade, index changes, new indices)
      • no plugins are added or removed
      • no rules are relocated to new keys (new feature of SQ 7.1)
      • no properties are relocated to new keys
      • no upgrades of billing backend (Muppet)

      Upgrading plugins, including analyzers, is supported. Rules, Quality profiles, metrics and settings may be upgraded.

      Important - supporting zero-downtime deployment does not imply a fully automated process from development to production.

      The strategy is to use blue/green deployment:

      • the cluster is initially in version N ("blue"). It is up and running and handles all HTTP requests.
      • a second cluster in version N+1 ("green") is started in the background. It accepts HTTP requests only from the internal network, for validation.
      • when the green cluster is operational, public HTTP traffic is fully redirected to the green cluster. The blue cluster no longer accepts HTTP requests.
      • the blue cluster is stopped

      This strategy implies supporting two versions of the backend up and running at the same time, on the same data stores. It affects:

      Analyses

      Scanners are executed continuously and connect to the web servers for two needs: loading referentials (plugins, metrics, rules, ...) and sending the analysis report to Compute Engine. The switch from the blue to the green version may occur:

      • during the analysis on the scanner side, raising potential data discrepancies between the different web service calls. For example:
        1. the scanner downloads the JAR of the blue version of an analyzer
        2. the analyzer is upgraded to the green version in the backend and activates new rules that do not exist in the blue version
        3. the scanner loads the list of active rules. The new rules are not supported by the engine of the blue version.
      • between the end of analysis and the beginning of Compute Engine processing: the analysis report sent by a blue scanner may be processed by a green Compute Engine.

      These discrepancies should not raise failures and should be ignored (a scanner-side sketch follows the list):

      • scanner should not configure rule engines with new unsupported rules or new unsupported rule parameters
      • rule engines should support removal of rule parameters in backend
      • Compute Engine should support analysis reports containing removed rules, removed Quality profiles, removed languages or removed metrics by silently ignoring them.
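
      A minimal sketch of the scanner-side filtering under these constraints. The types below are illustrative only (they do not exist in the scanner codebase); they model the active rules returned by the (possibly green) server and the rule definitions declared by the locally loaded (blue) analyzer:

        import java.util.List;
        import java.util.Map;
        import java.util.Set;
        import java.util.stream.Collectors;

        public class ActiveRuleFilter {

          // Hypothetical view of a rule definition declared by the locally loaded analyzer.
          record RuleDefinition(String key, Set<String> parameterKeys) {}

          // Hypothetical view of an active rule returned by the server.
          record ActiveRule(String key, Map<String, String> parameters) {}

          /**
           * Drops active rules that the loaded analyzer does not declare, and drops
           * parameters that the declared rule does not support, so the rule engine is
           * never configured with rules or parameters it cannot handle.
           */
          static List<ActiveRule> keepSupportedRules(List<ActiveRule> serverActiveRules,
                                                     Map<String, RuleDefinition> analyzerRulesByKey) {
            return serverActiveRules.stream()
              // unknown rule key -> the rule was introduced by a newer analyzer, ignore it
              .filter(active -> analyzerRulesByKey.containsKey(active.key()))
              .map(active -> {
                RuleDefinition definition = analyzerRulesByKey.get(active.key());
                // unknown parameter -> introduced by a newer analyzer, silently drop it
                Map<String, String> supportedParams = active.parameters().entrySet().stream()
                  .filter(p -> definition.parameterKeys().contains(p.getKey()))
                  .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
                return new ActiveRule(active.key(), supportedParams);
              })
              .collect(Collectors.toList());
          }
        }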

      To keep things simple, only a single version of Compute Engine workers should run at the same time, and the workers should be disabled while the green startup synchronizes referentials. That is achieved by pausing the blue workers before the installation of the green cluster. All in-progress tasks are finished, and pending tasks are temporarily not processed.

      Web application

      • Users should keep their web sessions open. No enforced logout. That's handled by JWT by design.
      • UX should be fluid: no unexpected behaviour, no 404 pages, no errors, no missing messages in l10n bundles

      Production infrastructure

      • the green cluster should be configured with new IPs or ports
      • the Elasticsearch nodes are connected to by two versions of servers (web + Compute Engine). The Hazelcast constraints should be relaxed. A solution is to remove the Elasticsearch nodes from the Hazelcast cluster.
      • the HTTP traffic entry point should be able to atomically switch requests from the blue to the green cluster. HTTP requests being processed by the blue server should not be interrupted; they may return after the effective switch.
      • the green cluster should be validated and optionally warmed up before switching traffic.

      Build pipeline

      Blue-green deployment should be disabled for milestones that do not meet the conditions.
      Whether blue-green deployment is supported should not be guessed by developers when sending the deployment request. It should be automatically computed in a safe way.

      HOW

      Server startup tasks

      The following tasks should not break the compatibility of data stores with the previous blue version:

      • registration of rules
      • registration of built-in Quality profiles
      • registration of metrics
      • registration of built-in Quality gates
      • registration of permission templates

      Scanner engine

      The cache of the engine JAR file should support the version switch. That's not possible with the current workflow:

      1. GET /batch/index returns sonar-scanner-engine-shaded-7.2.0.1234-all.jar|9db3dcc366e04f93376aa2f1a99b420d when requesting the blue version
      2. if the file with checksum 9db3dcc366e04f93376aa2f1a99b420d is not in the local cache, then it's downloaded from /batch/file?name=sonar-scanner-engine-shaded-7.2.0.1234-all.jar. This filename no longer exists in the green version.
      3. the downloaded file is cached with the checksum defined in the first step

      The solution is to slightly change the workflow:

      1. GET /batch/index returns sonar-scanner-engine-shaded-all.jar|9db3dcc366e04f93376aa2f1a99b420d when requesting the blue version.
      2. if the file with checksum 9db3dcc366e04f93376aa2f1a99b420d is not in the local cache, then it's downloaded from /batch/file?name=sonar-scanner-engine-shaded-all.jar. This filename exists in the green version, even if it does not have the expected checksum.
      3. the checksum of the downloaded file is read from a header of the HTTP response and used to cache the file. It could also be used to verify the integrity of the download.

      The third step requires releasing scanners. Fortunately, it can be considered an optimization: the workflow works without it, the engine JAR will just be downloaded twice in a row. A sketch of the revised caching logic follows.
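
      A sketch of the revised caching logic, assuming Java 11+ java.net.http, a hypothetical X-Checksum response header (the actual header name is not defined by this MMF) and a local cache keyed by checksum (also an assumption):

        import java.io.IOException;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;

        public class ScannerEngineCache {

          private final HttpClient http = HttpClient.newHttpClient();
          private final Path cacheDir;
          private final URI serverBaseUrl;

          ScannerEngineCache(Path cacheDir, URI serverBaseUrl) {
            this.cacheDir = cacheDir;
            this.serverBaseUrl = serverBaseUrl;
          }

          /**
           * Returns the engine JAR matching the checksum advertised by GET /batch/index.
           * If the cached copy is missing, the JAR is downloaded by its stable name and
           * cached under the checksum provided in the response header, so a blue/green
           * switch between the two calls cannot break the cache.
           */
          Path resolveEngine(String advertisedChecksum) throws IOException, InterruptedException {
            Path cached = cacheDir.resolve(advertisedChecksum).resolve("sonar-scanner-engine-shaded-all.jar");
            if (Files.exists(cached)) {
              return cached;
            }
            HttpRequest request = HttpRequest.newBuilder(
                serverBaseUrl.resolve("/batch/file?name=sonar-scanner-engine-shaded-all.jar"))
              .build();
            Path tmp = Files.createTempFile("scanner-engine", ".jar");
            HttpResponse<Path> response = http.send(request, HttpResponse.BodyHandlers.ofFile(tmp));
            // Hypothetical header carrying the checksum of the file actually served; it may
            // differ from advertisedChecksum if the switch happened between the two calls.
            String actualChecksum = response.headers()
              .firstValue("X-Checksum")
              .orElse(advertisedChecksum);
            Path target = cacheDir.resolve(actualChecksum).resolve("sonar-scanner-engine-shaded-all.jar");
            Files.createDirectories(target.getParent());
            return Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
          }
        }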

      Scanner plugins

      The cache of scanner plugins should support the version switch. That's not possible with the current workflow:

      1. GET /api/plugins/installed returns the list of plugins, including filename and checksum, installed on the blue version
      2. GET /deploy/plugins/<plugin key>/<filename> is requested for each uncached plugin, for example /deploy/plugins/scmgit/sonar-scm-git-plugin-1.4.0.1037.jar. If the request reaches the green cluster, then the file may not exist.
      3. the downloaded file is cached with the checksum defined in the first step

      The solution is to slightly change the workflow:

      1. GET /api/plugins/installed
      2. call GET /api/plugins/download?key=<plugin key> for each uncached plugin. This request is supported by both blue and green versions.
      3. the checksum of the downloaded file is read from a header of the HTTP response and used to cache the file. It could also be used to verify the integrity of the download.

      Contrary to the scanner engine case, the third step does not require releasing scanners: the algorithm is part of the scanner engine. A sketch of the key-based download follows.
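
      A sketch of the key-based download, under the same assumptions as the scanner engine sketch above; the InstalledPlugin record is an illustrative projection of the /api/plugins/installed payload:

        import java.io.IOException;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;

        public class ScannerPluginCache {

          // Hypothetical view of one entry of GET /api/plugins/installed (key and checksum
          // are the only fields this sketch needs).
          record InstalledPlugin(String key, String checksum) {}

          private final HttpClient http = HttpClient.newHttpClient();
          private final Path cacheDir;
          private final URI serverBaseUrl;

          ScannerPluginCache(Path cacheDir, URI serverBaseUrl) {
            this.cacheDir = cacheDir;
            this.serverBaseUrl = serverBaseUrl;
          }

          /** Downloads every plugin that is not already cached, addressing it by key so
           *  that the request stays valid whether it reaches the blue or the green cluster. */
          void syncCache(List<InstalledPlugin> installed) throws IOException, InterruptedException {
            for (InstalledPlugin plugin : installed) {
              Path cached = cacheDir.resolve(plugin.checksum()).resolve(plugin.key() + ".jar");
              if (Files.exists(cached)) {
                continue;
              }
              HttpRequest request = HttpRequest.newBuilder(
                  serverBaseUrl.resolve("/api/plugins/download?key=" + plugin.key()))
                .build();
              Files.createDirectories(cached.getParent());
              // As for the engine JAR, the checksum of the file actually served could be read
              // from a response header; here the advertised checksum is used as the cache key.
              http.send(request, HttpResponse.BodyHandlers.ofFile(cached));
            }
          }
        }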

      Analyzer rule engines

      An analyzer in the blue version should not fail when a rule introduced in the green version is activated. This is achieved by filtering the rules returned by GET api/rules/search.protobuf?activation=true&qprofile=xxx, which is called for each Quality profile. The filter compares the creation date of the rule with the installation date of the plugin. If the downloaded plugin is older than the rule, then the rule is ignored. That implies storing the date of plugin installation and the date of creation of rules (the cluster startup date, not the technical date of db insertion). A sketch of the filter follows.
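
      A minimal sketch of the filter, assuming both dates are available server-side; names and data shapes are illustrative only, and pluginInstallDates is assumed to hold, per plugin key, the installation date of the plugin version the scanner downloaded:

        import java.time.Instant;
        import java.util.List;
        import java.util.Map;
        import java.util.stream.Collectors;

        public class RuleCompatibilityFilter {

          // Illustrative projection of the data the web service would need.
          record Rule(String key, String pluginKey, Instant createdAt) {}

          /**
           * Keeps only the rules that already existed when the plugin JAR downloaded by the
           * scanner was installed: a rule created after that installation belongs to the
           * green analyzer and must not be sent to a blue scanner. Both dates are cluster
           * startup dates, not technical db insertion dates.
           */
          static List<Rule> visibleToScanner(List<Rule> activeRules,
                                             Map<String, Instant> pluginInstallDates) {
            return activeRules.stream()
              .filter(rule -> {
                Instant pluginInstalledAt = pluginInstallDates.get(rule.pluginKey());
                return pluginInstalledAt != null && !rule.createdAt().isAfter(pluginInstalledAt);
              })
              .collect(Collectors.toList());
          }
        }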

      Compute Engine Lifecycle

      Compute Engine should be paused during the whole deployment. That requires three new web services (a worker-side sketch follows the list):

      • POST api/ce/pause triggers the pause and returns immediately. The information is stored in the internal_properties table. Workers stop popping the queue. In-progress tasks are finished and are not canceled. Pending tasks remain in the queue.
      • GET api/ce/status returns the status (starting/pausing/paused/running/resuming/resumed/stopping/stopped)
      • POST api/ce/resume triggers the resume.

      See the "Deployment workflow" section for more details.
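
      A sketch of how a worker could honour the pause flag; the property key and the interfaces over ce_queue and internal_properties are illustrative, not the actual Compute Engine code:

        import java.util.Optional;

        public class CeWorkerLoop {

          // Hypothetical abstractions over the internal_properties and ce_queue tables.
          interface InternalProperties {
            Optional<String> read(String key);
          }
          interface CeQueue {
            Optional<Runnable> popPendingTask();
          }

          static final String PAUSE_KEY = "ce.pause"; // illustrative property key

          private final InternalProperties internalProperties;
          private final CeQueue queue;

          CeWorkerLoop(InternalProperties internalProperties, CeQueue queue) {
            this.internalProperties = internalProperties;
            this.queue = queue;
          }

          /**
           * One iteration of a worker: when the pause flag is set, the worker stops popping
           * the queue (pending tasks stay where they are), while any task already started
           * elsewhere keeps running to completion.
           */
          void processOneTask() {
            boolean paused = internalProperties.read(PAUSE_KEY)
              .map(Boolean::parseBoolean)
              .orElse(false);
            if (paused) {
              return; // paused: do not pop, leave pending tasks in the queue
            }
            queue.popPendingTask().ifPresent(Runnable::run);
          }
        }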

      Compute Engine Resiliency

      An analysis task should not fail when (see the sketch after this list):

      • the analysis report contains issues on rules that have been disabled in the green version. These issues are ignored.
      • the analysis report references a Quality profile that has been dropped from the green version. The reference to the profile name should be kept. Note that this use case already occurs when Quality profiles are edited during an analysis.
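
      A sketch of both behaviours, with illustrative record types standing in for the report data:

        import java.util.List;
        import java.util.Map;
        import java.util.Optional;
        import java.util.Set;
        import java.util.stream.Collectors;

        public class ReportResiliency {

          // Illustrative projections of data read from the analysis report.
          record ReportIssue(String ruleKey, String message) {}
          record ReportProfile(String key, String name) {}

          /** Silently drops issues raised on rules that no longer exist in the green version. */
          static List<ReportIssue> keepKnownRules(List<ReportIssue> issues, Set<String> existingRuleKeys) {
            return issues.stream()
              .filter(issue -> existingRuleKeys.contains(issue.ruleKey()))
              .collect(Collectors.toList());
          }

          /** Keeps the profile name recorded in the report when the profile has been dropped. */
          static String resolveProfileName(ReportProfile fromReport, Map<String, String> existingProfileNamesByKey) {
            return Optional.ofNullable(existingProfileNamesByKey.get(fromReport.key()))
              .orElse(fromReport.name());
          }
        }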

      Webapp reloading

      The webapp loaded in browsers should be reloaded after the version switch in order to ensure compatibility with web services. Public web services are supposed to be backward-compatible, but internal web services are not, so a blue webapp may be incompatible with green web services.
      Reloading is triggered:

      • every x minutes by a JS daemon scheduler if it detects a version change. The scheduler calls GET api/server/version.
      • when the webapp fails to download the JS file of a page that has not been loaded yet. This is possible thanks to the unique file names of web assets (js and css) generated during the build.

      Plugin assets do not yet follow the unique file name strategy. This must be improved in order to support the loading of the new version of the billing page.

      High Availability Cluster

      The Elasticsearch nodes are not part of the blue-green deployment. They are considered a data store that is always up. Upgrading them is out of the scope of this MMF.

      The Elasticsearch nodes are part of both the blue and the green clusters during the switch. That is not supported by the HA configuration for the time being. The solution is to simply remove the Elasticsearch nodes from the Hazelcast cluster. The properties sonar.cluster.node.host and sonar.cluster.node.port should be removed from these search nodes.

      Deployment workflow

      1. the build pipeline generates a zip artifact of the green installation. It is a SonarQube zip completed with the target plugins copied into the extensions/plugins directory.
      2. ensure that blue-green deployment is supported by checking the conditions when starting a green distribution on the blue data stores.
      3. install the application nodes of the green cluster
      4. POST api/ce/pause on the blue cluster
      5. wait for GET api/ce/status to return the status "paused"
      6. start the green application nodes
      7. validate that the green cluster is operational by browsing pages
      8. stop the blue cluster
      9. call POST api/ce/resume

      In case of failure, the green cluster is uninstalled and Compute Engine is resumed. A sketch of steps 4, 5 and 9 follows.
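
      A sketch of the pause/status/resume calls driven by the deployment tooling, assuming for simplicity that api/ce/status returns the status as plain text (a real client would parse the JSON payload) and that authentication is handled elsewhere:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;

        public class ComputeEnginePauser {

          private final HttpClient http = HttpClient.newHttpClient();

          /** Steps 4 and 5: pause Compute Engine on the blue cluster and wait for "paused". */
          void pauseAndWait(URI blueBaseUrl) throws Exception {
            post(blueBaseUrl.resolve("/api/ce/pause"));
            while (!"paused".equals(getStatus(blueBaseUrl))) {
              Thread.sleep(5_000);
            }
          }

          /** Step 9 (or the rollback path): resume Compute Engine on the cluster now serving traffic. */
          void resume(URI baseUrl) throws Exception {
            post(baseUrl.resolve("/api/ce/resume"));
          }

          private String getStatus(URI baseUrl) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(baseUrl.resolve("/api/ce/status")).GET().build();
            // assumes the status is returned as a plain string
            return http.send(request, HttpResponse.BodyHandlers.ofString()).body().trim();
          }

          private void post(URI uri) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(uri).POST(HttpRequest.BodyPublishers.noBody()).build();
            http.send(request, HttpResponse.BodyHandlers.discarding());
          }
        }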

      Notes

      • the blue-green deployment should be applied to continuously deploy the dogfood-on-next branch on https://next.sonarqube.com if no db migrations are required.

      People

      • Assignee: Xavier Bourguignon
      • Reporter: Xavier Bourguignon