Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 8.3
- Component/s: Compute Engine, Database, Measures
- Labels: None
- Edition: Community
- Production Notes: None
Description
Problem
Customers running large instances on Oracle or SQLServer are complaining that background tasks take too long.
Upon analysis, it appears that the step responsible for persisting live measures is taking a significant share of that time (68 to 74% according to some samples analyzed by Services).
See this discuss thread for details on the feedback from Services.
This problem goes unnoticed at SonarSource as we are running all our instances (Next, Peach and SonarCloud) on PostgreSQL.
Current state
All live measures are persisted using an "upsert" approach.
Either the database provides a native UPSERT feature, or SQ simulates that behavior.
The simulation first runs a regular UPDATE and then, if that statement returns "0 rows updated", runs a regular INSERT.
Finally, the step deletes all live measures that have become irrelevant.
As of today, only PostgreSQL provides an UPSERT feature in all the versions SQ supports. Therefore, "upsert" is simulated on both Oracle and SQLServer.
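For illustration, a minimal JDBC sketch of this simulated upsert; the table and column names (live_measures, component_uuid, metric_uuid, value) are simplified and do not necessarily match the real schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class SimulatedUpsert {
  /** Tries an UPDATE first; falls back to an INSERT when no row was touched. */
  void upsert(Connection db, String componentUuid, String metricUuid, double value) throws SQLException {
    try (PreparedStatement update = db.prepareStatement(
        "UPDATE live_measures SET value = ? WHERE component_uuid = ? AND metric_uuid = ?")) {
      update.setDouble(1, value);
      update.setString(2, componentUuid);
      update.setString(3, metricUuid);
      // Checking the affected-row count is exactly what prevents the use of JDBC batch mode here.
      if (update.executeUpdate() > 0) {
        return;
      }
    }
    // "0 rows updated": the measure does not exist yet, so do a regular INSERT.
    try (PreparedStatement insert = db.prepareStatement(
        "INSERT INTO live_measures (component_uuid, metric_uuid, value) VALUES (?, ?, ?)")) {
      insert.setString(1, componentUuid);
      insert.setString(2, metricUuid);
      insert.setDouble(3, value);
      insert.executeUpdate();
    }
  }
}
```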
Limitations of the simulated "upsert"
In the best-case scenario, if there are X live measures, X+1 SQL statements will be performed (X UPDATEs + 1 DELETE). In the worst-case scenario (e.g. first analysis), 2X+1 SQL statements will be performed (X UPDATEs + X INSERTs + 1 DELETE).
Because we are doing UPDATE statements and need to check how many rows have been updated, the batch mode provided by SQL drivers can't be used.
This implies that:
- a network round trip must be performed for each SQL statement
- DB can't optimize multiple statements put together in sequence
Note that the number of live measures increases with the number of components (files and directories) in the project. The number of live measures per component is roughly the same for all of them.
Put together, this means that the duration of this step increases with:
- the size of the project (big variation possible)
- the number of measures per component (small variation)
- the network latency
- the performance of each UPDATE and INSERT into table LIVE_MEASURES. This performance is affected by:
- the size of the LIVE_MEASURES table
- out of date table indices
- DB load which may come from many concurrent background tasks (per node and/or SQ DCE edition)
- possibly some DB tuning, hardware limitations, ...
Possible solutions
Reminder: goal of this step
Live measures represent the current state of the measures of a branch/PR of a project. They can be updated by the analysis when a new report comes in, but also as users interact with the UI (e.g. resolving an issue).
The goal of the step persisting live measures is to persist the latest version of the measures of the project computed by the analysis.
The analysis measures are the single source of truth about the state of the project at the time they are persisted. All live measures at the end of the step must be those of the analysis, which implies that:
- no other live measure is left in the DB
- any existing live measure is overwritten to match the one from the analysis
This is important as it guarantees that the live measures of a given branch have a means to be "eventually consistent".
Options
1. Upsert approach
It's particularly efficient with PostgreSQL:
- Up to 100 measures are "upserted" with a single statement
- batch mode can be used
- there is no worst-case scenario (such as a first analysis when simulating upsert)
By contrast, the simulated upsert is quite inefficient (see the "Limitations of the simulated "upsert"" paragraph above).
SQ supports several DB vendors and several versions of each of them.
Currently, as long as a single supported version of a given vendor does not provide an UPSERT feature, upsert is not used at all for that vendor.
Although technically challenging, upsert could be supported for specific versions of a vendor.
Note: since there is no standard, the upsert feature may also change from one version to another of the same DB.
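For reference, a sketch of what the native PostgreSQL upsert can look like through JDBC, using INSERT ... ON CONFLICT together with batch mode. The schema (live_measures, component_uuid, metric_uuid, value) and the unique index on (component_uuid, metric_uuid) are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

class PostgresUpsert {
  /** Upserts the measures of one component in a single batched round trip. */
  void upsert(Connection db, String componentUuid, Map<String, Double> valuesByMetric) throws SQLException {
    String sql = "INSERT INTO live_measures (component_uuid, metric_uuid, value) VALUES (?, ?, ?) "
        + "ON CONFLICT (component_uuid, metric_uuid) DO UPDATE SET value = EXCLUDED.value";
    try (PreparedStatement ps = db.prepareStatement(sql)) {
      for (Map.Entry<String, Double> measure : valuesByMetric.entrySet()) {
        ps.setString(1, componentUuid);
        ps.setString(2, measure.getKey());
        ps.setDouble(3, measure.getValue());
        ps.addBatch();
      }
      // One batch, one round trip: no worst-case scenario on first analysis.
      ps.executeBatch();
    }
  }
}
```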
2. delete all + mass inserts (per component)
Basic idea: for X live measures do 1 DELETE + X INSERTS. The delete could be done once for the entire branch, or once per component.
Pros:
- no worst-case scenario
- simple
- can leverage batch mode
Cons:
- may be hard on storage management and "garbage" collection on the DB side
- this argument does not apply to Postgres, where an update is implemented as a delete + an insert
- the situation with other DBs is unknown
- still directly proportional to the number of live measures
Results: batch mode performs very well, and this option brought significant improvements in execution time without much added complexity. There wasn't a significant difference in execution time between:
a) running batch mode across all components (letting auto-commit happen every 250 operations) with a single delete for the entire project
b) doing a delete per component and a commit after all insertions for that component
So option b) is clearly preferable, as it creates a single safe transaction per component.
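A minimal sketch of option b), under the same illustrative schema as above: one DELETE per component, batched INSERTs of the analysis measures, and one commit per component.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

class DeleteThenInsert {
  /**
   * For each component: delete its existing live measures, re-insert the measures
   * computed by the analysis in a single JDBC batch, then commit.
   */
  void persist(Connection db, Map<String, Map<String, Double>> measuresByComponent) throws SQLException {
    db.setAutoCommit(false);
    try (PreparedStatement delete = db.prepareStatement(
            "DELETE FROM live_measures WHERE component_uuid = ?");
         PreparedStatement insert = db.prepareStatement(
            "INSERT INTO live_measures (component_uuid, metric_uuid, value) VALUES (?, ?, ?)")) {
      for (Map.Entry<String, Map<String, Double>> component : measuresByComponent.entrySet()) {
        delete.setString(1, component.getKey());
        delete.executeUpdate();
        for (Map.Entry<String, Double> measure : component.getValue().entrySet()) {
          insert.setString(1, component.getKey());
          insert.setString(2, measure.getKey());
          insert.setDouble(3, measure.getValue());
          insert.addBatch();
        }
        insert.executeBatch();
        db.commit(); // one safe transaction per component
      }
    }
  }
}
```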
3. Query existing measures and insert or update accordingly
The idea here is to work with batches of measures (managed by the step itself).
For each full batch, first query which measures currently exist in the DB (for pairs of component/metric). Based on the result, update the existing measures and insert the missing ones.
Results: the SELECT is quite slow compared to the deletes and inserts. Moreover, a SELECT is not compatible with batch mode. Overall execution time was noticeably slower than option 2.
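A rough sketch of this approach under the same illustrative schema, simplified here to one batch per component; the SELECT is slow and cannot be batched, which is what made this option slower in practice:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SelectThenUpsert {
  /** Persists the measures of one component by querying what already exists in the DB. */
  void persist(Connection db, String componentUuid, Map<String, Double> valuesByMetric) throws SQLException {
    // 1) One SELECT to find which metrics already have a live measure for this component.
    Set<String> existing = new HashSet<>();
    try (PreparedStatement select = db.prepareStatement(
        "SELECT metric_uuid FROM live_measures WHERE component_uuid = ?")) {
      select.setString(1, componentUuid);
      try (ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
          existing.add(rs.getString(1));
        }
      }
    }
    // 2) Split the batch into UPDATEs and INSERTs accordingly.
    try (PreparedStatement update = db.prepareStatement(
            "UPDATE live_measures SET value = ? WHERE component_uuid = ? AND metric_uuid = ?");
         PreparedStatement insert = db.prepareStatement(
            "INSERT INTO live_measures (component_uuid, metric_uuid, value) VALUES (?, ?, ?)")) {
      for (Map.Entry<String, Double> measure : valuesByMetric.entrySet()) {
        if (existing.contains(measure.getKey())) {
          update.setDouble(1, measure.getValue());
          update.setString(2, componentUuid);
          update.setString(3, measure.getKey());
          update.addBatch();
        } else {
          insert.setString(1, componentUuid);
          insert.setString(2, measure.getKey());
          insert.setDouble(3, measure.getValue());
          insert.addBatch();
        }
      }
      update.executeBatch();
      insert.executeBatch();
    }
  }
}
```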
4. Multi-thread insertion of data
Assuming that most of the time is spent waiting for I/O, using multiple threads allows the CE to do more useful work, even while some of the threads are just waiting for I/O.
The improvement in execution time was very significant (roughly proportional to the number of threads). The added code complexity is also very manageable.
However, this assumes that the DB is not overloaded. If it is, this approach will make things worse, since it hits the DB with more data concurrently.
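A sketch of what the multi-threading could look like, assuming each worker takes its own connection from the pool and persists one component at a time so that the I/O waits overlap; the thread count and the persistComponent helper are hypothetical:

```java
import java.sql.Connection;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

class ParallelPersister {
  /** Persists components concurrently; each task gets its own pooled connection. */
  void persistAll(DataSource pool, List<String> componentUuids, int threadCount) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(threadCount);
    for (String componentUuid : componentUuids) {
      executor.submit(() -> {
        try (Connection db = pool.getConnection()) {
          persistComponent(db, componentUuid); // e.g. the delete + batched inserts of option 2
        } catch (Exception e) {
          throw new IllegalStateException("Failed to persist " + componentUuid, e);
        }
      });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS);
  }

  void persistComponent(Connection db, String componentUuid) throws Exception {
    // hypothetical: the per-component work described in option 2
  }
}
```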
Conclusions
Option 2 gave great results and is easy to implement and maintain.
Benchmark
Execution time of the CE tasks in ms, comparing master and master + SONAR-12928 + SONAR-13093.
| Oracle 12 / moodle | 1st analysis master | 2nd analysis master | 1st analysis w/ tickets | 2nd analysis w/ tickets |
| --- | --- | --- | --- | --- |
| persist live measures | 1102123 | 181800 | 252107 | |
| persist issues | 38849 | 21042 | 1673 | |
| visitors | 68124 | 99599 | 86770 | |
| duplication | 10160 | 8415 | 9484 | |
| sources | 15625 | 141748 | 17749 | |
| TOTAL | ** | 1254151 | 495540 | 385681 |
| PostgreSQL / moodle | 1st analysis master | 2nd analysis master | 1st analysis w/ tickets | 2nd analysis w/ tickets |
| --- | --- | --- | --- | --- |
| persist live measures | 46915 | 56331 | 44740 | 48725 |
| persist issues | 53708 | 38155 | 6202 | 17 |
| visitors | 32506 | 25449 | 23119 | 19814 |
| duplication | 653 | 624 | 885 | 713 |
| sources | 37228 | 15300 | 36534 | 14385 |
| TOTAL | 188049 | 149829 | 128747 | 98087 |
Future improvements
- With Oracle, there is a difference of ~70s when persisting live measures between the first and second analysis. This is the overhead of deleting measures before inserting them again, and it could be avoided if we find a way to do a real upsert with Oracle (a sketch follows this list).
- With Oracle JDBC drivers, every time we get a DB connection from the pool, the driver pings the DB to check that the connection is still valid. During the visitors step, this happens tens of thousands of times and introduces a delay of ~30s. It can be avoided by sharing the DB connection during this step.
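As an untested sketch of the first point, Oracle's MERGE statement can behave like an upsert and can be batched through JDBC; the schema and the uniqueness of (component_uuid, metric_uuid) are the same illustrative assumptions as above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

class OracleMergeUpsert {
  /** Upserts the measures of one component with batched MERGE statements. */
  void upsert(Connection db, String componentUuid, Map<String, Double> valuesByMetric) throws SQLException {
    String sql = "MERGE INTO live_measures lm "
        + "USING (SELECT ? AS component_uuid, ? AS metric_uuid, ? AS value FROM dual) src "
        + "ON (lm.component_uuid = src.component_uuid AND lm.metric_uuid = src.metric_uuid) "
        + "WHEN MATCHED THEN UPDATE SET lm.value = src.value "
        + "WHEN NOT MATCHED THEN INSERT (component_uuid, metric_uuid, value) "
        + "VALUES (src.component_uuid, src.metric_uuid, src.value)";
    try (PreparedStatement ps = db.prepareStatement(sql)) {
      for (Map.Entry<String, Double> measure : valuesByMetric.entrySet()) {
        ps.setString(1, componentUuid);
        ps.setString(2, measure.getKey());
        ps.setDouble(3, measure.getValue());
        ps.addBatch();
      }
      ps.executeBatch();
    }
  }
}
```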
Issue Links
- is related to
  - SONAR-13093 Optimize cache of issues in Compute Engine (Closed)
  - SUPPORT-16524
- links to