In Language Team we often feel the lack of understanding of how our analyzers are really used and what value exactly they bring. Having that knowledge will not only allow us to improve the quality of existing features (e.g. fixing FPs, improving performance), but also guide us when thinking about new features and taking decisions about existing ones. Moreover, having concrete data and its trends over time can show team performance in some way and be considered as a measure of achievement. Note that currently we develop almost in isolation from user feedback, only getting bug reports and reacting to them. We want to be proactive and improve the quality of analyzers before some user points at a problem.
- Are there false positives, and how important is it to fix them?
- Which rules should we adjust, and how, to maximize their value?
- Fixing performance problems
- Activating/deactivating rules in the default profile
- The need for more rules of some nature
- Is the default rule parameter value a good one?
- Prioritizing the work to be done next
Note that with this MMF we only start the team initiative "Data-Driven Decisions"; it's a first baby step. We expect more things to come later this year.
There is a parallel project in Language Team, "Roulette", which aims to compute similar metrics, but based on input from SonarSource itself rather than real users.
At a high level, we want to regularly collect data about rules and issues from SonarCloud, and store it in a way that lets us apply analytics on top of it.
We chose SonarCloud as it is used by real users and we have full control over it (including the versions of plugins).
- We want to have an up-to-date status every day
- We want to collect data only for "active" projects (to make sure more or less recent versions of the analyzers were used). A project counts as "active" based on:
  - last analysis date (~2w; experiment with the value)
  - manual changes on issue status
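The "active project" filter above can be sketched as a small predicate. This is a minimal sketch under assumptions: the two-week window, the function name, and the input fields are all hypothetical, not an agreed design.

```python
from datetime import datetime, timedelta
from typing import Optional

# "~2w" from the criteria above; the value is meant to be experimented with.
ACTIVITY_WINDOW = timedelta(weeks=2)

def is_active(last_analysis: datetime,
              last_issue_status_change: Optional[datetime],
              now: datetime) -> bool:
    """A project counts as "active" if it was analyzed within the window,
    or an issue status was changed manually within the window."""
    cutoff = now - ACTIVITY_WINDOW
    if last_analysis >= cutoff:
        return True
    return (last_issue_status_change is not None
            and last_issue_status_change >= cutoff)
```

Keeping the window as a single constant makes it easy to re-run the collection with different values while we experiment.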
- We want to be able to differentiate data coming from private and public projects
So, finally, here is what we want to collect:
- For each language
  - number of "active" public and private projects
  - number of LOCs analyzed (only for "active" projects): private LOCs and public LOCs
- For each rule
  - number of projects it is executed on (private and public) -> lets us see when the rule was activated or deactivated
  - number of issues (on private and public projects)
    - overall number
    - distribution by resolution (unresolved, FP, won't fix, fixed, removed)
    - distribution by status (open, reopened, confirmed, resolved, closed)
Note: issues marked FP or won't fix both fall under the "resolved" status.
TBD: which branch to consider? The biggest? The most recently active? Or master?
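The per-rule issue counts above don't require paging through every issue: the SonarCloud Web API's `api/issues/search` endpoint supports a `facets` parameter that returns aggregate counts. A minimal sketch, assuming a public project (authentication for private projects is omitted); the helper names are ours:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SONARCLOUD = "https://sonarcloud.io"

def facet_counts(payload: dict, prop: str) -> dict:
    """Turn one facet of an issues/search response into {value: count}."""
    for facet in payload.get("facets", []):
        if facet["property"] == prop:
            return {v["val"]: v["count"] for v in facet["values"]}
    return {}

def fetch_resolution_counts(project_key: str) -> dict:
    """Fetch issue counts grouped by resolution for one project.

    With ps=1 we download almost no issue payload but still get the
    aggregated facet values (e.g. FALSE-POSITIVE, WONTFIX counts).
    """
    query = urlencode({
        "componentKeys": project_key,
        "facets": "resolutions",
        "ps": 1,  # we only need the facet counts, not the issues themselves
    })
    with urlopen(f"{SONARCLOUD}/api/issues/search?{query}") as resp:
        return facet_counts(json.load(resp), "resolutions")
```

The same call with `facets=statuses` (or `rules`) would cover the status distribution and the per-rule breakdown.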
We want to be able to see these metrics based on the data above:
Everything in 3 views: overall, public projects only, and private projects only
- Rule X activation trend (ratio of projects number on which rule X is executed to overall number of "active" projects with that language)
- Per-language trend of the rules' "FP"/"won't fix"/"confirmed" rates (averaged by language and per rule; preferably on one graph to be able to compare)
- Density of issues per rule (number of issues by LOC)
Data Lake: Query by SQL