Uploaded image for project: 'SonarJava'
  1. SonarJava
  2. SONARJAVA-3549

Add support for automata-based analyses for regular expressions

    XMLWordPrintable

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.10
    • Component/s: Rules
    • Labels:

      Description

      In order to analyze control flow within regular expressions, they should be represented as an "AST extended with continuation pointers" model as is used by some reDoS analyzers and similar to the representation used in Java's regex engine. This model provides the features of a prioritized (order of alternatives matters) NFA while still providing access to the AST structure. Since this kind of automaton is just an AST with extra pointers, we should extend the current AST classes to support it and then generate it directly in the parser rather than having separate regex -> AST -> automaton phases.

      For example:

      • Java regular expression:
        "a|b"
      • AST model is:
      • Automaton:

      To represent this, each AST class should have a member continuation that points to whichever state should be matched after this one is matched successfully. To be able to easily and intuitively use this as an NFA, we should also add a method successors which returns the states that can directly follow a given state. With these methods, we get the structure of an NFA, while requiring no changes to existing rules that can simply continue using the existing structure without changes.

      For easy construction in the parser, the continuation member should be null when constructed and then set using a setter method. However, after parsing the states should be considered immutable and the setter should not be called anymore.

      To represent some aspects of Java regexen, we need synthetic states that don't correspond directly to parts of the AST. Therefore we should introduce an interface AutomatonState such that RegexTree implements AutomatonState, all AutomatonState s have a continuation and a list of successors (which are also AutomatonState s) and RegexTree s have children (which are RegexTree s). The children member can also be used to simplify the implementation BaseRegexTreeVisitor (which should still only visit RegexTree s - the synthetic states should not be visible when iterating a regex's syntactic structure only when iterating the states of the automaton via successors and/or continuation). The following synthetic AutomatonState s should be created:

      • StartState represents the initial state of the automaton. Its continuation and successor will be the RegexTree representing the whole regex. It will be accessible through the RegexParseResult object created by the parser.
      • Branch represents a branch in the control flow that does not correspond to the direct use of the | operator. The difference between Branch and Disjunction is that the latter corresponds directly to a syntax construct (and thus inherits RegexTree) while the latter does not. Therefore both should implement a common interface, so they can be handled as the same thing by rules that deal with automata states.
      • EndOfRegex represents the end of the regex being reached successfully. It's the automaton's only accepting state. Its continuation should be null.
      • EndOfLookAround represents the end of a look around group. Reaching the end of a positive look around will cause the match to continue at the point where the look around began. It should contain a reference to the LookAroundTree whose end it represents.
      • Negation introduces a negated section of the automaton (created by negative lookarounds). There should be no transitions from the negated section back to the non-negated part (which makes it important that EndOfLookAround has no continuation/successors).

      Each AutomatonState should have an transitionType to represent the type of its incoming transitions (all incoming transitions of a state will be of the same type (and with the same label) in this representation). The following transition types should exist:

      • Epsilon for epsilon transitions, which are unconditional and consume no input. This is the transition type of most nodes.
      • Character for transitions that consume a single character and may require a specific character or one of a set of characters (it may also be unconditional in case of .). This is the transition type of nodes that consume a single character, such as PlainCharacterTree, CharacterClassTree etc.
      • BackReference which requires the current input to match a string saved in a capturing group and consumes an amount of input equal to the saved string's length. This is the transition type of BackReferenceTree s.

      • LookAroundBacktracking which represents the fact that this transition will move back in the input. It's the transition type of LookAroundTree s that represent look behinds and of {{EndOfLookAround}}s that represent look aheads.
        Look behind:

        Look ahead:
      • Negation - the transition type of Negation states.

      The continuations and successors of states should be set as follows by the parser:

      • Only EndOfRegex should have null as its continuation and an empty list of successors.
      • The EndOfRegex state should be the continuation of the top-level regex.
      • The children of a CharacterClassTree are ignored. Its successor is its continuation.
      • Each element in a sequence except the last element should have the following node as its continuation.
        The continuation of the last element should be the same as that of the sequence.
      • The successor of a sequence is its first element.
      • The continuation of a (non-lookaround) group's contents should be the same as that of the group.
      • The continuation of a lookaround's contents should be an EndOfLookAround state.
      • The continuation of an EndOfLookAround state should be the same as the continuation of the corresponding LookAroundTree.
      • The continuation of a disjunction's elements should be the same as the continuation of the disjunction itself.
      • The successor of a "*" repetition should be a branch containing the body of the repetition and the continuation of the repetition as its alternatives.
        The continuation of its operand should be the repetition itself (not its continuation).
        Greedy:

        Possessive:

        Reluctant:
      • The successor of a "+" repetition should be its body. The continuation of its child should be a branch containing the repetition itself (not its continuation) and its successor.
        Greedy:

        Possessive:

        Reluctant:
      • The successor of a "?" should be a branch containing the operand and the continuation of the "?". The continuation of its operand should be the same as the continuation of the optional.
        Greedy:

        Possessive:

        Reluctant:
      • {0,1}

        same as "?", {0,m} with m>1 or {0,} same as "*", {n,m} with n>0, m>1 or {n,} same as "+"

      • {0,0}

        should probably only have its continuation as its successor (child unreachable) and there should be a code smell rule identifying this construct as nonsense. Or we could ignore it because nobody's going to write that.

      • For greedy and possessive repetitions/options the associated branch node
      • The continuation of a branch node should be the same as the continuation of the repetition or optional node for which it was created.
      • Unless otherwise stated, the successors of each node with children should be the same as the list of its children. If it has no children, its only successor should be its continuation (note that the continuation won't be part of the successors when there are children (unless otherwise stated)).

        Attachments

        1. ast1.png
          ast1.png
          65 kB
        2. back-reference.png
          back-reference.png
          115 kB
        3. back-reference-name.png
          back-reference-name.png
          122 kB
        4. backtracking-ahead.png
          backtracking-ahead.png
          143 kB
        5. backtracking-behind.png
          backtracking-behind.png
          143 kB
        6. branch1.png
          branch1.png
          52 kB
        7. branch1.png
          branch1.png
          56 kB
        8. character-class.png
          character-class.png
          26 kB
        9. disjunction.png
          disjunction.png
          24 kB
        10. disjunction-continuation.png
          disjunction-continuation.png
          85 kB
        11. doc-1-description.state.png
          doc-1-description.state.png
          29 kB
        12. doc-1-description.tree.png
          doc-1-description.tree.png
          23 kB
        13. doc-1-description-start.png
          doc-1-description-start.png
          3 kB
        14. doc-2-synthetic-states-branch.state.png
          doc-2-synthetic-states-branch.state.png
          26 kB
        15. doc-2-synthetic-states-end-of-look-around.state.png
          doc-2-synthetic-states-end-of-look-around.state.png
          34 kB
        16. doc-2-synthetic-states-negation.state.png
          doc-2-synthetic-states-negation.state.png
          39 kB
        17. doc-3-transition-type-back-reference-named.state.png
          doc-3-transition-type-back-reference-named.state.png
          51 kB
        18. doc-3-transition-type-back-reference-numerical.state.png
          doc-3-transition-type-back-reference-numerical.state.png
          45 kB
        19. doc-3-transition-type-character.state.png
          doc-3-transition-type-character.state.png
          46 kB
        20. doc-3-transition-type-epsilon.state.png
          doc-3-transition-type-epsilon.state.png
          47 kB
        21. doc-3-transition-type-look-around-backtracking-ahead.state.png
          doc-3-transition-type-look-around-backtracking-ahead.state.png
          55 kB
        22. doc-3-transition-type-look-around-backtracking-behind.state.png
          doc-3-transition-type-look-around-backtracking-behind.state.png
          56 kB
        23. doc-3-transition-type-negation.state.png
          doc-3-transition-type-negation.state.png
          5 kB
        24. doc-4-rules-disjunction.state.png
          doc-4-rules-disjunction.state.png
          27 kB
        25. doc-4-rules-end-of-look-around.state.png
          doc-4-rules-end-of-look-around.state.png
          55 kB
        26. doc-4-rules-end-of-regex.state.png
          doc-4-rules-end-of-regex.state.png
          28 kB
        27. doc-4-rules-ignored-states.tree-state.png
          doc-4-rules-ignored-states.tree-state.png
          38 kB
        28. doc-4-rules-optional-greedy.state.png
          doc-4-rules-optional-greedy.state.png
          24 kB
        29. doc-4-rules-optional-possessive.state.png
          doc-4-rules-optional-possessive.state.png
          30 kB
        30. doc-4-rules-optional-reluctant.state.png
          doc-4-rules-optional-reluctant.state.png
          24 kB
        31. doc-4-rules-plus-greedy.state.png
          doc-4-rules-plus-greedy.state.png
          28 kB
        32. doc-4-rules-plus-possessive.state.png
          doc-4-rules-plus-possessive.state.png
          37 kB
        33. doc-4-rules-plus-reluctant.state.png
          doc-4-rules-plus-reluctant.state.png
          29 kB
        34. doc-4-rules-sequence.tree-state.png
          doc-4-rules-sequence.tree-state.png
          32 kB
        35. doc-4-rules-star-greedy.state.png
          doc-4-rules-star-greedy.state.png
          24 kB
        36. doc-4-rules-star-possessive.state.png
          doc-4-rules-star-possessive.state.png
          30 kB
        37. doc-4-rules-star-reluctant.state.png
          doc-4-rules-star-reluctant.state.png
          25 kB
        38. doc-4-rules-zero-zero.state.png
          doc-4-rules-zero-zero.state.png
          13 kB
        39. end-of-look-around.png
          end-of-look-around.png
          106 kB
        40. end-of-regex.png
          end-of-regex.png
          15 kB
        41. epsilon.png
          epsilon.png
          38 kB
        42. first-tree-continuation.png
          first-tree-continuation.png
          86 kB
        43. ignored-tree.png
          ignored-tree.png
          107 kB
        44. look-around-continuation.png
          look-around-continuation.png
          110 kB
        45. negation.png
          negation.png
          112 kB
        46. negation-small.png
          negation-small.png
          13 kB
        47. optional-successor-greedy.png
          optional-successor-greedy.png
          71 kB
        48. optional-successor-possessive.png
          optional-successor-possessive.png
          94 kB
        49. optional-successor-reluctant.png
          optional-successor-reluctant.png
          73 kB
        50. plus-successor-greedy.png
          plus-successor-greedy.png
          83 kB
        51. plus-successor-possessive.png
          plus-successor-possessive.png
          89 kB
        52. plus-successor-reluctant.png
          plus-successor-reluctant.png
          82 kB
        53. sequence.png
          sequence.png
          88 kB
        54. star-successor-greedy.png
          star-successor-greedy.png
          65 kB
        55. star-successor-possessive.png
          star-successor-possessive.png
          89 kB
        56. star-successor-reluctant.png
          star-successor-reluctant.png
          66 kB
        57. start.png
          start.png
          8 kB
        58. state1.png
          state1.png
          94 kB
        59. zero-repetition.png
          zero-repetition.png
          125 kB

          Issue Links

            Activity

              People

              Assignee:
              alban.auzeill Alban Auzeill
              Reporter:
              sebastian.hungerecker Sebastian Hungerecker
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Due:
                Created:
                Updated:
                Resolved: