For all non-possessive unbounded outer repetitions in the regex (repetitions nested inside other repetitions don't have to be considered because their stack consumption can't be worse than that of the outer repetition that contains them) check whether they contain any disjunction or branch states and if so, calculate their stack consumption factor as follows ():
- Calculate the shortest path from the body node of the repetition to its continuation (i.e. from the beginning of the body to the end)
- Count all epsilon transitions except ones between two PlainCharacterTrees (because those will be represented as a single node in the Java engine). This is the number of recursions performed per iteration of the loop (approximately because our automata don't have exactly the same nodes as Java's) when consuming the least amount of characters.
- Count all character and back reference edges. This is the least amount of characters consumed per iteration of the loop (approximately in case of back references because a back reference might refer to an expression consuming more than one character, but we'll live with inaccuracy here).
- The stack consumption factor is the number of recursions divided by the number of consumed characters.
If the stack consumption factor for a given repetition is above the configured maximum, report an issue.
After implementation the calculated factor should be compared to real results of Java's regex engine and an appropriate default for the maximum should be chosen based on these tests. To perform these tests the following Java program can be used, which counts the number of characters of input that can be matched before a stack overflow occurs.
Make sure that the provided input actually matches the regex or the match will abort early and the results will be meaningless. Also make sure that the provided input matches in such a way that the shortest path through the loop is taken. The tests should be run with JITting disabled (-Xint). Results with JITting enabled are comparable but less deterministic.
Here are some results using a stack size of 1MB using openjdk 11 on Linux:
When calculating the shortest path, edges should be handled like this:
- Epsilon edges have weight 0
- Character edges have weight 1
- The weight of a back reference edge should be the shortest path between the capturing group node to which it refers and that node's continuation. The calculation of this should be cached, so that multiple back references to the same group don't calculate that distance multiple times.
- If a branch node belongs to a repetition with a minimum number greater than 1, its outgoing edge that does not go back to the repetition should have its weight increased by the minimum distance between the repetition and its continuation times min - 1. Again this distance should be cached.
- Negation edges should not be followed.
- LookAroundBacktracking edges should also not be followed as they would represent a variable decrease in the path length and we can't handle that. Instead LookAroundTrees should be treated as if they had a transition directly to their continuation. This way a sensible path of the correct length will be calculated to states that come after the lookaround as well as states inside a positive look ahead. States inside look behind won't be considered reachable, but that's okay because we wouldn't provide a path of a correct length for them anyway.
We should also use a counter that makes the rule abort early if we spend too long analyzing a single regex because regexen where every other node is a back reference or a repetition with min > 1 could lead to cubic run time.