Regex Injection

Regular expressions (regex) are a way of describing the order and type of characters that occur in a string. They are often used to validate input or search for “wildcard” matches within a set of strings. If the regular expression (rather than the string it is testing) is generated from untrusted input, or a regex that exists in your codebase is poorly designed, an attacker can perform a regex injection attack by sending malicious input that will take a huge amount of computing power to evaluate.

Risks

Prevalence Common
Exploitability Easy
Impact Worrying

Regex injection is often used to perform denial-of-service attacks on vulnerable web-servers.

Regex Injection

The following code samples show a web-server that naively allows a “wildcard” search expression to be sent from the client, and evaluates the string as a regular expression against a list of potential matches:

@app.route('/search/<pattern>')
def search(pattern):
  # Don't evaluate regexes passed over HTTP!
  regex   = re.compile(pattern)
  matches = (item for item in ITEMS if regex.match(item))

  return matches
public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException
{
  // Don't evaluate regexes passed over HTTP!
  String search = request.getParameter("search");

  List<String> matches = ITEMS.stream()
                              .filter(item -> item.matches(search))
                              .collect(Collectors.toList());

  response.setContentType("application/json");

  PrintWriter out = response.getWriter();
  out.write(new Gson().toJson(matches));
  out.flush();
}
app.get('/search', (request, response) => {
  // Don't evaluate regexes passed over HTTP!
  const regex   = new RegExp(request.query.search)
  const matches = items.filter(item => {
    return item.match(regex)
  })

  response.send(
    `<div>
       <h1>Matches</h1>
       <p>${matches.join('</p><p>')}</p>
       <a href="/">Back</a>
     </div>`)
})
public IActionResult Search(string search)
{
    // Don't evaluate regexes passed over HTTP!
    var regex   = new Regex(search);
    var matches = Items.Where(item => regex.IsMatch(item)).ToList();

    return Json(matches);
}

Regular expression matching can be very slow if the expression matching engine has to perform a lot of backtracking. This occurs when the regular expression contains repeated matching groups, each of which contains a repeating symbol - which means the engine has to evaluate exponentially more different logical branches while scanning a string.

It this case, an “evil” regex in the following form can be supplied:

  (.*a){20}

This pattern means “twenty occurrences of: zero or more characters followed by the letter a”. This expression will require an enormous amount of compute time to evaluate against a string such as:

  aaaaaaaaaaaaaaaaaaaa!

Sending many such search requests to your server gives an attacker an easy way to perform a denial-of-service attack.

Regex Patterns in Validation

Attackers can take advantage of inefficient regexes even when they do not have control of the form of the regular expression itself. By passing a maliciously crafted “email” parameter to a sign-up page, for instance, they can probe for slow-running validation expressions and attempt to take your website offline.

Mitigation

  • Don’t generate regular expressions directly from untrusted input - define them in your codebase.
  • Use a search index like Elasticsearch or Lucene for complex searches, rather than running regular expression matches on large datasets.
  • Check any regexes within your codebase for repeating grouped patterns or ambiguous patterns. “Catastrophic” backtracking can be avoided if you follow these rules of thumb:

    • Avoid nested quantifiers like (a+)+, where a pattern potentially matching multiple characters can be applied multiple times.
    • Avoid quantified overlapping disjunctions like (a|a)+.
    • Avoid quantified overlapping adjacencies like \d+\d+.
Automatically Detecting Unsafe Regular Expressions

The Node.js library safe-regex library can be used to detect unsafe regular expressions:

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true

Make sure to test new expressions you define them in your codebase, and consider adding a unit test to scan for unsafe regular expressions if you develop in Node.js.

Further Reading