and has 2 comments

Hero swoops in shouting 'I know regular expressions'  Sometimes a good regular expression can save a lot of time and lead to a robust, yet flexible system that works very efficiently in terms of performance. It may feel like having superpowers. I don't remember when exactly I've decided they were useful, but it happened during my PHP period, when the Linux command line system and its multitude of tools made using regular expressions a pretty obvious decision. Fast forward (waaaay forward) and now I am a .NET developer, spoiled by the likes of Visual Studio and the next-next approach to solving everything, yet I still use regular expressions, well... regularly, sometimes even when I shouldn't. The syntax is concise, flexible and easy to use most of the time.

  And yet I see many senior developers avoiding regular expressions like they were the plague. Why is that? In this post I will argue that the syntax makes a regular pattern be everything developers hate: unreadable and hard to maintain. Having many of the characters common in both XML and JSON have special meaning doesn't help either. And even when bent on learning them, having multiple flavors depending on the operating system, language and even the particular tool you use makes it difficult. However, using small incremental steps to get the job done and occasionally look for references to less used features is usually super easy, barely an inconvenience. As for using them in your code, new language features and a few tricks can solve most problems one has with regular expressions.

 The Syntax Issue

The most common scenario for the use of regular expressions is wanting to search, search and replace or validate strings and you almost always start with a bunch of strings that you want to match. For example, you want to match both dates in the format 2020-01-21 and 21/01/2020. Immediately there are problems:

  • do you need to escape the slashes?
  • if you match the digits, are you going for two digit month and day segments or do you also accept something like 21/1/2020?
  • is there the possibility of having strings in your input that look like 321/01/20201, in which case you will still match 21/01/2020, but it's clearly not a date?
  • do you need to validate stuff like months being between 1-12 and days between 1-31? Or worse, validate stuff like 30 February?

But all of these questions, as valid as they are, can be boiled down to one: given my input, is my regular expression matching what I want it to match? And with that mindset, all you have to do is get a representative input and test your regular expression in a tester tool. There are many out there, free, online, all you have to do is open a web site and you are done. My favourite is RegexStorm, because it tests .NET style regex, but a simple Google search will find many and varied tools for reading and writing and testing regular expressions.

The syntax does present several problems that you will hit every time:

  • you cannot reuse parts of the pattern
    • in the example above, even if you have clearly three items that you look for - year, month, day - you will need to copy/paste the pattern for each variation you are looking for
  • checking the same part of the input string multiple times is not what regular expressions were created for and even those that support various methods to do that do it in a hackish way
    • example: find a any date in several formats, but not the ones that are in March or in 2017
    • look behind and look ahead expressions are usually used for scenarios like this, but they are not easy to use and reduce a lot of the efficiency of the algorithm
  • classic regular expression syntax doesn't support named groups, meaning you often need to find and maintain the correct index for the match
    • what index does one of the capturing groups have?
    • if you change something, how do other indexes change?
    • how do you count groups inside groups? Do they even count if they match nothing?
  • the "flags" indicating how the regular engine should interpret the pattern are different in every implementation
    • /x/g will look for all x characters in the string in Javascript
    • C# doesn't even need a global flag and the flags themselves, like CaseInsensitive, are .NET enums in code, not part of the regular pattern string
  • from the same class of issues as the two above (inconsistent implementation), many a tool uses a regular expression syntax that is particular to it
    • a glaring example is Visual Studio, which does not use a fully compatible .NET syntax

  The Escape Issue

The starting point for writing any regular expression has to be a regex escaped sample string that you want to match. Escaping means telling the regular expression engine that characters from your sample string that have meaning to it are just simple characters. Example 12.23.34.56, which is an IP address, if used exactly like that as a regular pattern, will match 12a23b34c56, because the dot is a catch all special character for regular expressions. The pattern working as expected would be 12\.23\.34\.56. Escaping brings several severe problems:

  • it makes the pattern less humanly readable
    • think of a phrase in which all white space has been replaced with \s+ to make it more robust (it\s+makes\s+the\s+pattern\s+less\s+humanly\s+readable)
  • you only escape with a backslash in every flavor of regular expressions, but the characters that are considered special are different depending on the specific implementation
  • many characters found in very widely used data formats like XML and JSON are special characters in regular expressions and the escaping character for regex is a special character in JSON and also string in many popular programming languages, forcing you to "double escape", which magnifies the problem
    • this is often an annoyance when you try to store regular patterns in config files

  Readability and Maintainability

Perhaps the biggest issue with regular expressions is that they are hard to maintain. Have you ever tried to understand what a complex regular expression does? The author of the code started with some input data and clear objectives of what they wanted to match, went to regular expression testers, found what worked, then dumped a big string in the code or configuration file. Now you have neither the input data or the things that they wanted matched. You have to decompile the regular pattern in your head and try to divine what it was trying to do. Even when you manage to do that, how often do developers redo the testing step so they verify the changes in a regular expressions do what was intended?

Combine this with the escape issue and the duplication of subpatterns issue and you get every developer's nightmare: a piece of code they can't understand and they are afraid to touch, one that is clearly breaking every tenet of their religion, like Don't Repeat Yourself or Keep It Simple Silly, but they can't change. It's like an itch they can't scratch. The usual solution for code like that is to unit test it, but regular expression unit tests are really really ugly:

  • they contain a lot of text variables, on many lines
  • they seem to test the functionality of the regular expression engine, not that of the regular expression pattern itself
  • usually regular expressions are used internally, they are not exposed outside a class, making it difficult to test by themselves

  Risk

Last, but not least, regular expressions can work poorly in some specific situations and people don't want to learn the very abstract computer science concepts behind regular expression engines in order to determine how to solve them.

  • typical example is lazy modifiers (.*? instead of .*) which tell the engine to not be greedy (get the least, not the most)
    • ex: for input "ineffective" the regular expression .*n will work a lot worse than .*?n, because it will first match the entire word, then see it doesn't end with n, then backtrack until it gets to "in" which it finally matches. The other syntax just stops immediately as it finds the n.
  • another typical example is people trying to find the HTML tag that has a an attribute and they do something like \<.*href=\"something\".*\/\> and what it matches is the entire HTML document up to a href attribute and the end of the last atomic tag in the document.
  • the golden hammer strikes again
    • people start with a simple regular expression in the proof of concept, they put it in a config file, then for the real life application they continuously tweak the configured pattern instead of trying to redesign the code, until they get to some horrible monstrosity
    • a regex in an online library that uses look aheads and look behinds solves the immediate problem you have, so you copy paste it in the code and forget about it. Then the production app has serious performance problems. 

  Solutions

  There are two major contexts in which to look for solutions. One is the search/replace situation in a tool, like a text editor. In this case you cannot play with code. The most you can hope for is that you will find a tester online that supports the exact syntax for regular expressions of the tool you are in. A social solution would be to throw shade on lazy developers that think only certain bits of regular expressions should be supported and implemented and then only in one particular flavor that they liked when they were children.

  The second provides more flexibility: you are writing the code and you want to use the power of regular expressions without sacrificing readability, testability and maintainability. Here are some possible solutions:

  • start with simple stuff and learn as you go
    • the overwhelming majority of the time you need only the very basic features of regular expressions, the rest you can look up when you need them
    • if the regular expression becomes too complex it is an indication that maybe it's not the best approach
  • store subpatterns in constants than you can then reuse using templated strings
    • ex: var yearPattern = @"(?<year>\d{4})"; var datePattern = $@"\b(?:{yearPattern}-(?<month>\d{2})-(?<day>\d{2})|(?<month>\d{2})\/(?<day>\d{2})\/{yearPattern})\b";
    • the example above only stores the year in another variable, but you can store the two different formats, the day, the month, etc
    • in the end your code might look more like the Backus-Naur (BNF) syntax, in which every separate component is described separately
  • use verbatim strings to make the patterns more readable by avoiding double escaping
    • in C# use @"\d+" not "\\d+"
    • in Javascript they are called template literals and use backticks instead of quotes, but they have the same mechanism for escaping characters as normal strings, so they are not a solution
  • use simple regular expressions or, if not, abstract their use
    • a good solution is using a fluent interface (check this discussion out) that allows the expressivity of human language for something that ends up being a regular expression
    • no, I am not advocating creating your own regular expression syntax... I just think someone probably already did it and you just have to find the right one for you :)
  • look for people who have solved the same problem with regular expression and you don't have to rewrite the wheel
  • always test your regular expressions on valid data and pay attention to the time it took to get the matches
  • double check any use of the string methods like IndexOf, StartsWith, Contains, even Substring. Could you use regular expressions?
    • note that you cannot really chain these methods. Replacing a regular expression like ^http[s]?:// with methods always involves several blocks of code and the introduction of cumbersome constants:
      if (text.StartsWith("http")) {
        // 4 works now, but when you change the string above, it stops working
        // you save "http" to a constant and then use .Length, you write yet another line of code
        text=text.Substring(4);
      } else {
        return false;
      }
      if (text.StartsWith("s")) {
        text=text.Substring(1);
      }
      return text.StartsWith("://");
      
      // this can be easily refactored to
      return text
               .IfStartsWith("http")
               .IfStartsWithOrIgnore("s")
               .IfStartsWith("://");
      // but you have to write your own helper methods
      // and you can't save the result in a config file​

  Conclusion

Regular expressions look daunting. Anyone not familiar with the subject will get scared by trying to read a regular expression. Yet most regular expression patterns in use are very simple. No one actually knows by heart the entire feature set of regular expressions: they don't need to. Language features and online tools can help tremendously to make regex readable, maintainable and even testable. Regular expressions shine for partial input validation and efficient string search and replace operations and can be easily stored in configuration files or data stores. When regular expressions become too complex and hard to write, it might be a sign you need to redesign your feature. Often you do not need to rewrite a regular expression, as many libraries with patterns to solve most common problems already exist.

Comments

Carlos

Hello, great article, I enjoyed reading it. Thanks for sharing!

Carlos

Post a comment