Careful using Regex classes

Published Nov 15, 2021

Posted in
programming
c#

The point of regular expression character classes is to simplify your expressions, but they can introduce subtle bugs or efficiency issues.

Let's check out this StackOverflow answer to question \d less efficient than [0-9]:

\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, Persian digits, ۱۲۳۴۵۶۷۸۹, are an example of Unicode digits which are matched with \d, but not [0-9].

This makes sense, only it has never occurred to me until this very moment. I would never use a [0-9] notation and I would replace it with a \d if found in code.

What does that mean?

One simple consequence of such a class would be performance: searching for a large list of characters is less efficient. Another would be introducing the possibility for bugs or even malicious attacks. Let's see the code for a calculator that adds two numbers. It's a silly piece of code, but imagine that a more complex one would take the user content and save it into a database, try to process it or display it.

static void Main(string[] args)
{
    Console.InputEncoding = Encoding.Unicode;
    var firstNumber = GetNumberString();
    var secondNumber = GetNumberString();
    Console.WriteLine("Sum = "+(int.Parse(firstNumber) + int.Parse(secondNumber)));
}

private static string GetNumberString()
{
    string result=null;
    var isNumber = false;
    while (!isNumber)
    {
        Console.Write("Enter a number: ");
        result = Console.ReadLine();
        isNumber = Regex.IsMatch(result, @"^\d+$");
        if (!isNumber)
        {
            Console.WriteLine($"{result} is not a number! Try again.");
        }
    }
    return result;
}

This will try to get numbers as a string and test it using the regular expression ^\d+$, which means the string has to consist of one or more digits. Note that I had to set the console input encoding to Unicode in order to be able to paste Persian numbers. This code works fine until I use Arabic or Persian digits, where it breaks in the int.Parse method. Using ^[0-9]$ as the regular expression pattern would solve this issue.

Same issue will occur with \w (warning: \w is letters AND digits) and [a-zA-Z] (or just [a-z] and using RegexOptions.IgnoreCase).

If one uses code to determine the number of matches for each regular expression pattern

var regexPattern = @"\d";
var nr = 0;
for (int i = 0; i < ushort.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, regexPattern))
        nr++;
}
Console.WriteLine(nr);

we get this:

for \d : 370
- ALL digits
for \w : 50320
- ALL word characters (including digits)
for [^\W\d] : 49950
- ALL word characters, but not the digits
for \p{L} : 48909
- ALL letters
for [A-Za-z] : 52
- letters from a to z
for [0-9] : 10
- digits from 0 to 9

I hope this helps.

Careful using Regex classes

Comments

Post a comment