Regular expression extravaganza

Published Jan 19, 2008


Warning: this is going to be one long and messy article. I will also update it from time to time, since it contains work in progress.

Update: I've managed to uncover something new called lookbehinds! They try to match text that is behind the regular expression runner cursor. Using lookbehinds, one might construct a regular expression that would only match a certain maximum length, fixing the problem with huge mismatch times in some situations like CSV parsing a big file that has no commas inside.

Update 2: It wouldn't really work, since look-behinds check a match AFTER it has been matched, so they don't optimize anything. It would have been great to have support for several regular expressions run in parallel on the same string.

What started me up was a colleague of mine, complaining about the ever changing format of import files. She isn't the only one complaining, mind you, since it happened to me at least on one project before. Basically, what you have is a simple text file, either comma separated, semicolon separated, fixed width, etc, and you want to map that to a table. But after you make this beautiful little method to take care of that, the client sends a slightly modified file in an email attachment, with an accompanying angry message like: "The import is not working anymore!".

Well, I have been fumbling with the finer aspects of regular expressions for about two weeks. This seemed like the perfect application of Regex: just save the regular expression in a configuration string then change it as the mood and IQ of the client wildly fluctuates. What I needed was:

a general format for parsing the data
a way to mark the different matched groups with meaningful identifiers
performance and resource economy

The format is clear: regular expression language. The .NET flavour allows me to mark any matched group with a string. The performance should be as good as the time spent on the theory and practice of regular expressions (about 50 years).

There you have it. But I noticed a few problems. First of all, if the file is big (as client data usually is), loading the entire content into a string and parsing it afterwards would take gigantic amounts of memory and processing power. Regular expressions don't work with streams, at least not in .NET. What I needed was a Regex.Match(Stream stream, string pattern) method.

Without too much explanation (except the in-code comments), here is a class that does that. I made it today in a few hours and tested it; it works. I'll detail my findings after the code.

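using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace Siderite.Text.RegularExpressions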
{
/// <summary>
/// Regex Match equivalent for Streams
/// </summary>
public class StreamMatch
{
private readonly Match _match;
private readonly StreamRegex _streamRegex;
private long _streamIndex;

/// <summary>
/// Constructor
/// </summary>
/// <param name="regex">StreamRegex that generated it</param>
/// <param name="m">internal Regex match</param>
/// <param name="streamIndex">real stream index</param>
internal StreamMatch(StreamRegex regex, Match m, long streamIndex)
{
_streamRegex = regex;
_match = m;
StreamIndex = streamIndex;
}

/// <summary>
/// Empty StreamMatch
/// </summary>
public static StreamMatch Empty
{
get { return new StreamMatch(null, null,0); }
}

/// <summary>
/// Success of matching
/// </summary>
public bool Success
{
get
{
if (_match == null) return false;
// matches with length 0 would only cause infinite loops
if ((_streamRegex.EmptyRegexMode==EmptyRegexMode.NoSuccess)&&(_match.Length == 0)) 
return false;
return _match.Success;
}
}

/// <summary>
/// Length of match
/// </summary>
public int Length
{
get
{
if (_match == null) return 0;
return _match.Length;
}
}

/// <summary>
/// Groups in the match. This is a regular Regex GroupCollection
/// </summary>
public GroupCollection Groups
{
get
{
if (_match == null) return null;
return _match.Groups;
}
}

/// <summary>
/// Value of the match
/// </summary>
public string Value
{
get
{
if (_match == null) return null;
return _match.Value;
}
}

/// <summary>
/// Group names in the match including "0","1", etc
/// </summary>
public string[] GroupNames
{
get
{
if (_streamRegex == null) return null;
return _streamRegex.GroupNames;
}
}

/// <summary>
/// Group names in the match that are non numeric (words only)
/// </summary>
public IEnumerable<string> NonNumericGroupNames
{
get
{
if (_streamRegex == null) return null;
return _streamRegex.NonNumericGroupNames;
}
}

public long StreamIndex
{
get { return _streamIndex; }
set { _streamIndex = value; }
}

/// <summary>
/// indexer using a name
/// </summary>
/// <param name="groupName"></param>
/// <returns>Regular Regex Group. The group Index has no meaning</returns>
public Group this[string groupName]
{
get
{
if (_match == null) return null;
return _match.Groups[groupName];
}
}

/// <summary>
/// indexer using a numeric index
/// </summary>
/// <param name="groupNr"></param>
/// <returns>Regular Regex Group. The group Index has no meaning</returns>
public Group this[int groupNr]
{
get
{
if (_match == null) return null;
return _match.Groups[groupNr];
}
}

/// <summary>
/// Find next match
/// </summary>
/// <returns></returns>
public StreamMatch NextMatch()
{
if (_streamRegex == null) return null;
return _streamRegex.NextMatch();
}
}

public enum EmptyRegexMode
{
NoSuccess,
ThrowException,
Success
}

/// <summary>
/// Regex equivalent for Streams
/// </summary>
public class StreamRegex
{
private readonly Regex _regex;
private readonly StringBuilder _sb;
private readonly Stream _stream;
private byte[] _buffer;
private int _bufferSize = 65536;
private long _currentPosition;
private Encoding _encoding;
private string[] _groupNames;
private long _internalStreamLength;
private bool _matchInitialised;
private long _maxMatchLength;
private EmptyRegexMode _emptyRegexMode;

/// <summary>
/// Constructor
/// </summary>
/// <param name="stream">Any readable stream.</param>
/// <param name="pattern">A regex pattern</param>
public StreamRegex(Stream stream, string pattern)
: this(stream, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline| RegexOptions.Compiled)
{
}

/// <summary>
/// Constructor
/// </summary>
/// <param name="stream">Any readable stream.</param>
/// <param name="pattern">A regex pattern</param>
/// <param name="regexOptions">Regex options to be used</param>
public StreamRegex(Stream stream, string pattern, RegexOptions regexOptions)
: this(stream, pattern, regexOptions, Encoding.GetEncoding(1252))
{
}

/// <summary>
/// Constructor
/// </summary>
/// <param name="stream">Any readable stream.</param>
/// <param name="pattern">A regex pattern</param>
/// <param name="regexOptions">Regex options to be used</param>
/// <param name="encoding">Encoding of the stream</param>
public StreamRegex(Stream stream, string pattern, RegexOptions regexOptions, Encoding encoding)
: this(stream, pattern, regexOptions, encoding, long.MaxValue/2)
{
}

/// <summary>
/// Constructor
/// </summary>
/// <param name="stream">Any readable stream.</param>
/// <param name="pattern">A regex pattern</param>
/// <param name="regexOptions">Regex options to be used</param>
/// <param name="encoding">Encoding of the stream</param>
/// <param name="maxMatchLength">Maximum possible match. Specify this to minimize memory use</param>
public StreamRegex(Stream stream, string pattern, RegexOptions regexOptions, Encoding encoding,
long maxMatchLength)
{
_regex = new Regex(pattern, regexOptions);
_stream = stream;
_matchInitialised = false;
Encoding = encoding;
MaxMatchLength = maxMatchLength;
_sb = new StringBuilder();
EmptyRegexMode = EmptyRegexMode.NoSuccess;
}

/// <summary>
/// Read buffer size. default 65536 bytes
/// Warning: this works on an internal string that will be 
/// a multiple of this number. Don't make it too large.
/// </summary>
public int BufferSize
{
get { return _bufferSize; }
set
{
_bufferSize = value;
// you cannot run NextMatch now
_matchInitialised = false;
}
}

/// <summary>
/// Encoding for the stream. default Windows-1252
/// </summary>
public Encoding Encoding
{
get { return _encoding; }
set { _encoding = value; }
}

/// <summary>
/// Maximum possible match length. default long.MaxValue/2
/// </summary>
public long MaxMatchLength
{
get { return _maxMatchLength; }
set { _maxMatchLength = value; }
}

/// <summary>
/// Names of the groups in the regex expression
/// </summary>
public string[] GroupNames
{
get
{
if (_regex == null) return null;
if (_groupNames == null)
{
_groupNames = _regex.GetGroupNames();
}
return _groupNames;
}
}

/// <summary>
/// Names of the groups in the regex expression
/// that are non numeric (so only the word names)
/// </summary>
public IEnumerable<string> NonNumericGroupNames
{
get
{
if (_regex == null) return null;
List<string> list = new List<string>();
foreach (string groupName in _regex.GetGroupNames())
{
if (!Regex.IsMatch(groupName, @"^\d+$")) list.Add(groupName);
}
return list;
}
}

/// <summary>
/// defines how empty matches are treated: 
/// no success match, exception or successful match
/// defaults to no success match
/// </summary>
public EmptyRegexMode EmptyRegexMode
{
get { return _emptyRegexMode; }
set { _emptyRegexMode = value; }
}

/// <summary>
/// Match the pattern on the stream, starting with current position
/// </summary>
/// <returns></returns>
public StreamMatch Match()
{
return Match(0L);
}

/// <summary>
/// Match the pattern on the stream, starting with current position
/// </summary>
/// <param name="index">skip index bytes</param>
/// <returns></returns>
public StreamMatch Match(long index)
{
return Match(index, long.MaxValue/2);
}

/// <summary>
/// Match the pattern on the stream, starting with current position
/// </summary>
/// <param name="index">skip index bytes</param>
/// <param name="length">read no more than length bytes</param>
/// <returns></returns>
public StreamMatch Match(long index, long length)
{
if (!_stream.CanRead) throw new Exception("Stream is not readable");
if (length <= 0) return StreamMatch.Empty;
if (index != 0) SeekStream(index);
_buffer = new byte[BufferSize];
_sb.Length = 0;
// initialize real position in stream
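// Position throws on non-seekable streams, so only resync when it is available
if (_stream.CanSeek) _currentPosition = _stream.Position;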
// this is the maximum length of bytes this match will read
_internalStreamLength = length;
// you can now call NextMatch
_matchInitialised = true;
return NextMatch();
}

internal StreamMatch NextMatch()
{
if (!_matchInitialised) 
throw new Exception("First run Match, then NextMatch");
// search the buffer string first
Match m = _regex.Match(_sb.ToString());
if (m.Success&&((EmptyRegexMode!=EmptyRegexMode.NoSuccess)||(m.Length>0)))
{
if (EmptyRegexMode == EmptyRegexMode.ThrowException)
{
if (m.Length == 0)
throw new Exception(
"Successful Match has empty value. If you use NextMatch in a loop it will become infinite");
}
StreamMatch sm = new StreamMatch(this, m, _currentPosition + m.Index);

//remove matches to minimize memory usage
int extra = m.Index + m.Length;
_sb.Remove(0, extra);
// update real position in stream
_currentPosition += extra;

return sm;
}
int l;
while ((l = _stream.Read(_buffer, 0, (int) Math.Min(_internalStreamLength, _buffer.Length))) > 0)
{
_internalStreamLength -= l;

// remove part of StringBuilder where there is no chance it contains a match
// this works only if MaxMatch is specified
long extraLength = _sb.Length - MaxMatchLength;
if (extraLength > 0)
{
_sb.Remove(0, (int) extraLength);
_currentPosition += extraLength;
}


_sb.Append(Encoding.GetString(_buffer, 0, l));

// TODO: remove this
Debug.WriteLine("SB length: " + _sb.Length);

m = _regex.Match(_sb.ToString());
if (m.Success)
{
StreamMatch sm = new StreamMatch(this, m, _currentPosition + m.Index);

//remove matches to minimize memory usage
int extra = m.Index + m.Length;
_sb.Remove(0, extra);
// update real position in stream
_currentPosition += extra;
return sm;
}
}
return new StreamMatch(this, m, 0);
}

/// <summary>
/// Get all matches from the stream
/// </summary>
/// <returns></returns>
public IEnumerable<StreamMatch> Matches()
{
return Matches(0L);
}

/// <summary>
/// Get all matches from the stream
/// </summary>
/// <param name="index">skip index bytes</param>
/// <returns></returns>
public IEnumerable<StreamMatch> Matches(long index)
{
return Matches(index, long.MaxValue/2);
}

/// <summary>
/// Get all matches from the stream
/// </summary>
/// <param name="index">skip index bytes</param>
/// <param name="length">read no more than length bytes</param>
/// <returns></returns>
public IEnumerable<StreamMatch> Matches(long index, long length)
{
List<StreamMatch> list = new List<StreamMatch>();
StreamMatch m = Match(index, length);
while (m.Success)
{
list.Add(m);
m = NextMatch();
}
return list;
}

/// <summary>
/// Seek the stream to specified position
/// </summary>
/// <param name="index">seek to index from current position</param>
public void SeekStream(long index)
{
SeekStream(index, SeekOrigin.Current);
}

/// <summary>
/// Seek the stream to specified position
/// </summary>
/// <param name="index">seek to index position</param>
/// <param name="origin">from current position, start or end</param>
public void SeekStream(long index, SeekOrigin origin)
{
int _tmpSeekSize = 65536;
if (_stream.CanSeek)
{
_stream.Seek(index, origin);
}
else
{
if (origin != SeekOrigin.Current) throw new Exception("Stream is not seekable");
long l = 0;
try
{
l = _stream.Length;
}
#pragma warning disable EmptyGeneralCatchClause
catch (Exception)
{
}
#pragma warning restore EmptyGeneralCatchClause
if ((l > 0) && (l < 65536)) _tmpSeekSize = 4096;
if (l > 1000000) _tmpSeekSize = 409600;
if (index >= 0)
{
byte[] buff = new byte[_tmpSeekSize];
l = index;
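while (l > 0)
{
int read = _stream.Read(buff, 0, (int) Math.Min(l, buff.Length));
// Read returns 0 at the end of the stream; stop instead of looping forever
if (read == 0) break;
l -= read;
}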
}
else
{
throw new Exception("index cannot be negative on a non-seekable stream");
}
}

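// update real position in stream (Position is not supported on non-seekable streams,
// so in that case assume the requested number of bytes was skipped)
_currentPosition = _stream.CanSeek ? _stream.Position : _currentPosition + index;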
// you cannot run NextMatch now
_matchInitialised = false;
}
}
}

One issue I had with it was that I kept translating a StringBuilder to a string. I know it is somewhat optimized, but the content of the StringBuilder was constantly changing. A Regex class that would work at least on a StringBuilder would have been a boost. A second problem was that if the input file was not even close to my Regex pattern, the matching would take forever, as the algorithm would add more and more bytes to the string and try to match it again.

And of course, there was my blunt and inelegant approach to regular expression writing. What does one do when in Regex hell? Read Steve Levithan's blog, of course! It was then that I decided to write this post and also document my regular expression findings.

So, let's summarize a bit, then add a bunch of links.

the .NET regular expression flavour supports marking a group with a name like this
```
(?<nameOfGroup>someRegexPattern)
```
it also supports non capturing grouping:
```
(?:pattern)
```
This will not appear as a Group in any match although you can apply quantifiers to it
also supported is atomic grouping, which the older .NET documentation calls a nonbacktracking (or "greedy") subexpression:
```
a(?>bc|b)c
```
The pattern above will match abcc but not abc. On abc the group matches bc and the final c finds nothing left to match. Normally the engine would backtrack into the group and try the b alternative, but an atomic group throws away its backtracking positions once it has matched, saving time but possibly skipping matches
one can also use lazy quantifiers: ab+? will match ab in the string abbbbbb
possessive quantifiers are not supported, but they can be substituted with atomic groups:
```
ab*+ in some regex flavours is a(?>b*) in .NET
```
let's not forget the
```
(?#this is a comment)
```
notation to add comments to a regular expression
Look-behinds! A great new discovery of mine: they check the text that precedes the current position, text that has already been matched. I am not sure how much they would hinder speed, though. Quick example: I want to match "This is a string", but not "This is a longer string, that I don't want to match, since it is ridiculously long and it would make my regex run really slow when I really need only a short string" :), both as separate lines in a text file.
```
([^\r\n]+)(?:$|[\r\n])(?<=(?:^|[\r\n]).{1,21})
```
This expression matches a run of characters that contains no line breaks, then looks behind to check that there is a string start or a line break character at most 21 characters back, effectively limiting the matched line to 20 characters. The dot has to match the line break that ends the line, so when that break is \n the pattern needs RegexOptions.Singleline. Unfortunately, this slows the search down even more, since the look-behind only checks a match AFTER the match has completed.
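
To make the list above concrete, here is a small sketch that exercises each construct through the standard System.Text.RegularExpressions API. The class and method names (RegexFeatureDemo, ShortLines and so on) and the sample patterns are made up for illustration. The look-behind pattern is the one quoted above, run with RegexOptions.Singleline so that the dot can match the line break that ends each line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; none of these names come from the StreamRegex code.
public static class RegexFeatureDemo
{
    // Named group: the captured text is read by name instead of by position.
    public static string NamedGroup(string input)
    {
        Match m = Regex.Match(input, @"(?<zip>\d{5})");
        return m.Success ? m.Groups["zip"].Value : null;
    }

    // Non-capturing group: (?:ab)+ is quantified as a unit but adds no entry
    // to Groups, so only the implicit group 0 (the whole match) is present.
    public static int GroupCountWithNonCapturing(string input)
    {
        return Regex.Match(input, @"(?:ab)+").Groups.Count;
    }

    // Atomic group: once (?>bc|b) has matched "bc", the engine cannot go back
    // into the group and try the "b" alternative.
    public static bool AtomicMatches(string input)
    {
        return Regex.IsMatch(input, @"a(?>bc|b)c");
    }

    // Lazy quantifier: b+? is satisfied by the first b and stops there.
    public static string LazyMatch(string input)
    {
        return Regex.Match(input, @"ab+?").Value;
    }

    // Look-behind: keep only the lines that have a string start or a line
    // break at most 21 characters before the end of the match.
    public static List<string> ShortLines(string text)
    {
        Regex regex = new Regex(
            @"([^\r\n]+)(?:$|[\r\n])(?<=(?:^|[\r\n]).{1,21})",
            RegexOptions.Singleline);
        List<string> lines = new List<string>();
        foreach (Match m in regex.Matches(text))
        {
            lines.Add(m.Groups[1].Value);
        }
        return lines;
    }
}
```

With these definitions, AtomicMatches("abcc") is true while AtomicMatches("abc") is false, and LazyMatch("abbbbbb") returns ab. ShortLines still scans every long line in full before rejecting it, which is exactly the cost described above.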

What does that mean? Well, first of all, an increase in performance: using non-capturing grouping will save memory, using atomic grouping will speed up processing. Then there is the "Unrolling the loop" trick, using atomic grouping to optimize repeated alternation like (that|this)*. Group names and comments ease the reading and reuse of regular expressions.

Now for the conclusion: using the optimizations described above (and in the following links) one can write a regular expression that can be changed, understood and used in order to break the input file into matches, each one having named groups. A CSV file and a fixed-length record file would be treated exactly the same, say by using something like (?<ZipCode>\w*),(?<City>\w*)\r\n for the first and (?<ZipCode>\w{5})(?<City>\w{45})\r\n for the second, or by using look-behinds to limit the maximum line size. All the program has to do is parse the file and create objects with the ZipCode and City properties (if present), maybe using the new C# 3.0 anonymous types. Also, I have read about the DFA versus NFA types of regular expression implementations. DFAs are a lot faster, but cannot support many features that are supported by NFA implementations. The .NET regex flavour is NFA, but using atomic grouping and other such optimizations narrows the gap between the two.
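
A rough sketch of that idea follows: the pattern is treated as configuration, and every match becomes a dictionary keyed by the named groups, so the CSV and the fixed-width cases go through the same code. RecordParser and Parse are hypothetical names, and the 5 + 10 character fixed-width layout in the usage note is an assumed, shortened version of the one above. For brevity it works on a string rather than on a stream.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: turns every match into a name -> value record.
public static class RecordParser
{
    public static List<Dictionary<string, string>> Parse(string text, string pattern)
    {
        Regex regex = new Regex(pattern);
        List<Dictionary<string, string>> records = new List<Dictionary<string, string>>();
        foreach (Match m in regex.Matches(text))
        {
            Dictionary<string, string> record = new Dictionary<string, string>();
            foreach (string name in regex.GetGroupNames())
            {
                // GetGroupNames also returns "0", "1", ...; keep only the word names.
                int ignored;
                if (int.TryParse(name, out ignored)) continue;
                // Trim removes the padding of fixed-width fields.
                record[name] = m.Groups[name].Value.Trim();
            }
            records.Add(record);
        }
        return records;
    }
}
```

Calling Parse("90210,Beverly\r\n", @"(?<ZipCode>\w*),(?<City>\w*)\r\n") and Parse("90210Beverly   \r\n", @"(?<ZipCode>.{5})(?<City>.{10})\r\n") both produce a single record with ZipCode 90210 and City Beverly, so only the configured pattern changes when the client changes the file format.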

There is more to come, as I come to understand these things. I will probably keep reading my own post in order to keep my thoughts together, so you should also stay tuned, if interested. Now the links:

.NET Framework General Reference Grouping Constructs
.NET Framework General Reference Quantifiers
Steve Levithan's blog
Regular Expression Optimization Case Study
Optimizing regular expressions in Java
Atomic Grouping
Look behinds
Want faster regular expressions? Maybe you should think about that IgnoreCase option
Scott Hanselman's .NET Regular Expression Tool list
Compiling regular expressions (also worth noting is that the static method Regex.Match will cache about 15 used regular expressions so that they can be reused. There is also the Regex.CacheSize property that can be used to change that number)
Regular expressions at Wikipedia
Converting a Regular Expression into a Deterministic Finite Automaton
From Regular Expressions to DFA's Using Compressed NFA's

There is still work to be done. The optimal StreamRegex would not need StringBuilders and strings, but would work directly on the stream. There are a lot of properties that I didn't expose from the standard Regex and Match objects. The GroupCollection and Group objects that my class exposes are normal Regex objects, so some of their properties do not make sense (like Index). Normally, I would have inherited from Regex and Match, but Match doesn't have a public constructor, even if it is not sealed. Then again, I've read somewhere that one should use composition over inheritance whenever possible. Also, there are some rules to be implemented in my grand importing scheme, like some things should not be null, or should be in a range of values or in some relation to other values in the same record and so on. But that is beyond the scope of this article.

Any opinions or suggestions would really be appreciated, even if they are not positive. As a friend of mine said, every kick in the butt is a step forward or a new and interesting anal experience.

Update:

I've taken the Reflected sources of System.Text.RegularExpressions in the System.dll file and made my own library to play with. I might still get somewhere, but the concepts in that code are way beyond my ability to comprehend in the two hours that I allowed myself for this project.

What I've gathered so far:

the Regex class is not sealed
Regex calls on a RegexRunner class, which is also public and abstract
RegexRunner asks you to implement the FindFirstChar, Go and InitTrackCount methods, while all the other methods it has are protected but not virtual. In the MSDN documentation on it, this text seals the fate of the class: "This API supports the .NET Framework infrastructure and is not intended to be used directly from your code."
The RegexRunner class that the Regex class calls on is the RegexInterpreter class, which is a lot of extra code and, of course, is internal sealed


The conclusion I draw from these points and from the random experiments I did on the code itself is that there is no convenient way of inheriting from Regex or any other class in the System.Text.RegularExpressions namespace. It would be easy, once the code is freely distributed with comments and everything, to change it in order to allow for custom Go or ForwardCharNext methods that would read from a stream when reaching the end of the buffered string, or cause a mismatch once the runmatch exceeds a certain maximum length. Actually, this last point is the reason why regular expressions cannot be used as freely as my original idea for this post suggested, since trying to parse a completely different file than the one intended would result in huge time consumption.
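
This article predates it, but .NET Framework 4.5 and later let a Regex carry a match timeout, which at least puts a ceiling on the time lost to a completely mismatched file. A minimal sketch, assuming such a framework version; TimedMatcher and TryMatch are hypothetical names, and the time budget is whatever fits the import job.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the Regex(pattern, options, matchTimeout) constructor.
public static class TimedMatcher
{
    // Returns true and the match when one is found within the budget.
    // Returns false when nothing matches or when the budget runs out.
    public static bool TryMatch(string input, string pattern, TimeSpan budget, out Match match)
    {
        try
        {
            Regex regex = new Regex(pattern, RegexOptions.None, budget);
            match = regex.Match(input);
            return match.Success;
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up: treat the input as "not the file we expected".
            match = Match.Empty;
            return false;
        }
    }
}
```

This does not make a mismatch cheap, it only bounds the damage: an import could call TryMatch with a budget of a second or so and report a format error when it returns false.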

Strike that! I've compiled a regular expression into an assembly (in case you don't know what that is, check out this link) and then used Reflector on it! Here is how to make your own regular expression object:

Step 1: inherit from Regex and set some base protected values. One that is essential is base.factory = new YourOwnFactory();

Step 2: create said YourOwnFactory by inheriting from RegexRunnerFactory, override the CreateInstance() method and return a YourOwnRunner object. Like this:

class YourOwnFactory : RegexRunnerFactory
{
    protected override RegexRunner CreateInstance()
    {
        return new YourOwnRunner();
    }
}

Step 3: create said YourOwnRunner by inheriting from abstract class RegexRunner. You must now implement FindFirstChar, Go and InitTrackCount.

You may recognize here a Factory design pattern! However, consider that the normal Microsoft implementation (the internal sealed RegexInterpreter) has something like 36 KB/1100 lines of highly optimised code. This abstract class is available to poor mortals for the single reason that Microsoft needed it in order to implement regular expressions compiled into separate assemblies.

I will end this article with my X-mas wish list for regular expressions:

An option to match two or more regular expressions in parallel on the same string. This would allow me to check for a really complicated expression and at the same time validate it (for length, format, or whatever)
Stream support. The hack in the above code works, but does not really tap into the power of regular expressions. The support should be included in the engine itself
Extensibility support. Maybe this would have been a lot easier if there was some support for adding custom expressions, maybe hidden in the .NET (?#comment) syntax.
