Monday, November 9, 2009

Speed improvements with compiled regex

During a code review I was told that a compiled regex would work faster. I had no doubt that this was true but I wanted to know how much faster and at what cost. I setup and ran the following test.

static void TestCompiledRegex()
{
    string regexString = @"(\{{0,1}([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12}\}{0,1})";
    Regex compiledRegex = new Regex(regexString, RegexOptions.Compiled);
    Regex uncompiledRegex = new Regex(regexString, RegexOptions.None);

    double totalFaster = 0d;
    int numIterations = 0;
    for (int j = 10; j <= 100; j += 10)
    {
        TimeSpan uncompiledTime = RunRegex(uncompiledRegex, j);
        TimeSpan compiledTime = RunRegex(compiledRegex, j);
        double timesFaster = uncompiledTime.TotalMilliseconds / compiledTime.TotalMilliseconds;
        Console.WriteLine("For {0} GUIDS compiled takes {1} and non-compiled takes {2}. Compiled is {3:0.00} faster", j, compiledTime, uncompiledTime, timesFaster);
        totalFaster += timesFaster;
        numIterations++;
    }
    Console.WriteLine("Average times faster: {0:0.00}", totalFaster/(double)numIterations);
    Console.ReadLine();
}

static TimeSpan RunRegex(Regex regex, int numGuids)
{
    int x;
    string input = GetStringWithGuids(numGuids);
    DateTime startTime = DateTime.Now;
    for (int i = 0; i < 10000; i++)
    {
        MatchCollection mc = regex.Matches(input);
        x = mc.Count;
    }
    return DateTime.Now - startTime;

}

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numGuids; i++)
    {
        sb.Append(Guid.NewGuid().ToString() + " spacer ");
    }
    return sb.ToString();
}

The result was that the compiled regex ran 1.6 times faster than the uncompiled version.

I then split the RunRegex function into two functions and moved the creation of the Regex object into these functions making one of them compiled and the other not.

On text with 10 GUIDs in it the uncompiled version ran 42 times faster than the compiled version. The number of times faster diminished as the number of GUID's (matches) in the string increased until the uncompiled version was six times faster when we had 100 matches in the string.

The next test was to decrease the number of matches to non-matches in the text so I adjusted the GetStringWithGuids() function to add 100 "spacers" between each GUID. Remember that the listed GetStringWithGuids() above has 1 match (GUID) per non-match(" spacer "). The new function looked like this:

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    string spacer = String.Join(" ", Enumerable.Repeat("spacer", 100).ToArray());
    for (int i = 0; i < numGuids; i++)
    {
        sb.Append(Guid.NewGuid().ToString() + " " + spacer + " ");
    }
    return sb.ToString();
}

For 10 GUIDs the uncompiled version performed 2.7 times better but at 50 GUIDs the compiled version started performing better through to 100 GUIDs.

So the only test left was the one that was truly representative of the data that I was going to run this against which was a block of text with a single GUID in it.

The new GetStringWithGuids() function with redundant parameter looked like this:

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    string spacer = String.Join(" ", Enumerable.Repeat("spacer", 100).ToArray());
    sb.Append(spacer + " " + Guid.NewGuid().ToString() + " " + spacer);
    return sb.ToString();
}

This showed the uncompiled version to be 10 times faster than the compiled version.

4 comments:

  1. How is it possible that the same regular expression can be *faster* when not compiled? When the two are compared after being run once, I can get. But these are being run in loops that would spread out the compilation hit. Something doesn't smell right: it's either the methodology, the measurement, or the algorithm.
    Towards the second piece, are you familiar with the System.Diagnostics.Stopwatch? It'd be more accurate than date comparisons on milliseconds. But I don't think that'd resolve the cognitive dissonance here.

    ReplyDelete
  2. What I was doing was compiling the regular expression each time in order to measure the time taken to compile the regex versus using it uncompiled. e.g. as if the function had only been called once.
    I was not aware of System.Diagnostics.Stopwatch. Thanks for the heads up - I'll take a look at that.

    ReplyDelete
  3. When doing timing exercises in Unix, the <a href="http://unixhelp.ed.ac.uk/CGI/man-cgi?time">time</a> utility is useful (its also available in windows through cygwin). You wrap the call to your program in time:
    $> time my-test
    When your program has completed, time dumps the wall time (real time), user time and system (kernel) time to the console. This is useful for removing any bias contention on the system might cause. On a CPU bound test, user and system time will generally be the same, while wall time can fluctuate with transient scheduling patterns.

    ReplyDelete
  4. As tempting as Unix is, :), I,ve discovered that the Measure-Command cmdlet in Powershell appears to do the same thing. I had CygWin installed on a machine a while back but don't have it installed at the moment and will probably try and stick to Powershell until someone sells me on the LAMP stack.

    ReplyDelete