Thursday, November 5, 2009

Optimizing a custom Trim() function in C#

I needed to write a Trim() function that removed anything that wasn't a digit from the beginning and end of a string. What usually happens when I write a blog entry like this is that someone posts a comment a few days later saying "hey, you didn't need to write that, it's already in the .NET library." If it is, I haven't been able to find it.

Here is my TrimNonDigits() function first attempt:

protected string TrimNonDigits_v1(string text)
{
    while (text.Length > 0 && !Char.IsDigit(text, 0))
    {
        text = text.Substring(1);
    }
    while (text.Length > 0 && !Char.IsDigit(text, text.Length - 1))
    {
        text = text.Substring(0, text.Length - 1);
    }

    return text;
}

Now this function works and for what I needed it for (trivially short strings) it would do the job. But it is my intention to add it to a string extension library and I had a fear that someone would pass in an enormous string and with all the Substring() operations creating new strings this may become extremely inefficient.

To see how inefficient this function was I wrote another unit test to specifically measure the speed at which the code ran. Here is that unit test:

[TestMethod]
public void trim_non_digits_big_text()
{
    int repeat = 10000;
    UnitTestDerived utd = new UnitTestDerived();
    string lotsOfCharacters = String.Join("", Enumerable.Repeat("abcdef", repeat).ToArray());
    string text = lotsOfCharacters + "1" + lotsOfCharacters;
    string expected = "1";
    string actual = utd.TrimNotDigits(text);
    Assert.AreEqual(expected, actual);
}

The unit test passes but takes 21 seconds to run. Although extremely unlikely that this amount of non-digit-text on either side of a number might appear it's still a valid use case so the code needed reworking so that the Substring() function was only called once. This is the resulting change to the code:

protected string TrimNonDigits(string text)
{
    int start = 0, end = text.Length - 1;
    while (text.Length > start && !Char.IsDigit(text, start))
    {
        start++;
    }
    if (start != end)
    {
        while (end > start && !Char.IsDigit(text, end))
        {
            end--;
        }
    }
    return text.Substring(start, 1 + end - start);
}

The unit test runs against this new function in under 1 second. That's a dramatic difference and good demonstration for the case of not using Substring() repeatedly in a loop if you can help it.

A side note. I also had several other unit tests that checked the validity of the results of the algorithm before I did the optimization. These included the testing of edge cases and extremes. This made the optimization refactor a cinch because I was fairly confident at the end that the algorithm will work in all situations.

5 comments:

  1. When I read your problem, I immediately said that this is regular expression territory. As I understand it, you want just numbers and everything else removed stripped out. Here's my first pass at a suitable regex:
    ^[^\d]*(\d+)[^\d]*$
    Replace the string with $1 and you're set. Here's the string I tested it on: "gdfg 234234234234dfg dfg fdgwwefjlsk!"
    If there's numbers mixed in between the last alphabetical characters, then it won't match and it'll fail. But I think that's outside of the scope of your problem and could be accounted for if necessary.

    ReplyDelete
  2. Based on Bill's comments above I tested another function:
    protected string TrimNonDigits(string text)
    {
    string regex = @"^[^\d]*(\d+)[^\d]*$";
    Match m = Regex.Match(text, regex);
    text = m.Groups[1].Captures[0].Value;
    return text;
    }
    In speed tests my second function runs between 2 and 3 times faster than this regex based function. Of course that may not be the best way to write this regex function and perhaps that can be improved but it was an interesting exercise.
    I also ran all the other unit tests in the project and many of them failed with the regex based function because of other numbers mixed in between as Bill mentioned in the last paragraph.

    ReplyDelete
  3. Not to labor the point, but I would have gone with a regex constructor so you can pass in RegexOptions.Compiled. I also would have used regex.Replace(text, "$1");.
    Finally, your solution will fall down in the same way as I specified if the input is something like "sdfasdfasd345345sdfg995sdfgsdg" Your output should be "345345sdfg995"
    The benefit of regular expressions is that I could easily account for this failure while lopping off the ends could never get rid of that "sdfg" in the middle. It just depends on how rigorous you want your trimming to be.

    ReplyDelete
  4. Thanks for the input Bill - much appreciated. I'm a fan of regex although I hardly ever use it.

    ReplyDelete
  5. Very useful code! I've based a method that trims all control and whitespace characters on the code,
    A test with a 115,000 characters long string says that my trim method is 11 times faster than C# Trim() with a char array. Amazing!
    Thanks a lot!

    ReplyDelete