Wednesday, December 8, 2010

Speed of Contains() on string list and hash set in C#

Although I was 99.9% certain that a Contains() on a HashSet would outperform the same on a List I needed to see it happen before I could sleep again.

Here is the code that I used to test it.

int iterateNum = 100000000;
string longString = "this is a very very very long string that is in the list";
string searchString = "this is the search string";

List<string> stringList = new List<string>();
for (int i = 0; i < iterateNum; i++)
{
    stringList.Add(longString);
}
stringList.Add(searchString);

Stopwatch stopwatch = Stopwatch.StartNew();
bool contains = stringList.Contains(searchString);
Console.WriteLine("contains on list took: " + stopwatch.Elapsed.ToString());

HashSet<string> hashSet = new HashSet<string>();
for (int i = 0; i < iterateNum; i++)
{
    hashSet.Add(longString);
}
hashSet.Add(searchString);
stopwatch.Reset(); // should be stopwatch.Restart();
contains = hashSet.Contains(searchString);
Console.WriteLine("contains on hashset took: " + stopwatch.Elapsed.ToString());

And here are the results:

contains on list took: 00:00:01.2830928
contains on hashset took: 00:00:00

i.e. Was not possible to measure how long the hash set took.

In the code I'm adding the string to find as the last string in the list to make sure that the entire list is searched before it finds the string.

Edit: As per Bill's suggestion in the first comment I changed the code to populate a smaller and more realistic list and then exercised the Contains() method multiple times to get an "average" of its performance. I also replaced the Reset() method on the Stopwatch object with Restart(). Here is the new code:

int iterateNum = 4000;
int testNum = 100000;
string longString = "this is a very very very long string that is in the list";
string searchString = "this is the search string";

List<string> stringList = new List<string>();
for (int i = 0; i < iterateNum; i++)
{
    stringList.Add(longString);
}
stringList.Add(searchString);

Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < testNum; i++)
{
    bool contains = stringList.Contains(searchString);
}
Console.WriteLine("contains on list took: " + stopwatch.Elapsed.ToString());

HashSet<string> hashSet = new HashSet<string>();
for (int i = 0; i < iterateNum; i++)
{
    hashSet.Add(longString);
}
hashSet.Add(searchString);

stopwatch.Restart(); // .NET 4 method on Stopwatch
for (int i = 0; i < testNum; i++)
{
    bool contains = hashSet.Contains(searchString);
}
Console.WriteLine("contains on hashset took: " + stopwatch.Elapsed.ToString());

And the results are:

contains on list took: 00:00:04.6986760
contains on hashset took: 00:00:00.0056837
 

 

4 comments:

  1. Your stopwatch.Reset() should be stopwatch.Restart() since the former stops the stopwatch and resets it to zero whereas the latter does that and also starts it again.
    Also, a more representative test would be to populate the list and hashset with 100MM different strings and then find one of them. Do 10K of those requests and then average the times and you'll get some representative numbers.
    As it stands now, the list is hobbled because the value is always at the end. The hash of the search string could very well place it close to the beginning or to wherever the algorithm begins its search. Realistically, the search string could be located anywhere within either collection and that could affect the performance of either.
    Quibbles aside, there is no question in my mind that Contains on a hash-based data structure will perform better than on any list-based one. Thanks for putting some numbers behind it!

    ReplyDelete
  2. Thanks for the comments Bill - I fixed the code as per your recommendations and bug finds and have rerun it and added the new results.

    ReplyDelete
  3. If all you want is to get a true or false value, I suggest using switch. Sample code:
    public bool isNumber_hashSet(char c)
    {
    bool b = false;
    HashSet<char> hs = new HashSet<char> { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', };
    if (hs.Contains(c))
    {
    b = true;
    }
    return b;
    }
    public bool isNumber_switch(char c)
    {
    bool b = false;
    switch (c)
    {
    case '0': b = true; break;
    case '1': b = true; break;
    case '2': b = true; break;
    case '3': b = true; break;
    case '4': b = true; break;
    case '5': b = true; break;
    case '6': b = true; break;
    case '7': b = true; break;
    case '8': b = true; break;
    case '9': b = true; break;
    }
    return b;
    }
    I ran each of the methods 1000000 times. Result:
    isNumber_hashSet: 1700 ms
    isNumber_switch: 35 ms
    In other words, switch us about 50(!) times faster than HashSet.
    Happy coding!

    ReplyDelete
  4. @matsolof --> On a real application you would cache your instance of hashset.
    The instantiation in each time you call the function is causing this difference in time. I didn't test it, but I'm pretty sure it would make a huge diff in time.
    Kind regards!

    ReplyDelete