Monday, November 30, 2009

Hard Disk speed comparison

These are five hard disks that I have in three machines that I frequently access.

Machine   Drive   MB/s      Model                Description
Pavilion  C       35/35     ST3250823AS          Seagate Barracuda 7200.8 250GB 7200 RPM 8MB Cache SATA 1.5Gb/s 3.5" Hard Drive
Pluto     C       225/225   PFZ128GS25SSDR       Patriot Torqx 2.5" Internal Solid State Disk (SSD)
Pluto     E       60/120    ST32000542AS         Seagate Barracuda LP 2TB 3.5" SATA 3.0Gb/s Hard Drive
Monster   C       150/200   WD3000GLFS-01F8U0    Western Digital VelociRaptor 300GB 3.5" SATA 3.0Gb/s Internal Hard Drive
Monster   D       90/110    ST31500341AS         Seagate Barracuda LP 1.5TB 3.5" SATA 3.0Gb/s Hard Drive

I was thinking of replacing the 250GB Seagate in the Pavilion to improve that machine's performance, but I wanted an idea of what sort of boost I could expect, so I copied some hard disk benchmarking software (ATTO) onto each machine and ran it.

The MB/s column shows the write/read speeds measured for each disk; these are rough averages.

My surprises were as follows:

  • The solid state drive (SSD) was not much faster than the VelociRaptor.
  • The Seagate 1.5TB had a faster write speed than the 2TB. (It doesn't feel faster; it feels slower.)
  • The 2TB drive is less than twice as fast as the old 250GB Seagate. (Again, the 2TB drive feels much faster.)

I've decided to replace the Pavilion's 250GB Seagate with a 2TB Seagate. The bottleneck with that machine at the moment is the hard disk, and I'm confident that I'm going to get at least a threefold speed improvement even though the numbers don't suggest that much.

Since I installed my Windows Home Server nothing has crashed on my network, so I have been unable to try out recovering a hard disk from its backup. I plan on using this new disk as an opportunity to try this with the Pavilion machine. I will add the new hard disk and then set the BIOS to treat it as the boot drive. I think that I then have to generate a recovery CD from which this machine will boot and restore the OS etc. to the new disk.

I'll post a link to my results here when done.

Destroying a Hard Disk

A number of years ago I got myself a Maxtor 250GB external hard disk which at the time I used for all my backups. It cost me a pretty penny back then (2003), and now in 2009 I can buy 4TB of space for the same amount. Don't think that I'm complaining though; I think that's great progress.




Like all good hard drives it eventually died, but it died a slow, bit-by-bit corrupting death. It started to become inconsistent (that's when I should have taken an image of it) and then just started to corrupt and fail to read data. Luckily I managed to recover all my important data and only lost a few albums of songs and movies that I could easily buy again.
The drive had only lasted about two years and Maxtor refused to entertain the idea of replacing it. Knowing how much I'd spent on it, I didn't want to throw it away, so it went into the box of computer parts that every geek has.
Years passed and during a cleanup I agreed with myself that it had to go. I really couldn't remember what else was on it but I was pretty sure that a good hardware recovery technician could get data off it, so I wanted to destroy it before disposing of it.
To this end I pulled out a drill and started drilling holes in it. The white box in the photo below shows a broken drill bit; the other end of the bit is lying on top of the circuit board.




With a larger and stronger drill bit I was able to get some deeper and more crippling holes into the disk before I decided that I had sufficiently limited anyone's ability to recover data from it.




When you read the instructions on how to handle these devices they always tell you to discharge any static before handling, work in a static-free environment, etc. I ignored all of that while I was doing this.

Wednesday, November 25, 2009

Age Analyzer

As part of the text classification work that I do I follow the uClassify blog, which has just released a new web site called Age Analyzer that guesses the age of the author of a blog. I plugged in the URL of this site and it put me in the 26-35 bracket. This would be flattering if you were to look at me, but I find it insulting that my writing is considered this immature. It should have put me in the 36-50 bracket.

My friend Rob Manderson's eloquent prose has aged his Ultramaroon Rises Again blog in the 65-100 range (I'm sure you're not that old Rob). Dan Esparza's Esta Nublado puts him at a very young 18-25, and Bill Brown's New Clarion puts him in the 65-100 bracket while his bblog looks a little more accurate at 26-35.

The Evil Solution User Option SUO file

I have a Visual Studio solution that's started taking longer and longer to load. In fact it was taking up to 5 minutes to load, which meant that I'd leave the solution open for days just so that I didn't have to reload it. Another symptomatic clue that all was not well in Denmark was that it was taking several minutes for the IDE to become usable after I stopped a debug session with Shift+F5.

My first stab at a solution was to create a new one in a different directory and add all my projects to it. It worked like a charm: the new solution loaded lightning fast (a couple of seconds) and I could exit the debug session in about 3 seconds.

Although happy that I'd solved the problem, I was then lucky enough to notice that in the original folder, next to the solution's .sln file, there was a Solution User Option .suo file which was a whopping 11MB in size. I deleted this file, opened the original solution, and discovered that it now loads and exits debug lickety-split again.
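
If you suspect other solutions are suffering from the same bloat, a quick throwaway snippet like this will list any oversized .suo files under a source tree (the path and size threshold here are just examples, not anything special):

using System;
using System.IO;

class FindBigSuoFiles
{
    static void Main()
    {
        string root = @"C:\Source";   // point this at the root of your source tree
        foreach (string file in Directory.GetFiles(root, "*.suo", SearchOption.AllDirectories))
        {
            long size = new FileInfo(file).Length;
            if (size > 1024 * 1024)   // flag anything over 1MB
            {
                Console.WriteLine("{0,12:N0} bytes  {1}", size, file);
            }
        }
    }
}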

It might be my unorthodox way of exiting debug sessions with Shift+F5 that's causing this file to grow in size. I'll keep an eye on its future growth and report back here if I learn anything new.

Monday, November 23, 2009

Even Faster Web Sites

Just been reading Even Faster Web Sites. I had never heard of Comet before I read this book but it makes sense and I'm looking forward to trying it.

Comet uses a long-held HTTP request that allows a web server to push data to a browser without the browser explicitly requesting it.
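
To make the idea concrete, here's a bare-bones long-polling sketch (one flavor of Comet) written as an ASP.NET IHttpHandler. The names are mine, not the book's, and a real implementation would use asynchronous handlers so a waiting client doesn't tie up a worker thread; this is only to show the shape of the technique.

using System.Threading;
using System.Web;

public class CometHandler : IHttpHandler
{
    // Whatever part of the app produces new data for clients updates this field.
    private static volatile string latestMessage = "";

    public static void Publish(string message)
    {
        latestMessage = message;
    }

    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        // The browser sends the last message it saw; we hold the request open
        // (up to ~30 seconds) until there is something newer to push down.
        string lastSeen = context.Request.QueryString["last"] ?? "";
        for (int i = 0; i < 300; i++)
        {
            string current = latestMessage;
            if (current.Length > 0 && current != lastSeen)
            {
                context.Response.ContentType = "text/plain";
                context.Response.Write(current);
                return;
            }
            Thread.Sleep(100);
        }
        // Nothing new - return empty and the browser immediately re-requests.
        context.Response.ContentType = "text/plain";
    }
}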

I was a little bit surprised that the book only gave two pages to CSS Sprites. My testing shows that sprites can significantly improve web site load times if the site has multiple small images coming from the server.

Web app efficiency is going to play a more and more significant role as we return to the mainframe model with cloud computing. Many of the cloud computing providers charge by processor cycles, so the fewer of these your app uses the less expensive it is to run in the cloud. On top of that, because you are processing your requests faster, your site will be more competitive.

Sunday, November 22, 2009

SQL Injection Attack part 2

Since I last wrote about a SQL Injection Attack that one of my sites received I have taken measures to prevent it: I now immediately reject any URL with @(cast in it and don't process it any further. This has worked well over the last year and a half and no further attacks of that type have made it into the logs.
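
The check itself is trivial. Here's a rough sketch of the kind of thing I mean, written as an ASP.NET Global.asax handler (an assumption on my part; adjust for your own stack, and you may want to test both the raw and the URL-decoded form of the request):

using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        // Reject the request outright if it carries the attack signature,
        // before any page or database code gets a chance to run.
        if (Request.RawUrl.IndexOf("@(cast", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            Response.StatusCode = 404;
            Response.End();
        }
    }
}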

I have now started to see URL requests with the following pattern:

...&whichpage=3%20and%20char(124)%2Buser%2Bchar(124)=0

The significant part comes after the =3; URL-decoded it is:

 and char(124)+user+char(124)=0

which, since char(124) is the pipe character, amounts to |+user+|=0.

No idea what they're trying to achieve with this...

Friday, November 20, 2009

Rotate or flip image through Y-axis

I've been trying to work out how to rotate an image through the Y-axis and display its mirror image on a web page. One of the problems that I faced was trying to find the right word(s) to search on as rotate (so I discovered) is used to refer to rotation through the Z-axis most of the time.
I finally found a solution in Silverlight 3. Here's the code...
<Image Source="Jellyfish.jpg">
    <Image.Projection>
        <PlaneProjection RotationY="180" />
    </Image.Projection>
</Image>

...which does exactly what I'm looking for.
My favorite image editing program (Paint.Net) refers to this as Flip Horizontal which I think is completely misnamed.

<PlaneProjection RotationY="0" />




  <PlaneProjection RotationY="180" />


Hard disk capacity is getting smaller

I'm just about to go to Fry's Electronics and get myself a Seagate Barracuda LP 2TB 3.5" SATA 3.0Gb/s Hard Drive -Bare Drive (that link is to the same drive but at NewEgg). I was curious to know how big this 2TB drive would really be so I did a quick calculation and discovered that it will be 1.82 TB. As most people know, disk drive manufacturers calculate disk sizes to make them appear bigger than they actually are: they measure a KB as 1,000 bytes instead of 1,024 bytes.

This caused me to do a little calculation to see what we're actually getting when we're sold drive space:

1MB (advertised) = 0.95 MB (actual)

1GB = 0.93 GB

1TB = 0.91 TB

1PB = 0.89 PB

Do you see the progression? They're getting smaller. Eventually that number will be 0.00 Xb.
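
If you want to reproduce those numbers, the ratio is just (1,000/1,024) raised to the power of the unit. Here's a quick C# sketch (mine, nothing official) that prints the table above and the 1.82 TB figure for the 2TB drive:

using System;

class AdvertisedVsActual
{
    static void Main()
    {
        // Advertised units are powers of 1,000; "actual" (binary) units are powers of 1,024.
        string[] units = { "KB", "MB", "GB", "TB", "PB" };
        for (int n = 1; n <= units.Length; n++)
        {
            double ratio = Math.Pow(1000, n) / Math.Pow(1024, n);
            Console.WriteLine("1 {0} (advertised) = {1:0.00} {0} (actual)", units[n - 1], ratio);
        }

        // The 2TB drive: 2 x 10^12 bytes expressed in binary terabytes.
        double actualTB = 2 * Math.Pow(1000, 4) / Math.Pow(1024, 4);
        Console.WriteLine("2 TB drive = {0:0.00} TB", actualTB);
    }
}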

A PB is a petabyte, which I have always considered to be 1,024^5 bytes. However, according to Wikipedia I'm wrong: a petabyte is 1,000 terabytes, not 1,024 terabytes, and a pebibyte (PiB) is what I should be using for the binary value. Also, all my other references are apparently wrong and I should be using the Mi, Gi, Ti, and Pi prefixes.

 

 

Tuesday, November 17, 2009

Get Powershell Version Number

I wish that they'd aliased the old DOS "ver" command to Powershell's Get-Host command.

If you type "alias" at the Powershell command prompt you'll see a ton of aliases that they've set up: "dir", "copy", "cls", "compare", "md", "cd" etc. These greatly ease the transition to Powershell's commands if you want to quickly drop into a command prompt and have a PowerDOS environment.

I know that I can easily configure Powershell to alias "ver" but the only time I usually need this is when I'm on a new machine and don't know what version is running. After that I don't need it anymore. Even "Get-Ver" would be more helpful and logical than Get-Host.

Anyway, if you haven't figured it out by now:

ver = Get-Host

 

Windows Management Framework Core is already installed on your system

I've been trying to install Windows Update KB968930 on my XP SP3 machine and kept on getting this error:

Windows Management Framework Core Setup Error
Setup cannot proceed. Windows Management Framework Core is already installed on your system.

This update, I believe, is Powershell 2.0 (final).

To install Powershell 2.0 you first have to uninstall Powershell 1.0.

I finally solved it and got it installed by uninstalling some stuff that has to be removed before you uninstall Powershell 1.0. This is what I uninstalled and the order in which I did it:

  1. Update for Microsoft Windows (KB971513)
  2. Update for Windows Internet Explorer (KB975364)
  3. Microsoft Base Smart Card Cryptographic Service
  4. Windows Powershell 1.0 MUI Pack
  5. Windows Powershell 1.0
  6. Windows Management Framework Core

UPDATE 12/22/2009: You may find that after you've installed Powershell 2 you see an exception from mscorelib.exe when you reboot your computer and this error repeats three times. If you look at the Event Viewer and under System you will see an error from Service Control Manager: The .NET Runtime Optimization Service v2.0.50727_X86 service terminated unexpectedly.

To solve this problem you need to run the following in the new version of Powershell that you've just installed:

Set-Alias ngen (Join-Path ([Runtime.InteropServices.RuntimeEnvironment]::GetRuntimeDirectory()) ngen.exe)
ngen update

There will be pages and pages of informational text displayed, including errors and warnings, and this will take a fair amount of time to run. Be patient and don't panic.

Source: https://connect.microsoft.com/PowerShell/feedback/ViewFeedback.aspx?FeedbackID=494515

 

Friday, November 13, 2009

Win7 slam the sides feature

I frequently need to view two web pages side by side but I find that I have those two web pages open in separate tabs in the same browser.

What I used to do would be the following:

  1. Copy the URL from one of the tabs.
  2. Minimize all windows on the desktop (Win + D)
  3. Open a second browser.
  4. Paste in the URL and hit enter.
  5. Restore the original browser.
  6. Right click the task bar and select "Show windows side by side" (or whatever the XP equivalent was).

I've found two shortcuts to improve this process: one of them is from Windows 7 and the other seems to be a new(ish) feature in modern browsers that I just stumbled across.

  1. Grab the title bar of the browser with the mouse and slam it against the right or left edge of the screen. (This will cause the browser to fully occupy one vertical half of your screen.)
  2. Drag the tab off the browser and it will immediately create a new window.
  3. Slam that title bar against the other edge.

I've discovered that dragging the tab away from the tab bar will create a new browser window in Firefox, Chrome and Opera, but I haven't been able to get this to work in IE8.

In Firefox and Chrome you can drag the tab down onto the current page and it will pop up a new window. However, in Opera, you need to drag the tab off the browser to get it to create a new window. That's why it's better to slam the side with the browser before dragging the tab, because then you know that you have space to drag the tab onto.

Monday, November 9, 2009

Speed improvements with compiled regex

During a code review I was told that a compiled regex would work faster. I had no doubt that this was true but I wanted to know how much faster and at what cost. I set up and ran the following test.

static void TestCompiledRegex()
{
    string regexString = @"(\{{0,1}([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12}\}{0,1})";
    Regex compiledRegex = new Regex(regexString, RegexOptions.Compiled);
    Regex uncompiledRegex = new Regex(regexString, RegexOptions.None);

    double totalFaster = 0d;
    int numIterations = 0;
    for (int j = 10; j <= 100; j += 10)
    {
        TimeSpan uncompiledTime = RunRegex(uncompiledRegex, j);
        TimeSpan compiledTime = RunRegex(compiledRegex, j);
        double timesFaster = uncompiledTime.TotalMilliseconds / compiledTime.TotalMilliseconds;
        Console.WriteLine("For {0} GUIDS compiled takes {1} and non-compiled takes {2}. Compiled is {3:0.00} faster", j, compiledTime, uncompiledTime, timesFaster);
        totalFaster += timesFaster;
        numIterations++;
    }
    Console.WriteLine("Average times faster: {0:0.00}", totalFaster/(double)numIterations);
    Console.ReadLine();
}

static TimeSpan RunRegex(Regex regex, int numGuids)
{
    int x;
    string input = GetStringWithGuids(numGuids);
    DateTime startTime = DateTime.Now;
    for (int i = 0; i < 10000; i++)
    {
        MatchCollection mc = regex.Matches(input);
        x = mc.Count;
    }
    return DateTime.Now - startTime;

}

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < numGuids; i++)
    {
        sb.Append(Guid.NewGuid().ToString() + " spacer ");
    }
    return sb.ToString();
}

The result was that the compiled regex ran 1.6 times faster than the uncompiled version.

I then split the RunRegex function into two functions and moved the creation of the Regex object into these functions making one of them compiled and the other not.
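
The split looked something like this (a sketch rather than the exact code; the point is that the Regex is now constructed inside the timed function, so its construction cost, including compilation for the compiled one, is part of the measurement):

static TimeSpan RunUncompiledRegex(string regexString, int numGuids)
{
    string input = GetStringWithGuids(numGuids);
    DateTime startTime = DateTime.Now;
    // The Regex is created here, inside the timed block, instead of being passed in.
    Regex regex = new Regex(regexString, RegexOptions.None);
    for (int i = 0; i < 10000; i++)
    {
        MatchCollection mc = regex.Matches(input);
        int x = mc.Count;
    }
    return DateTime.Now - startTime;
}

// RunCompiledRegex is identical except that it passes RegexOptions.Compiled.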

On text with 10 GUIDs in it the uncompiled version ran 42 times faster than the compiled version. The number of times faster diminished as the number of GUIDs (matches) in the string increased, until the uncompiled version was six times faster when we had 100 matches in the string.

The next test was to decrease the ratio of matches to non-matches in the text, so I adjusted the GetStringWithGuids() function to add 100 "spacers" between each GUID. Remember that the GetStringWithGuids() listed above has 1 match (GUID) per non-match (" spacer "). The new function looked like this:

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    string spacer = String.Join(" ", Enumerable.Repeat("spacer", 100).ToArray());
    for (int i = 0; i < numGuids; i++)
    {
        sb.Append(Guid.NewGuid().ToString() + " " + spacer + " ");
    }
    return sb.ToString();
}

For 10 GUIDs the uncompiled version performed 2.7 times better but at 50 GUIDs the compiled version started performing better through to 100 GUIDs.

So the only test left was the one that was truly representative of the data that I was going to run this against: a block of text with a single GUID in it.

The new GetStringWithGuids() function with redundant parameter looked like this:

static string GetStringWithGuids(int numGuids)
{
    StringBuilder sb = new StringBuilder();
    string spacer = String.Join(" ", Enumerable.Repeat("spacer", 100).ToArray());
    sb.Append(spacer + " " + Guid.NewGuid().ToString() + " " + spacer);
    return sb.ToString();
}

This showed the uncompiled version to be 10 times faster than the compiled version.

Sunday, November 8, 2009

Is this string numeric in C# part 3

Following on from Is this string numeric in C# part 2...

Curiosity got the better of me and I wanted to know what the IsDigit and TryParse functions did. There's a post here by Shawn Burke that details how to set up Visual Studio to allow you to step into the .NET source. From that I found the IsDigit() source to be exactly what we'd surmised:

public static bool IsDigit(char c) {
  if (IsLatin1(c)) {
    return (c >= '0' && c <= '9');
  }
  return (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber);
}

In both scenarios you'd expect Jon's function to perform better and even more so in the numeric string. When checking the numeric string every single character needs to be checked so the additional overhead for the IsLatin1() call should slow down Guy's function more. However, the results show Guy's function performing better when the string is numeric.

When the string is non-numeric from the first character you'd expect Jon's function to perform better, but only just.

I can't explain why Guy's function is currently out-performing Jon's. There must be a flaw in my testing setup.

I was, however, completely wrong about the TryParse function. It does not throw an exception, but it is a much larger and more deeply nested function than I was expecting and it executes dozens of code paths, and this is what's slowing it down. Also, if you think about it, TryParse is converting from one type to another, so the complete conversion is taking place, while our IsDigit-based function just checks whether it's possible to convert the string to a number without actually doing the conversion. So in addition to doing the check on each character, TryParse is also writing those characters to a memory space to create a numeric value. That's where the performance hit comes in.

Jon also suggested assigning the char to a local variable before using it, since the index operator makes a function call, so I tried modifying his function to:

Func<string, bool> isNumeric_Jon =
    delegate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c > '9' || c < '0')
            {
                return false;
            }
        }
        return true;
    };

However, this made no difference in the tests. My guess is that the compiler had already optimized the two identical references into one.

 

 

Saturday, November 7, 2009

Up or down for version number on C# 4.0?

Not sure if you're up to speed on the compiler version numbering that's been used with C#, but just in case you weren't, this is the version number the compiler reports when run from the \windows\microsoft.net\framework\ folders...




We went from version 7 to 8 and then to 3.5...
That's not the most logical progression of numbers that I've ever seen. I haven't installed the C# 4.0 beta compiler yet so I don't know what the version number will be. My thinking is that if I ever went for a job interview with Microsoft I am definitely going to find out what it is because on that job IQ test they'll have a number progression question and it will be what comes after 7, 8, 3.5 __ and I will then know the answer.

Standing on the shoulders of giants

As arrogant as my domain name is "Guy Ellis Rocks Dot Com", I only chose it because Standing On The Shoulders Of Giants Dot Com was already taken. I've always attributed this quote to Sir Isaac Newton but just learned that it was originally coined by Bernard of Chartres.

I take a moment to be humble and acknowledge that almost everything that you see on this blog is a synthesis of the work of giants who have helped me and whose work I have read.

Is this string numeric in C# part 2

I heard back from Bill and Jon with some suggestions about improving the algorithm for speed in the IsNumeric(string) function documented at Is this string numeric in C#?

Bill's idea was to use the return value from Int64.TryParse() and Jon's idea was to compare each character directly against the digit characters in the ASCII sequence using if (s[i] > '9' || s[i] < '0').

My initial gut-feel reaction was that Jon's would be faster and Bill's slower. The reason that I thought that Jon's would be faster is because I am under the impression that IsDigit() checks other digits outside of the 0 to 9 range to take into account other digit sets such as Thai. Bill's I thought would be slower because I think that TryParse() catches a thrown exception when returning false and so the expense of exception handling would cause it to be slow.

The first problem that I hit was that the original numeric string that I used:

string numericString = "12345678901234567890123456";

was too large for Int64. I didn't pick this up at first, and when I ran the first set of tests they completely turned my expectations on their head, with Bill's being fastest, then mine, then Jon's. However, these results were invalid as the Int64.TryParse() call was not returning the correct results.

I fixed the test code to look like this and re-ran it all.

static string numericString = "1234567890123456";
static string nonNumericString = "abcdefghijklmnopqrstuvwxyz";
static void DigitTesting()
{
    Func<string, bool> isNumeric_Guy =
        delegate(string s)
        {
            for (int i = 0; i < s.Length; i++)
            {
                if (!Char.IsDigit(s[i]))
                {
                    return false;
                }
            }
            return true;
        };

    Func<string, bool> isNumeric_Jon =
        delegate(string s)
        {
            for (int i = 0; i < s.Length; i++)
            {
                if (s[i] > '9' || s[i] < '0')
                {
                    return false;
                }
            }
            return true;
        };
    Func<string, bool> isNumeric_Bill =
        delegate(string s)
        {
            Int64 number;
            if (Int64.TryParse(s, out number))
            {
                return true;
            }
            return false;
        };

    Func<string, bool>[] functions = {isNumeric_Guy, isNumeric_Jon, isNumeric_Bill };

    foreach (Func<string, bool> function in functions)
    {
        DateTime startTime = DateTime.Now;
        for (int i = 0; i < 100000000; i++)
        {
            function(numericString);
        }
        Console.WriteLine("numeric: {0} time taken: {1}",
            function.Method.Name, DateTime.Now - startTime);
    }
    Console.ReadLine();
}


Against a numeric string IsNumeric_Guy() turned out to be about 15% faster than IsNumeric_Jon() and about 31% faster than IsNumeric_Bill().

When testing against a non-numeric string IsNumeric_Jon() showed up 16% faster than IsNumeric_Guy() and 87% faster than IsNumeric_Bill(). Jon's being fastest is how I would have expected the results to come out in both cases.

This is an interesting conundrum to end up with: two algorithms, each beating the other by roughly the same margin on different kinds of data. Luckily I know that most of the data that this algorithm will be looking at will be non-numeric strings, so I will go with Jon's algorithm.

Thanks Bill and Jon - your ideas and help are much appreciated.

Continued in Is this string numeric in C# part 3.

Friday, November 6, 2009

Is this string numeric in C#?

On a project completely unrelated to my previous post about Optimizing a custom Trim() function in C# (where I was stripping non-numeric characters from either end of a string that may have digits in it), I found myself battling a slow-running console application that needed profiling.

Someone had recommended EQATEC Profiler to me and so I ran it against this console app to find where the slowness was coming from. It turned out that the bottleneck was in an IsNumeric function:

private readonly static Regex NumericRegex =
    new Regex(@"^[0-9]+$", RegexOptions.Compiled);
bool IsNumeric(string Text)
{
    return NumericRegex.IsMatch(Text);
}

When I discovered how frequently this was being called I went about optimizing it and came up with:

bool IsNumeric(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (!Char.IsDigit(s[i]))
        {
            return false;
        }
    }
    return true;
}

I then tested the two functions on two different strings:

string numericString = "12345678901234567890123456";
string nonNumericString = "abcdefghijklmnopqrstuvwxyz";

I wanted to see how they would fare against each other. On the numeric string the new function ran 3 times faster than the old regex one. On the non-numeric string it was a factor of nine. The reason for the huge difference in performance gain between the two types of strings should be fairly obvious: for the non-numeric string the new function bails out immediately after inspecting the first character while the regex would still examine the whole string, whereas for the numeric string both functions need to examine every single character before returning.

More tests with two new algorithms from a comment and an email: Part 2

Delegates and Lambdas in C#

I was just explaining delegates and lambdas to a friend and thought that I'd write up an example for him.

Here is an IEnumerable<> that we want to interrogate to get all values greater than or equal to 60:
int[] scores = { 45, 55, 59, 65, 66, 67, 68 };

Here is how we could do it using a delegate. We declare a Func<T, TResult> type, which is analogous to a pointer to a function in C. In the following LINQ statement we pass that function pointer (delegate) as the "predicate" in the Where "clause." The Where() extension method will pass in each int one at a time and expect a bool true/false result back. That's why we declared the delegate as a Func<int, bool>: the int is what we're passing in and the bool is the return type.
Func<int, bool> predicate = delegate(int score) { return score >= 60; };
IEnumerable<int> passingScores1 = scores.Where(predicate);

The lambda syntax makes the above much easier for us. In the parameter to the Where() extension method we can declare the body of the function so long as it adheres to the prototype of int parameter and bool return.
IEnumerable<int> passingScores2 = scores.Where((int score) => score >= 60);

The compiler can infer the type that's being passed in as the parameter so we can drop the (int score) and replace it with score if you want even more brevity.
IEnumerable<int> passingScores3 = scores.Where(score => score >= 60);

Func<> has 5 overloads:

Func<TResult>
Func<T, TResult>
Func<T1, T2, TResult>
Func<T1, T2, T3, TResult>
Func<T1, T2, T3, T4, TResult>

In best coding practices, what is the maximum number of parameters a function should have? Well, the designers of the Func<> delegate believe that it's four and I don't disagree with them. I am guilty of using more than four, but that's mostly out of the laziness of writing and populating a class to pass instead of the individual parameters.
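
For what it's worth, the class-instead-of-parameters refactor I'm too lazy to do looks something like this (a made-up example, following the same snippet style as above):

// Instead of this...
void CreateOrder(string customer, string product, int quantity,
                 decimal unitPrice, DateTime shipDate, bool rushDelivery)
{
    // ...
}

// ...populate a small class and pass that instead:
class OrderRequest
{
    public string Customer { get; set; }
    public string Product { get; set; }
    public int Quantity { get; set; }
    public decimal UnitPrice { get; set; }
    public DateTime ShipDate { get; set; }
    public bool RushDelivery { get; set; }
}

void CreateOrder(OrderRequest order)
{
    // ...
}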

JavaScript library management in browsers

Yesterday Google introduced its Closure Tools which includes its Closure Library. Google's AJAX Libraries API page lists a bunch of common JavaScript libraries which include jQuery, Prototype, Scriptaculous, MooTools, Dojo, SWFObject (never heard of this one), and YUI.

One of the advantages of using Google's content distribution network (CDN) for JavaScript libraries is that if someone visiting your site had previously visited another site which used the same JavaScript library from Google's CDN then that library will already be cached and your site will load faster. This is great but I believe that things could be even better. The problem is that you might be using a different CDN to other sites and will not benefit from this caching.

Browsers should be able to manage a library of common JavaScript libraries along with their versions. Your JavaScript code should be able to ask for a particular library and version, and the browser should either have it already (installed when the browser was installed) or, if the library was released after the initial installation, pick it up with regular browser updates.

Thursday, November 5, 2009

Optimizing a custom Trim() function in C#

I needed to write a Trim() function that removed anything that wasn't a digit from the beginning and end of a string. What usually happens when I write a blog entry like this is that someone posts a comment a few days later saying "hey, you didn't need to write that, it's already in the .NET library." If it is, I haven't been able to find it.

Here is my TrimNonDigits() function first attempt:

protected string TrimNonDigits_v1(string text)
{
    while (text.Length > 0 && !Char.IsDigit(text, 0))
    {
        text = text.Substring(1);
    }
    while (text.Length > 0 && !Char.IsDigit(text, text.Length - 1))
    {
        text = text.Substring(0, text.Length - 1);
    }

    return text;
}

Now this function works, and for what I needed it for (trivially short strings) it would do the job. But it is my intention to add it to a string extension library, and I had a fear that someone would pass in an enormous string and, with all the Substring() operations creating new strings, this could become extremely inefficient.

To see how inefficient this function was I wrote another unit test to specifically measure the speed at which the code ran. Here is that unit test:

[TestMethod]
public void trim_non_digits_big_text()
{
    int repeat = 10000;
    UnitTestDerived utd = new UnitTestDerived();
    string lotsOfCharacters = String.Join("", Enumerable.Repeat("abcdef", repeat).ToArray());
    string text = lotsOfCharacters + "1" + lotsOfCharacters;
    string expected = "1";
    string actual = utd.TrimNonDigits(text);
    Assert.AreEqual(expected, actual);
}

The unit test passes but takes 21 seconds to run. Although it's extremely unlikely that this amount of non-digit text would appear on either side of a number, it's still a valid use case, so the code needed reworking so that the Substring() function is only called once. This is the resulting change to the code:

protected string TrimNonDigits(string text)
{
    int start = 0, end = text.Length - 1;
    while (text.Length > start && !Char.IsDigit(text, start))
    {
        start++;
    }
    if (start != end)
    {
        while (end > start && !Char.IsDigit(text, end))
        {
            end--;
        }
    }
    return text.Substring(start, 1 + end - start);
}

The unit test runs against this new function in under 1 second. That's a dramatic difference and a good demonstration of why you shouldn't call Substring() repeatedly in a loop if you can help it.

A side note: I also had several other unit tests that checked the validity of the algorithm's results before I did the optimization. These included tests of edge cases and extremes. This made the optimization refactor a cinch because I was fairly confident at the end that the algorithm would work in all situations.

Tuesday, November 3, 2009

MultiThread testing for speed

I recently wrote this little test program to validate that multi-threading was working the way I expected it to.

class Program
{
    private delegate void funcs();

    static void Main(string[] args)
    {
        funcs[] functions = new funcs[]
        {
            new funcs(MultiThreadTest),
            new funcs(SingleThreadTest)
        };
        foreach (funcs function in functions)
        {
            DateTime startTime = DateTime.Now;
            for (int i = 0; i < 10; i++)
            {
                function();
            }
            while (ThreadWorker.ThreadCount > 0)
            {
                Thread.Sleep(1);
            }
            Console.WriteLine("{0} time taken: {1}",
                function.Method.Name, DateTime.Now - startTime);
        }
        Console.ReadLine();
    }

    private static void MultiThreadTest()
    {
        ThreadWorker threadWorker = new ThreadWorker();
        ThreadStart threadDelegate =
            new ThreadStart(threadWorker.Runner);
        Thread newThread = new Thread(threadDelegate);
        newThread.Start();
    }
    private static void SingleThreadTest()
    {
        ThreadWorker threadWorker = new ThreadWorker();
        threadWorker.Runner();
    }
}

class ThreadWorker
{
    public static int ThreadCount = 0;

    public ThreadWorker()
    {
        // Increment atomically - the main thread increments while worker threads may be decrementing.
        Interlocked.Increment(ref ThreadCount);
    }
    public void Runner()
    {
        IEnumerable<string> strings =
            Enumerable.Repeat(
            @"the not so quick brown fox was
caught by the hen and eaten with eggs", 50);
        string text = String.Join(" ",
            strings.SelectMany(a => a.Split()).ToArray());
        for (int i = 0; i < 1000; i++)
        {
            // Do some random LINQ stuff to occupy the processor
            string[] list = text.Split(' ');
            string[] l2 = list.Distinct().ToArray();
            l2 = list.OrderBy(a => a).ToArray();
            l2 = list.OrderByDescending(a => a).ToArray();
        }
        // Multiple worker threads finish concurrently, so decrement atomically.
        Interlocked.Decrement(ref ThreadCount);
    }
}

I tested this bit of code by running it four times on each of two machines.

The first was a dual-core and the single-threaded test took an average of 33.4 seconds and the multi-threaded test 17.8 seconds. This makes the multi-threaded part 1.9 times faster, which is about what you'd expect if you allow two processors to work on the problem in parallel.

The second machine was a quad core with hyper-threading, so essentially eight logical cores. This took an average of 29.3 seconds for the single-threaded test and 4.6 seconds for the multi-threaded test. That's an improvement factor of 6.4, not as close to 8 as I was expecting but not that far off. If anybody knows why the eight cores do not come as close to an eight-times factor as the two cores came to a two-times factor I'd love to hear from you in the comments.

The two processors were:

  • Intel Core 2 CPU 6400 @ 2.13 GHz
  • Intel Xeon CPU L5410 @ 2.33GHz

Something that I found interesting was that the slower processor, when taking advantage of multi-threading, ran 1.65 times faster than the faster processor running single-threaded. This has important implications for the software that you write. The single-threaded test ran 1.14 times (14%) faster on the faster processor, yet the multi-threaded code on the slower processor runs 65% faster than the single-threaded code on the faster processor.

If you're looking for a performance boost there may be more to gain from multi-threaded code than from a faster processor. In fact, processor speed is probably not what you're looking for. The best combination would be multi-threaded code on multi-core boxes.

 

Monday, November 2, 2009

Using LINQ to join two string lists without repeats

I have a list of members that have to play a game against each other and I want to generate a complete list of all the members against every other member without repeating any games. My list looks like this:

string[] members = {"Alphie", "Jerome", "Silky", "Buzz" };

My first attempt at generating a list using LINQ was this:

IEnumerable<string> q = from one in members
                        from two in members
                        select one + " plays " + two;

which resulted in:

Alphie plays Alphie
Alphie plays Jerome
Alphie plays Silky
Alphie plays Buzz
Jerome plays Alphie
Jerome plays Jerome
Jerome plays Silky
Jerome plays Buzz
Silky plays Alphie
Silky plays Jerome
Silky plays Silky
Silky plays Buzz
Buzz plays Alphie
Buzz plays Jerome
Buzz plays Silky
Buzz plays Buzz

This is not exactly what I was looking for. I wonder who the winner would have been in Buzz versus Buzz?

The secret is to put a where clause before the select statement:

where one.CompareTo(two) < 0

This will eliminate a member playing themselves (when one.CompareTo(two) == 0), and because each pair is only kept with the alphabetically-earlier name first, the reverse pairing is eliminated so no two members play each other a second time. This is the complete code snippet:

IEnumerable<string> q = from one in members
                        from two in members
                        where one.CompareTo(two) < 0
                        select one + " plays " + two;

and here are the revised results:

Alphie plays Jerome
Alphie plays Silky
Alphie plays Buzz
Jerome plays Silky
Buzz plays Jerome
Buzz plays Silky