Regular Expressions in C# (including a new comprehensive email pattern)

Of course C# supports regular expressions. I happen to have learned regular expressions in my dealings with FreeBSD, shell scripting, php, and other open source work. So naturally I would want to add this as a skill as I develop in C#.

What is a Regular Expression?

This is a method in code or script to describe the format or pattern of a string. For example, look at an email address:

someuser@somedomain.tld

It is important to understand that we are not trying to compare the email string against another string, we are trying to compare the string against a pattern.

To verify the email was in the correct format using String functions, it would take dozens of different functions running one after another.  However, with a regular expression, a proper email address can be verified in one single function.

So instead regular expression is a language, almost like a scripting language in itself, for defining character patterns.

Most characters represent themselves.  However, some characters don’t represent themselves without escaping them with a backslash because they represent something else.  Here is a table of those characters.

Expression Meaning
* Any number of the previous character or character group.
+ One of more of the previous character or character group.
^ Beginning of line or string.
$ End of line or string.
? Pretty much any single character.
. Pretty much any character, zero characters, one character, or any number of characters
[ ... ] This forms a character class expression
( … ) This forms a group of items

You should look up more regular expression rules. I don’t explain them all here. This is just to give you an idea.

Example 1 – Parameter=Value

Here is a quick example of a regular expression that matches String=String. At first you might think this is easy and you can use this expression:

.*=.*

While that might work, it is very open. And it allows for zero characters before and after the equals, which should not be allowed.

This next pattern is at least correct but still very open.

.+=.+

What if the first value is limited to only alphanumeric characters?

[a-zA-z0-9]=.+

What if the second value has to be a valid windows file path or URL? And we will make sure we cover start to finish as well.

^[0-9a-zA-Z]+=[^<>|?*\"]+$

See how the more restrictions you put in place, the more complex the expression gets?

Example 2 – The email address

The pattern of an email is as follows: (Reference: wikipedia)

See updates here: C# – Email Regular Expression

  1. It will always have a single @ sign
  2. 1 to 64 characters before the @ sign called the local-part. Can contain characters a–z, A–Z, 0-9, ! # $ % & ‘ * + – / = ? ^ _ ` { | } ~, and . if it is not at the first or end of the local-part.
  3. Some characters after the @ sign that have a pattern as follows called the domain.
    1. It will always have a period “.”.
    2. One or more character before the period.
    3. Two to four characters after the period.

So a simple patterns of an email address should be something like these:

  1. This one just makes sure there are characters before and after the @
    .+@.+
  2. This one makes sure the are characters before and after the @ as well as a character before and after the . in the domain.
    .+@.*+\..+
  3. This one makes sure that there is only one @ symbol.
    [^@]+@[^@]+\.

This are all quick an easy examples and will not work in every instance but are usually accurate enough for casual programs.

But a comprehensive example is much more complex.

  1. I wrote one myself that is the shortest and gets the best results of any I have found:
    ^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$
    
    
  2. Here is another complex one I found: [reference]
    ^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
    

So let me explain the first one that I wrote as it passes my unit tests below:

The start
[\w!#$%&'*+\-/=?\^_`{|}~]+ At least one valid local-part character not including a period.
(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)* Any number (including zero) of a group that starts with a single period and has at least one valid local-part character after the period.
@ The @ character
( Start group 1
( Start group 2
([\-\w]+\.)+ At least one group of at least one valid word character or hyphen followed by a period
[\w]{2,4} Any two to four valid top level domain characters.
) End group 2
| an OR statement
( Start group 3
([0-9]{1,3}\.){3}[0-9]{1,3} A regular expression for an IP Address.
) End group 3
) End group 1

Code for both examples

Here is code for both examples. My email regular expression is enabled and the one I found on line is commented out. To see how they work differently, just comment out mine, and uncomment the one I found online.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegularExpressionsTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Example 1 - Parameter=value
            // Match any character before and after the =
            // String thePattern = @"^.+=.+$";

            // Match only Upper and Lowercase letters and numbers before
            // the = as a parameter name and after the equal match the
            // any character that is allowed in a file's full path
            //
            // ^[0-9a-zA-Z]+    This is any number characters upper or lower
            //                  case or 0 thru 9 at the string's beginning.
            //
            // =                Matches the = character exactly
            //
            // [^<>|?*\"]+$     This is any character except < > | ? * "
            //                  as they are not valid in a file path or URL

            String theNameEqualsValue = @"abcd=http://";

            String theParameterEqualsValuePattern = "^[0-9a-zA-Z]+=[^<>|?*\"]+$";
            bool isParameterEqualsValueMatch = Regex.IsMatch(theNameEqualsValue, theParameterEqualsValuePattern);
            Log(isParameterEqualsValueMatch);

            // Example 2 - Email address formats

            String theEmailPattern = @"^[\w!#$%&'*+\-/=?\^_`{|}~]+(\.[\w!#$%&'*+\-/=?\^_`{|}~]+)*"
                                   + "@"
                                   + @"((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$";

            // The string pattern from here doesn't not work in all instances.
            // http://www.cambiaresearch.com/c4/bf974b23-484b-41c3-b331-0bd8121d5177/Parsing-Email-Addresses-with-Regular-Expressions.aspx
            //String theEmailPattern = @"^(([^<>()[\]\\.,;:\s@\""]+(\.[^<>()[\]\\.,;:\s@\""]+)*)|(\"".+\""))"
            //                       + "@"
            //                       + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])"
            //                       + "|"
            //                       + @"(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$";

            Console.WriteLine("Bad emails");
            foreach (String email in GetBadEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }

            Console.WriteLine("Good emails");
            foreach (String email in GetGoodEmails())
            {
                Log(Regex.IsMatch(email, theEmailPattern));
            }
        }

        private static void Log(bool inValue)
        {
            if (inValue)
            {
                Console.WriteLine("It matches the pattern");
            }
            else
            {
                Console.WriteLine("It doesn't match the pattern");
            }
        }

        private static List GetBadEmails()
        {
            List emails = new List();
            emails.Add("joe"); // should fail
            emails.Add("joe@home"); // should fail
            emails.Add("a@b.c"); // should fail because .c is only one character but must be 2-4 characters
            emails.Add("joe-bob[at]home.com"); // should fail because [at] is not valid
            emails.Add("joe@his.home.place"); // should fail because place is 5 characters but must be 2-4 characters
            emails.Add("joe.@bob.com"); // should fail because there is a dot at the end of the local-part
            emails.Add(".joe@bob.com"); // should fail because there is a dot at the beginning of the local-part
            emails.Add("john..doe@bob.com"); // should fail because there are two dots in the local-part
            emails.Add("john.doe@bob..com"); // should fail because there are two dots in the domain
            emails.Add("joe<>bob@bob.come"); // should fail because <> are not valid
            emails.Add("joe@his.home.com."); // should fail because it can't end with a period
            emails.Add("a@10.1.100.1a");  // Should fail because of the extra character
            return emails;
        }

        private static List GetGoodEmails()
        {
            List emails = new List();
            emails.Add("joe@home.org");
            emails.Add("joe@joebob.name");
            emails.Add("joe&bob@bob.com");
            emails.Add("~joe@bob.com");
            emails.Add("joe$@bob.com");
            emails.Add("joe+bob@bob.com");
            emails.Add("o'reilly@there.com");
            emails.Add("joe@home.com");
            emails.Add("joe.bob@home.com");
            emails.Add("joe@his.home.com");
            emails.Add("a@abc.org");
            emails.Add("a@192.168.0.1");
            emails.Add("a@10.1.100.1");
            return emails;
        }
    }
}

16 Comments

  1. Rhyous says:

    I am going to have to watch for the new Top Level Domains (TLDs) as they may have TLDs with more than four characters coming in 2013.
    http://newgtlds.icann.org/en/program-status/application-results/strings-1200utc-13jun12-en

  2. NRK says:

    I am also using MSDN RegEx, it's working fine checking single-character domain.

  3. Igor says:

    How come email like 'aпп@dgh.com' is passing validation?

  4. Quantbuff says:

    Domain names does not start with a hyphen so you regex may not invalidate that ?

  5. Will says:

    Great Post. But I think there's one mistake in the validation of the top level domain portion of the email. As written, it limits the TLD to 1-3 characters. But, there are TLD's that are more than 3 characters (museum and info just to name two, for a full list see http://data.iana.org/TLD/tlds-alpha-by-domain.txt). As of now, I don't think there are any 1 character TLD's, but I'm not sure if this is limited by specification, or just custom. A safer test might be {2,}.

  6. MadQ says:

    You can make your regular expression even shorter by using zero-width negative assertions to ensure that the email address does not begin with period, and that the local part does not end with a period:

    ^(?!\.)[\.\w!#$%&'*+\-/=?\^_`{|}~]{1,64}(?<!\.)@((([\-\w]+\.)+[a-zA-Z]{2,4})|(([0-9]{1,3}\.){3}[0-9]{1,3}))$

    Note that this will also allow two or more consecutive periods in the local part, which your regular expression does not. I'm not sure if consecutive periods in the local part are valid, and I'm too lazy to look it up right now ;-)

  7. Paul says:

    Thanks for posting this. I was using the one published by MSDN: (http://msdn.microsoft.com/en-us/library/01escwtf.aspx) which would not allow a single-character subdomain such as foo@a.com. Yours appears to work correctly.

Leave a Reply

Powered by sweet Captcha