[UPHPU] Email address format in regular expression

Mac Newbold mac at macnewbold.com
Wed Dec 28 16:54:50 MST 2005


Today at 1:40pm, Cabot Nelson said:

> Using the ereg() function in PHP (http://us2.php.net/manual/en/function.ereg.php), I'm using this lengthy expression to determine a properly formatted email address:
>
> 
>^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$
>
> However, I know very little about regular expressions and don't know if
> this is accurate. It seems to work. Is this the best expression for an
> email address? Is there an *official* format as determined by the email
> protocol?

Yes, there is a standard. It started out as RFC 822, which has since been
replaced by RFC 2822. That covers the whole address, how to put it in a
message, etc., but it mostly matters for the username part (before the @),
because the hostname part of the address (after the @) is determined by
the DNS standards. The hostname rules are in RFC 952 and RFC 1123, and
they say basically that DNS names are case insensitive and allow only
a-z, 0-9, and a dash (-), in one or more parts separated by periods (.).
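
As a rough illustration of that hostname rule, here is a minimal sketch in
PHP. The function name looks_like_dns_name() is made up for this example,
and it only checks the syntax described above (letters, digits and dashes,
in parts separated by periods), not whether the name really exists.

```php
<?php
// Hypothetical helper: syntax check only, no DNS lookup is performed.
function looks_like_dns_name(string $host): bool
{
    // Split on periods and check every part on its own.
    foreach (explode('.', $host) as $label) {
        // /i makes a-z match A-Z too; /D stops $ from matching
        // just before a trailing newline. An empty part fails the +.
        if (!preg_match('/^[a-z0-9-]+$/iD', $label)) {
            return false;
        }
    }
    return true;
}

var_dump(looks_like_dns_name('uphpu.org'));      // true
var_dump(looks_like_dns_name('UPHPU.ORG'));      // true, case does not matter
var_dump(looks_like_dns_name('bad_name.org'));   // false, underscore not allowed
var_dump(looks_like_dns_name('two..dots.org'));  // false, empty part
```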

That expression is pretty crazy, but it looks mostly accurate. Most of the
extra baggage in it is there because they want to accept addresses with
an IP address as the host name, e.g. uphpu@[12.174.39.4] (their expression
requires the square brackets). I don't personally think that matters
much, since almost nobody does that.

If you take that part out, you get this:

^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*)$

That's pretty accurate, but still longer than it needs to be. I posted an 
article many moons ago to the uphpu site about the way I do it:

http://uphpu.org/article.php?story=20050106152516690

I'd recommend using preg instead of ereg, because it's more powerful, more 
portable, more widely used, and faster, but it's not a big deal to use 
ereg instead, other than the slight changes in syntax. (We discussed these 
in a presentation months ago, too: "Strings and regular expressions in 
PHP" http://uphpu.org/article.php?story=20050128093053370 See the second 
half.)

The expression I use is this:

preg_match('/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/i', $e, $grab)

This is different in a few ways. It combines a-zA-Z into just a-z by using
the /i flag for case insensitive matching. It doesn't treat a . specially
in the username part of the address, meaning mine allows . to come at the
beginning or end of the username, or two in a row, where theirs does not.
(Strictly speaking the standard sides with them here: an unquoted username
can't start or end with a dot or have two in a row. Mine is just more
lenient.) Mine also allows + in the username, which theirs rejects. There
is a similar difference in the last half, where they don't allow - at the
beginning or end of any part of the DNS name. The hostname rules agree
with them on that too, so theirs is the stricter check. Technically you
only need to have one part after the @, but since that typically only
applies within a local network, I always require two or more parts in the
domain name.
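
A minimal sketch of how an expression like this could be wrapped in a
function. The name email_host_or_null() and the sample addresses are made
up for illustration, and the D modifier is an extra assumption so that $
does not match before a trailing newline. Inside the character class the
dash is placed last, and the period between the parts of the domain is
escaped, so that both are taken literally.

```php
<?php
// Hypothetical wrapper: returns the host part when the address passes
// the syntax check, or null when it does not.
function email_host_or_null(string $e): ?string
{
    $pattern = '/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/iD';
    if (preg_match($pattern, $e, $grab) === 1) {
        return $grab[1];  // first capture group: everything after the @
    }
    return null;
}

var_dump(email_host_or_null('someone@example.com'));      // "example.com"
var_dump(email_host_or_null('.odd..name.@example.org'));  // "example.org", dots are not special
var_dump(email_host_or_null('someone@localhost'));        // null, needs two or more parts
var_dump(email_host_or_null('a@b@example.com'));          // null, only one @ allowed
```

The captured host is what you would hand to a DNS lookup afterwards.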

The article on the site also covers another thing I do to validate email 
addresses. Besides passing the regular expression test, I take the host 
name and look it up in DNS to make sure it exists. For example, when I put 
in mac at macnewbold.com as my address, my checker would look up 
macnewbold.com and make sure that it was real. This is really helpful for 
catching bogus addresses and typos, and can prevent a lot of bounced mail 
and broken accounts and lockouts. (With a bad email address on file, they
never get a password reset email :) ). The domain name check catches
things like "dont.bug.me@go.away.please" that people might put in and
that (correctly) pass the regular expression. Don't be tempted to just
check for a .com/.net/.org/.gov/.mil/.us/.biz/.info/etc. at the end,
because the list of top level domains is always growing. Doing a real DNS
check handles it beautifully.
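
A minimal sketch of that kind of DNS check, using PHP's checkdnsrr(). The
function name email_domain_exists() and the $lookup parameter are
hypothetical; the parameter is only there so the check can be tried out
without touching the network. Looking for an MX record first and then an A
record is an assumption, not necessarily what the article does.

```php
<?php
// Hypothetical helper. $lookup is any callable taking (host, record type)
// and returning bool; when omitted, PHP's own checkdnsrr() is used.
function email_domain_exists(string $email, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'checkdnsrr';

    // Take everything after the last @ as the host name.
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }
    $host = substr($email, $at + 1);
    if ($host === '') {
        return false;
    }

    // Assumption: accept the domain if it has a mail exchanger (MX)
    // or, failing that, a plain address (A) record.
    return $lookup($host, 'MX') || $lookup($host, 'A');
}

// Stand-in resolver for demonstration: pretends only example.com exists.
$fake = function (string $host, string $type): bool {
    return $host === 'example.com' && $type === 'MX';
};

var_dump(email_domain_exists('someone@example.com', $fake));         // true
var_dump(email_domain_exists('dont.bug.me@go.away.please', $fake));  // false
```

Calling email_domain_exists($address) without the second argument does the
real lookup, so it needs a working resolver and can be slow when DNS is.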

One other thing to make sure of: the regular expression should start with
a ^ and end with a $. These make sure that it matches the whole string,
not just part of it. Without those, it will accept any string that
_contains_ a valid email address instead of one that _is_ a valid email
address.
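
To see the difference, here is a small sketch comparing the same pattern
with and without the anchors. The function names and the sample input are
made up for illustration, and the D modifier on the anchored version is an
extra assumption so that $ does not match before a trailing newline.

```php
<?php
// Hypothetical helpers: same pattern, with and without ^ and $.
function contains_address(string $s): bool
{
    // No anchors: succeeds if an address appears anywhere in the string.
    return preg_match('/[a-z0-9.+_-]+@[a-z0-9-]+(\.[a-z0-9-]+)+/i', $s) === 1;
}

function is_address(string $s): bool
{
    // Anchored: the whole string has to be an address and nothing else.
    return preg_match('/^[a-z0-9.+_-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$/iD', $s) === 1;
}

$input = 'hello someone@example.com; and some junk';

var_dump(contains_address($input));           // true, junk slips through
var_dump(is_address($input));                 // false
var_dump(is_address('someone@example.com'));  // true
```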

Thanks,
Mac

--
Mac Newbold		MNE - Mac Newbold Enterprises, LLC
mac at macnewbold.com	http://www.macnewbold.com/

