[UPHPU] Email address format in regular expression
Mac Newbold
mac at macnewbold.com
Wed Dec 28 16:54:50 MST 2005
Today at 1:40pm, Cabot Nelson said:
> Using the ereg() function in PHP (http://us2.php.net/manual/en/function.ereg.php), I'm using this lengthy expression to determine a properly formatted email address:
>
>
>^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((\[(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5])))\.(((([0-1])?([0-9])?[0-9])|(2[0-4][0-9])|(2[0-5][0-5]))\]))|((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*))$
>
> However, I know very little about regular expressions and don't know if
>this is accurate. It seems to work. Is this the best expression for an
>email address? Is there an *official* format as determined by the email
>protocol?
Yes, there is a standard. The original is RFC 822, which has since been
superseded by RFC 2822. It covers the whole address, how to put it in a
message, etc., but it mostly matters for the username part (before the @),
because the hostname part of the address (after the @) is determined by
the DNS standards (RFC 1035, with the host name rules in RFC 952 and
RFC 1123). Those say basically that DNS names are case insensitive and
allow only a-z, 0-9, and a dash (-), in one or more parts separated by
periods (.).
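As a rough illustration of that host name rule, here is a minimal sketch in PHP. The function name looks_like_dns_name() is made up for this example, and it only checks the character set and the dot-separated structure described above, not length limits or where a dash may appear.

```php
<?php
// Hypothetical helper, not from the original post: checks that a host name
// is one or more dot-separated parts made only of a-z, 0-9 and "-".
function looks_like_dns_name(string $host): bool
{
    // The /i flag makes the match case insensitive, as DNS names are.
    return preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)*$/i', $host) === 1;
}

var_dump(looks_like_dns_name('macnewbold.com'));  // bool(true)
var_dump(looks_like_dns_name('MacNewbold.COM'));  // bool(true)
var_dump(looks_like_dns_name('mac_newbold.com')); // bool(false), underscore
var_dump(looks_like_dns_name('macnewbold..com')); // bool(false), empty part
```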
The expression you posted is a pretty crazy one, but it looks mostly
accurate. Most of the extra baggage in it is there because they want to
accept emails with a bracketed IP address as the host name, e.g.
uphpu at [12.174.39.4], which I don't personally think matters much,
since almost nobody does that.
If you take that part out, you get this:
^([a-zA-Z0-9_\-])+(\.([a-zA-Z0-9_\-])+)*@((([a-zA-Z0-9])+(([\-])+([a-zA-Z0-9])+)*\.)+([a-zA-Z])+(([\-])+([a-zA-Z0-9])+)*)$
That's pretty accurate, but still longer than it needs to be. I posted an
article many moons ago to the uphpu site about the way I do it:
http://uphpu.org/article.php?story=20050106152516690
I'd recommend using preg instead of ereg, because it's more powerful, more
portable, more widely used, and faster, but it's not a big deal to use
ereg instead, other than the slight changes in syntax. (We discussed these
in a presentation months ago, too: "Strings and regular expressions in
PHP" http://uphpu.org/article.php?story=20050128093053370 See the second
half.)
The expression I use is this:
preg_match('/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/i', $e, $grab)
This is different in a few ways. It combines a-zA-Z into just a-z by using
the /i flag for case insensitive matching. It doesn't treat a . specially
in the username part of the address, meaning mine allows . to come at the
beginning or end of the username, or two in a row, where theirs does not.
(Strictly speaking, RFC 2822 only allows that inside a quoted username, so
mine is the more permissive of the two.) There is a similar difference in
the last half, where they don't allow - at the beginning or end of any
part of the DNS name. A name like that would be very odd, and the host
name rules in RFC 952 and RFC 1123 don't allow it, so theirs is stricter
and mine just lets it through. Technically you only need to have one part
after the @, but since that typically only applies within a local network,
I always require two or more parts in the domain name.
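To make those differences concrete, here is a small sketch that runs the expression against a few addresses. The sample addresses are invented for illustration, and the pattern is written with "\." between host parts and with the dash last in the character class so that both are matched literally.

```php
<?php
// The shorter expression, with a literal dot between host parts and the
// dash placed last in the character class so it is not read as a range.
$pattern = '/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/i';

// Hypothetical sample addresses, one per rule discussed above.
$samples = array(
    'Mac@MacNewbold.com',         // mixed case, accepted because of /i
    '.mac..newbold.@example.com', // dots anywhere in the username
    'mac@-example-.com',          // dashes at the edges of a host part
    'mac@localhost',              // only one part after the @
);

foreach ($samples as $e) {
    $ok = preg_match($pattern, $e, $grab) === 1;
    // $grab[1] holds the host name captured by the outer parentheses.
    echo $e, ' => ', $ok ? 'ok, host ' . $grab[1] : 'rejected', "\n";
}
```

The first three addresses are accepted and the last one is rejected, since it has no second part after the @.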
The article on the site also covers another thing I do to validate email
addresses. Besides passing the regular expression test, I take the host
name and look it up in DNS to make sure it exists. For example, when I put
in mac at macnewbold.com as my address, my checker would look up
macnewbold.com and make sure that it was real. This is really helpful for
catching bogus addresses and typos, and can prevent a lot of bounced mail
and broken accounts and lockouts. (With a bad email address on file, they
never get a password reset email :) ). The domain name check catches
things like "dont.bug.me at go.away.please" that people might put in, but
that (correctly) pass the regular expression. Don't be tempted to just
check for a dot com/net/org/gov/mil/us/biz/info/etc. at the end, because
the list of top level domains is always growing. Doing a real dns check
handles it beautifully.
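A minimal sketch of that two-step check might look like the following. The function name email_host_exists() and the optional $lookup parameter are assumptions made for this example (the code in the article may well differ), and the default lookup uses PHP's checkdnsrr(), accepting a host that has either an MX or an A record.

```php
<?php
// Hypothetical helper: format check first, then a DNS lookup on the host.
// $lookup is an assumption added so the DNS step can be swapped out;
// when it is omitted, PHP's checkdnsrr() is used.
function email_host_exists(string $e, ?callable $lookup = null): bool
{
    $pattern = '/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/i';
    if (preg_match($pattern, $e, $grab) !== 1) {
        return false; // fails the format test, no need to ask DNS
    }
    $host = $grab[1];
    if ($lookup === null) {
        // Accept the host if it has an MX record or, failing that, an A record.
        $lookup = function (string $h): bool {
            return checkdnsrr($h, 'MX') || checkdnsrr($h, 'A');
        };
    }
    return $lookup($host);
}

// Stand-in resolver so the example runs without network access.
$fake = function (string $h): bool {
    return $h === 'macnewbold.com';
};
var_dump(email_host_exists('mac@macnewbold.com', $fake));         // bool(true)
var_dump(email_host_exists('dont.bug.me@go.away.please', $fake)); // bool(false)
```

In real use you would call email_host_exists($e) with no second argument so the lookup goes to DNS.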
One other thing to make sure of: the regular expression should start
with a ^ and end with a $. These make sure that it matches the whole
string, not just part of it.
Without those, it will accept any string that _contains_ a valid email
address instead of one that _is_ a valid email address.
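A quick way to see the effect of the anchors is to run the same expression with and without them. The input string here is made up for the example.

```php
<?php
// Same expression twice: once anchored with ^ and $, once without.
$anchored   = '/^[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)$/i';
$unanchored = '/[a-z0-9.+_-]+@([a-z0-9-]+(\.[a-z0-9-]+)+)/i';

// Hypothetical input that contains an address but is not one.
$input = 'not an address, but it contains mac@example.com somewhere';

var_dump(preg_match($anchored, $input));   // int(0)
var_dump(preg_match($unanchored, $input)); // int(1)
```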
Thanks,
Mac
--
Mac Newbold MNE - Mac Newbold Enterprises, LLC
mac at macnewbold.com http://www.macnewbold.com/