Valid characters in attribute names in HTML/XML

This has been bugging me for a while: I write a fair bit of custom HTML and XML parsing code, and I kept wondering which characters are valid in an attribute name in an HTML tag, e.g.

<a href="..." name="...">thing</a>

So, what are the valid characters in HTML (or XML) for “href” and “name”, the attribute names in an HTML tag?

I finally found this, here:

http://www.w3.org/TR/2000/REC-xml-20001006#NT-Name

In short, under the XML Name production linked above, an attribute name is built as follows:

  • First character is a letter, the underscore “_”, or colon “:” (oddly!)
  • Additional (optional) characters can be: a letter, a digit, underscore, colon, period, dash, or a “CombiningChar” or “Extender” character, which I believe allows Unicode attribute names.

So, the following are all valid attribute names:

:
_
_0:funky
:.:valid-_-tag-really:.:
_.:._

Note that the W3C suggests only using colon for namespaces, so you should use it sparingly.

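The regular expression, therefore, for matching an attribute name is as follows (the trailing * lets the second class repeat, so that one-character names such as “_” also match):

[a-zA-Z_:][-a-zA-Z0-9_:.]*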

This, obviously, leaves the CombiningChar and Extender characters, along with any non-ASCII letters, out of the mix.
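
As a quick sanity check, here is a minimal Python sketch of the ASCII-only pattern. The function name is_valid_attr_name is my own invention for illustration, and the pattern carries a trailing * so that the second character class can repeat (without it, the expression matches exactly two characters). It uses fullmatch so that the whole string has to be a name, not just a prefix of it.

```python
import re

# ASCII-only approximation of the XML Name production.
# The trailing * lets the second class repeat zero or more times,
# so one-character names such as "_" and ":" are accepted.
ATTR_NAME = re.compile(r"[a-zA-Z_:][-a-zA-Z0-9_:.]*")


def is_valid_attr_name(name: str) -> bool:
    """Return True if the whole string is an ASCII attribute name."""
    # fullmatch anchors at both ends, so "3Desc" and "a b" are rejected.
    return ATTR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    samples = [":", "_", "_0:funky", ":.:valid-_-tag-really:.:", "_.:._", "3Desc", "-x", ""]
    for candidate in samples:
        print(repr(candidate), is_valid_attr_name(candidate))
```

The five example names listed earlier should all pass, while names starting with a digit or a dash, and the empty string, should not.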

And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?

I swear, when I’m looking for something simple, the mountains of documents I have to wade through make it impossible to find anything with any accuracy.

My $0.02.

7 Comments

  1. Doug Whitney wrote:

    This is exactly what I was looking for, plus the regex to boot!

    Thanks for sharing.

    Wednesday, May 6, 2009 at 4:50 pm | Permalink
  2. JP wrote:

    This has changed to a significantly larger character set, I think (from http://www.w3.org/TR/REC-xml/#NT-NameChar):
    [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
    [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
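
    As a rough sketch, the two productions above can be transcribed into a Python pattern. The names NAME_START, NAME_CHAR, XML_NAME and is_xml_name are made up for illustration; the ranges are copied straight from the productions, and I have not checked this against anything beyond a few sample names.

```python
import re

# Ranges transcribed from the NameStartChar production quoted above.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, #xB7 and two more ranges.
# The hyphen comes first so it is read as a literal inside the class.
NAME_CHAR = "-.0-9\u00B7\u0300-\u036F\u203F-\u2040" + NAME_START

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(name: str) -> bool:
    """Return True if the whole string matches the transcribed Name pattern."""
    return XML_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["href", "\u00E9t\u00E9", "_a.b-c", "3Desc", "-x", "\u00D7"]:
        print(repr(candidate), is_xml_name(candidate))
```

    Note that #xD7 (the multiplication sign) falls in the gap between two ranges, so it is not a valid start character, while accented letters such as #xE9 are.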

    Thursday, June 18, 2009 at 8:09 am | Permalink
  3. Thanks for finding that!

    Friday, June 19, 2009 at 6:12 pm | Permalink
  4. Peter wrote:

    Use ^[a-zA-Z_:][-a-zA-Z0-9_:.] (anchored with ^), or else 3Desc is not rejected!

    Monday, September 28, 2009 at 8:39 am | Permalink
  5. uchikoma wrote:

    “When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?……..”

    ANSWER

    W3C specs are obviously designed to make people go insane. I, for example, am a little egg….

    Monday, March 29, 2010 at 6:36 am | Permalink
  6. Lars wrote:

    Hell yeah, you just said what I always thought while looking for specs. It is exactly the opposite of their initial intention… I usually prefer to search the web for people (like u) who thankfully spend their time finding the right answer (for me). Once again, thank you! ;-)

    Friday, July 1, 2011 at 4:35 am | Permalink
  7. Steve Krause wrote:

    All well and good, but the . (period) gets converted to an _ (underscore) in the POST array in PHP, so that has to be taken into account when trying to match the name in the POST array. Probably worth including in the article.

    Monday, December 12, 2011 at 8:23 pm | Permalink