This has been bugging me for a while. I write a fair bit of custom HTML and XML parsing code, and I kept wondering which characters are valid in an attribute name in an HTML tag, e.g.
<a href="..." name="...">thing</a>
So, what are the valid characters in HTML (or XML) for “href” and “name”, the attribute names in an HTML tag?
I finally found this, here:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Name
In short, the rules for an HTML attribute name are:
- First character is a letter, the underscore “_”, or colon “:” (oddly!)
- Additional (optional) characters can be: a letter, a digit, underscore, colon, period, dash, or a “CombiningChar” or “Extender” character, which I believe allows Unicode attribute names.
So, the following are all valid HTML attribute names:
: _ _0:funky :.:valid-_-tag-really:.: _.:._
Note that the W3C suggests only using colon for namespaces, so you should use it sparingly.
The regular expression, therefore, for matching an HTML attribute name is as follows:
[a-zA-Z_:][-a-zA-Z0-9_:.]*
This, obviously, leaves the CombiningChar and Extender characters out of the mix.
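Here is a minimal Python sketch of how that pattern could be used to check a whole name. The function name is_valid_attr_name is my own invention for illustration. I am assuming the pattern carries a trailing * so that names of any length match, and I use fullmatch so that nothing before or after the name slips through. Like the pattern itself, it only covers ASCII.

```python
import re

# ASCII-only approximation of the XML Name rule described above.
# Assumption: a trailing "*" so the name may be one character or longer.
ATTR_NAME = re.compile(r"[a-zA-Z_:][-a-zA-Z0-9_:.]*")


def is_valid_attr_name(name: str) -> bool:
    """Return True if the entire string is an acceptable attribute name."""
    # fullmatch anchors the pattern at both ends of the string.
    return ATTR_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    candidates = [":", "_", "_0:funky", ":.:valid-_-tag-really:.:", "_.:._",
                  "3Desc", "-x", ""]
    for candidate in candidates:
        print(repr(candidate), is_valid_attr_name(candidate))
```

The first five candidates are the examples listed earlier; the last three start with a digit, start with a dash, or are empty, so they should be rejected.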
And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?
I swear, when I’m looking for something simple, the mountains of documents I have to wade through make it impossible to find anything with any accuracy.
My $0.02.
7 replies on “Valid characters in attribute names in HTML/XML”
This is exactly what I was looking for, plus the regex to boot!
Thanks for sharing.
This has changed to a significantly larger character set, I think (from http://www.w3.org/TR/REC-xml/#NT-NameChar):
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
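For anyone who wants the larger character set in a regex, here is a rough Python sketch that transcribes the two productions above into character classes. The names NAME_START, NAME_CHAR, XML_NAME and is_xml_name are made up for this example, and I have only copied the ranges as quoted, so treat it as an approximation rather than a validated implementation.

```python
import re

# Ranges transcribed from production [4] NameStartChar.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)

# Production [4a] NameChar: everything above plus "-", ".", digits,
# #xB7, and two further ranges. The hyphen is escaped so it stays literal.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(name: str) -> bool:
    """Return True if the entire string matches the transcribed Name rule."""
    return XML_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["href", "données", "名前", "a\u00B7b", "\u00D7a", "3Desc"]:
        print(repr(candidate), is_xml_name(candidate))
```

Note that #xD7 (the multiplication sign) and #xF7 (the division sign) fall in the gaps between the ranges, so a name starting with either is rejected.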
Thanks for finding that!
The pattern needs a leading ^ anchor, i.e. ^[a-zA-Z_:][-a-zA-Z0-9_:.] or else “3Desc” is not rejected!!!!
“When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for?……..”
ANSWER
W3C specs are obviously designed to make people go insane. I, for example, am a little egg….
Hell yeah, you just said what I always thought while looking for specs. It is exactly the opposite of their initial intention… I usually prefer to search the web for people (like u) who thankfully spend their time finding the right answer (for me). Once again, thank you! ;-)
All well and good, but the . (period) gets converted to an _ (underscore) in the POST array in PHP, so that has to be taken into account when trying to match the name in the POST array. Probably worth including in the article.
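To illustrate the point about PHP, here is a small Python sketch of the kind of normalization you would need before comparing an attribute name against a key from the POST array. The helper name php_post_key is hypothetical, and the rule is simplified: I am assuming only periods and spaces are turned into underscores, which is what PHP documents for incoming variable names, and I am ignoring any other mangling PHP may do.

```python
def php_post_key(name: str) -> str:
    """Approximate the key PHP would use for a submitted field name.

    Simplifying assumption: only "." and " " are replaced with "_".
    """
    return name.replace(".", "_").replace(" ", "_")


if __name__ == "__main__":
    for field in ["user.name", "first name", "plain", "_.:._"]:
        print(repr(field), "->", repr(php_post_key(field)))
```

One consequence worth keeping in mind: two different names, such as user.name and user_name, end up as the same key, so the original name cannot be recovered from the POST array alone.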