dive into mark


Friday, August 29, 2003

Won’t somebody please think of the gerbils?

I can’t think of a single topic that could be considered a more trivially unimportant form of navel-gazing than arguing about the semantics of HTML. (Oh wait, here’s one.) I personally think Greg Knauss’ definition of web standards is all you need to know on the subject, but all the weird markup-obsessed Alpha Geek freaks in the world seem to have decided that I’m worth reading because every few months or so I have a semi-enlightening rant on the subject. This is not one of those rants, but out of sympathy for those members of my audience who keep hitting refresh every few minutes like gerbils on crack, hoping for a repeat of Semantic Obsolescence, here is some light reading on the fascinating subjects of syntax, semantics, structure, validation, CSS, accessibility, and markup:

Now then. There is a veritable plethora of overlapping concepts here, and enough snake oil to choke a crack-addled gerbil.

XHTML

The successor to HTML, XHTML 1.0 is a tag-for-tag reformulation of HTML 4 in XML, and comes in several flavors (just like HTML 4). XHTML 1.1 comes in only one flavor (strict), adds a few things you’ll never use, and removes several things you use all the time (like the name attribute on A tags). XHTML 2.0 is not done yet, but current drafts are not backwardly compatible with either XHTML 1.0 or 1.1. Despite what you might have heard, XHTML is not treated as XML by modern browsers, unless you use the proper MIME type (which only about 2 dozen people do).
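To make the MIME type point concrete, here is a minimal Python sketch of the choice a browser faces. The function name `parsing_mode` is made up for illustration, and the two-way split is a simplification (real browsers also sniff content and have their own quirks), so treat it as a sketch under those assumptions and not as how any particular browser is written.

```python
def parsing_mode(content_type: str) -> str:
    """Return "xml" or "html" for a Content-Type header value (simplified)."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Only an XML media type selects the XML parser; the DOCTYPE is never consulted.
    if media_type in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"
    return "html"


# An XHTML document served as text/html still lands in the HTML parser.
print(parsing_mode("text/html; charset=utf-8"))
print(parsing_mode("application/xhtml+xml"))
```

Note that the markup itself is never inspected: the same bytes are "XHTML" or "HTML with funny slashes" depending only on the header.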

Validation

Means that your markup declares itself as a particular markup specification (using a DOCTYPE), and that it conforms to the rules of that specification. No more, no less. A validator will tell you that the structure of your markup is correct (or not); it says nothing about whether you’re using tags properly, or semantically, or accessibly, or whatever.

Validation is completely independent of whether you’re using XHTML or HTML, or whether you’re using the XHTML MIME type correctly. You can write valid XHTML 1.1, or XHTML 1.0 Strict, or XHTML 1.0 Transitional, or HTML 4.01 Strict, or HTML 4.01 Transitional, or HTML 3.2, or HTML 2.
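The first half of validation (which specification does this document claim to follow?) is easy to sketch with Python's standard `html.parser` module. The class and function names below are hypothetical, and this only reads the declaration; checking conformance against the DTD is the validator's job and is not attempted here.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Record the first DOCTYPE declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def declared_doctype(markup):
    """Return the DOCTYPE declaration text, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(markup)
    finder.close()
    return finder.doctype


# A made-up sample page that declares itself as HTML 4.01 Strict.
page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
    "<html><head><title>t</title></head><body><p>hi</p></body></html>"
)
print(declared_doctype(page))
```

A document with no DOCTYPE returns `None`, which is the case where a validator has nothing to validate against.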

The W3C Validator is the most widely-used (X)HTML validator, although the upcoming beta version goes beyond validation-against-the-spec with something called fussy parsing, and checks for common problems which, while technically valid, are known to cause problems in popular browsers. This crosses the line into being more of a linter than a pure validator. (The Feed Validator also crosses this line by flagging things like SCRIPT elements, ambiguously relative URLs, and invalid date values.)

The biggest reason to validate your markup is that, after you do it once, you can use the validator as a debugging tool to catch stupid mistakes. The second biggest reason is that if you don’t, we won’t help you. Yes, we’re elitist snobs, but we increasingly have a monopoly on talent, and we’ve all decided that valid markup is a baseline.

CSS

Means that you are separating the structure of your document from the presentation of your document. It is technically possible to have a perfectly valid XHTML Transitional document that uses tables for layout, FONT tags for styling, and spacer GIFs for pixel-perfect positioning. As long as you put ALT attributes on all your spacer images and close all your FONT and TABLE tags, it can be valid. But it sucks, because table-based layouts and FONT tags and spacer GIFs are an ongoing nightmare, while CSS is only an up-front nightmare.

Also, you can do cool things like dynamic style switchers, à la CSS Zen Garden, and printer stylesheets and so forth, if you care about that sort of thing.

But am-I-CSS-or-not is independent of whether your markup validates. You can use CSS and not validate your markup, or you can validate your markup and not use CSS, or you can do both, or neither. The primary reason CSS and validation have historically been conflated is that the sort of people who have taken the time to learn about validation are also generally the sort of people who have taken the time to learn about CSS, and who use CSS for their own designs, and who advocate CSS to others. (To further confuse the issue, there is a CSS validator which checks whether your stylesheets conform to the CSS specification. Whether your (X)HTML is valid is completely independent of whether your CSS is valid, since they’re separate specifications.)

CSS is almost completely independent of XHTML-vs-HTML. The rules for parsing CSS in an HTML environment are slightly different from parsing CSS in an XHTML environment, but keep in mind that unless you’re using the proper XHTML MIME type, you’re not in an XHTML environment anyway.

Semantic markup

Means (among a variety of definitions) that you are using tags that have specific meaning assigned to them, either in the (X)HTML specification, or in generally accepted use. As opposed to generic tags like DIV and SPAN, or Tag Soup whose result happens to look the same in one class of browsers.

HTML is flexible enough that there’s usually more than one way to do something. If you want a simple vertical list of items, you can use UL and LI tags (plus some CSS to eliminate the default bullets), or you can simply put BR tags at the end of each item. Both accomplish the same result (in a popular class of visual browsers), and both are valid, and both may or may not be styled with CSS, but one is semantically better. UL means a list of things; BR doesn’t. P means a paragraph; H1, H2, H3 etc. mean different levels of nested headings.

In visual browsers it doesn’t really matter whether you use semantically correct markup or generic markup that happens to look the same, because the meaning of your page is going to be determined by the human being with eyes who reads it on a screen. But in other environments, it matters a great deal. (More on this in a second.)

Semantic markup is independent of XHTML-vs-HTML (which makes sense, since there aren’t any new tags in XHTML that could provide new meaning). Semantic markup is independent of validation; you can produce shitty non-semantic markup that validates. Semantic markup and CSS are loosely joined, as described below.

Accessibility

Means that your content is accessible to a wide range of browsers, platforms, and users using assistive technology. Most accessibility discussion focuses on the blind and partially sighted, and the #1 important accessibility feature is ALT attributes on images, since blind people can’t see them. But there are lots of types of disabilities that intersect the web; I would say that the #2 important accessibility feature is complete keyboard navigability, since not everyone reading your page can use a mouse.

Accessibility also has nothing to do with validation; no assistive technology requires valid markup. But it does have some overlap with CSS and semantic markup. You think I’m going to say table-based layouts suck for accessibility, but I’m not; table-based layouts suck for maintenance, but they can be perfectly accessible. No, the way that accessibility intersects CSS is that CSS allows you to use proper semantic markup but make it look like what you wanted it to look like in the first place, and some of that semantic markup is specifically interpreted by assistive technology better than the non-semantic (but visually identical) alternatives.

Example: if you use real H1, H2, H3 tags to build an outline of nested headers on your page, the Home Page Reader screen reader has an option to present that outline to a blind user, to read just the header tags and let the user skip to a particular one within the page. If you wanted to get a sense of the overall structure of a page, you would visually scan it, and the bolder/larger headers and surrounding whitespace would jump out at you, and you would start reading where you wanted to start reading. Blind people can’t just scan the whole page at once; they need to accomplish the same thing in other ways, and the assistive technology they use relies on (among other things) good semantic markup to mimic the things we can do at a glance.
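Here is roughly what a tool has to do to build that outline, sketched in Python with the standard `html.parser` module. This is not how Home Page Reader is implemented (I have no idea how it is implemented); the class name, function name, and sample markup are all invented for illustration.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Collect (level, text) pairs for H1-H6 elements in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def heading_outline(markup):
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.outline


# Two made-up pages that could be styled to look the same in a visual browser.
semantic = "<h1>Markup</h1><p>intro</p><h2>XHTML</h2><p>text</p><h2>CSS</h2>"
lookalike = '<p><font size="6"><b>Markup</b></font></p><p>intro</p>'

for level, text in heading_outline(semantic):
    print("  " * (level - 1) + text)
print(heading_outline(lookalike))  # nothing to build an outline from
```

The second page may look identical on screen, but there is nothing in it for the tool to find.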

Another example: some screen readers have an option to announce the number of items in a list before they start to read it, so the user can know how long it is before they listen to each item being read. If the list is really just text separated by BR tags, this feature won’t work.
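The same idea in code: counting the items in a list is trivial when the list is marked up as a list, and impossible when it is not. This is a minimal Python sketch using the standard `html.parser` module; the names and sample markup are assumptions for illustration, not taken from any real screen reader.

```python
from html.parser import HTMLParser


class ListCounter(HTMLParser):
    """Count LI items for each UL or OL, including nested lists."""

    def __init__(self):
        super().__init__()
        self.counts = []   # item counts of finished lists, in closing order
        self._stack = []   # running counts for lists that are still open

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append(0)
        elif tag == "li" and self._stack:
            self._stack[-1] += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._stack:
            self.counts.append(self._stack.pop())


def list_lengths(markup):
    counter = ListCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


real_list = "<ul><li>syntax</li><li>semantics</li><li>structure</li></ul>"
fake_list = "syntax<br>semantics<br>structure<br>"

print(list_lengths(real_list))
print(list_lengths(fake_list))
```

With the BR version there is no list to count, so a "list of 3 items" announcement has nothing to hang on.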

I take an extremely pragmatic view of semantic markup. Semantic markup is useful as long as I can pinpoint a specific use for it, in a specific tool. Otherwise I don’t care. Proper header tags are useful for accessibility. It’s an actual menu item in Home Page Reader; if you use real header tags, that menu item works, and otherwise it doesn’t. ALT attributes are important because I’ve heard what JAWS sounds like when it tries to read images without them (it reads the filename instead, which is generally meaningless).

ALT attributes can also increase search engine relevance. After all, Googlebot is just another blind user with 100 million friends. People search by typing in keywords. If you don’t tag your images with text, Google can’t see them and match them up with those keywords, and they may as well not be there. This isn’t rocket science, but apparently most people think that Google operates by loading up your page in IE and taking screenshots.
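If you want to find the images that a screen reader (or a crawler) has nothing to say about, a few lines of Python will do it. This is a sketch with invented names and sample markup; it only flags images where the ALT attribute is absent, since an empty `alt=""` is a legitimate way to mark a purely decorative image such as a spacer GIF.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Collect the src of every IMG element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... />, because the default
        # handle_startendtag() calls handle_starttag().
        if tag == "img":
            attributes = dict(attrs)
            if "alt" not in attributes:
                self.missing.append(attributes.get("src", "(no src)"))


def images_missing_alt(markup):
    checker = AltChecker()
    checker.feed(markup)
    checker.close()
    return checker.missing


sample = (
    '<img src="spacer.gif" alt="">'
    '<img src="photo.jpg">'
    '<img src="gerbil.png" alt="A gerbil on a wheel" />'
)
print(images_missing_alt(sample))
```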

Then there’s some semantic markup that I personally make use of, even if there isn’t a wide market for it. I mark up names of people I link to (like in the list above) with the CITE tag, and I have a script that runs every night that aggregates those tags and creates posts by citation. I do a similar thing with the cite attribute of BLOCKQUOTE and Q, and create posts by quotation. (Some people also use CSS and JavaScript tricks to automatically format the cite URL, if CSS and JavaScript are available. A cute trick that helps some people and doesn’t harm anyone else.) I use the ACRONYM tag to mark up acronyms and then pull them out and list them on my accessibility statement. This is all very geeky and not of general interest, and some of it could probably be replicated with smarter code and dumber markup, but this is the balance I’ve found that works for me.
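For the curious, the general shape of a posts-by-citation index looks something like the Python sketch below. To be clear, this is not the actual nightly script; the function name, the slug-to-markup dictionary, and the sample posts and names are all made up to show the idea.

```python
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collect the text content of every CITE element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self._inside = True
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "cite" and self._inside:
            self.names.append("".join(self._text).strip())
            self._inside = False


def posts_by_citation(posts):
    """Map each cited name to the slugs of the posts that cite it."""
    index = {}
    for slug, markup in posts.items():
        collector = CiteCollector()
        collector.feed(markup)
        collector.close()
        for name in collector.names:
            slugs = index.setdefault(name, [])
            if slug not in slugs:
                slugs.append(slug)
    return index


# Hypothetical posts keyed by slug; the people are placeholders.
sample_posts = {
    "first-post": '<p><cite><a href="#">Alice Example</a></cite> says hi.</p>',
    "second-post": "<p><cite>Alice Example</cite> and <cite>Bob Example</cite>.</p>",
}
print(posts_by_citation(sample_posts))
```

The point is that the markup does the hard part: because the names are wrapped in CITE, the code never has to guess which words are names.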

However, there are lots of recurring discussions about semantics that I have no interest in whatsoever. I do not, for example, care about the distinction between ACRONYM and ABBR. I do not care about the distinction between STRONG and B, or EM and I. I know of no mainstream tool that supports one and not the other (except Internet Explorer for Windows, which brilliantly only supports ACRONYM and not ABBR). I also don’t care about the distinction between UL and OL, since general use has contradicted the spec definition for so long that the distinction is meaningless. A UL is an ordered list with bullets, because that’s how everyone uses it.

So where’s the problem? Well, there’s a lot of snake oil out there. Anyone who tells you that validation buys you semantics is selling you snake oil. You can have an entire page full of DIV and SPAN tags and be perfectly valid, and it may look perfectly good to the human eye, but it doesn’t mean anything, and any tool that relies on semantic markup won’t be able to make heads or tails of it.

Anyone who tells you that CSS guarantees you accessibility is selling you snake oil. Most accessibility techniques have nothing to do with CSS (remember, the #1 accessibility technique is marking up images with text that’s normally invisible). And where accessibility and CSS do overlap, it’s still easy to screw up if you don’t know what you’re doing, or simply go through a lot of pain for no real-world gain.

Anyone who tells you that XHTML is easier to parse, consume, or validate because it’s XML is selling you snake oil (or isn’t using the right tools). The stuff I do with citations, quotations, and acronyms, I do it with HTML 4, a standard that has been around since 1997. Headers and lists have been around since HTML 1. Anyone who tells you that XHTML buys you anything at all is most likely selling you snake oil. The only possible use I’ve seen for it is directly embedding it in syndicated feeds (in other words, in another XML vocabulary). Whether this idea has legs or not is an open question.

A common misconception is that XHTML is better because you can use XML tools (such as XSLT) to generate it (from a database, from another XML source, whatever). Unless you are using XHTML as input to some further processing, this is a bogus argument. XSLT can output HTML as easily as it can output XML. It’s only when you want to take XHTML and use it as input to some other transformation that it matters that it’s XML (and even then, it only matters to the extent that you want to use existing XML tools instead of existing SGML tools). (It has been pointed out to me that there is a growing developer community that is doing exactly this: providing add-on tools that take XHTML as input and do interesting things with it. Like the design community that simply won’t talk to you unless your markup validates, the developer community may all collectively decide not to talk to you unless you’re using XHTML. This may end up being the strongest argument for using XHTML.)
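I can’t show XSLT with nothing but the Python standard library, so here is a stand-in that makes the same point with `xml.etree.ElementTree` instead: one XML tree, serialized either as XML or as HTML just by changing the output method. The function name and sample fragment are assumptions for illustration; XSLT’s own output method switch works along the same lines.

```python
import xml.etree.ElementTree as ET


def serialize_both(xml_source):
    """Parse well-formed XML once, then serialize it as XML and as HTML."""
    tree = ET.fromstring(xml_source)
    as_xml = ET.tostring(tree, encoding="unicode", method="xml")
    as_html = ET.tostring(tree, encoding="unicode", method="html")
    return as_xml, as_html


# A made-up fragment with two empty elements, where the syntaxes differ.
source = '<p>Save the gerbils.<br/><img src="gerbil.png" alt="A gerbil"/></p>'

as_xml, as_html = serialize_both(source)
print(as_xml)    # empty elements are written in self-closing XML form
print(as_html)   # empty elements are written as bare HTML start tags
```

The tools upstream never know or care which flavor comes out the other end; that only matters to whatever consumes the output next.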

And finally, anyone who tells you that any of these concepts will make your web site look better on mobile devices is selling you snake oil. Older mobile devices only supported a weird fucked up subset of HTML 3.2, and newer mobile devices have ultra-smart browsers that reflow even the most rigid designs and parse even the most fucked up Tag Soup markup. Every new mobile device that comes out seems to trip up on CSS in its own way, and apparently nobody told the mobile vendors about XHTML Basic (don’t ask).

So there are a lot of overlapping concepts here, and if you are the sort of person who is trying to push one of them, you’re probably going to try to push all of them. Despite the fact that it shares 100% of its tags with HTML 4, people are pushing XHTML as a fresh start, a better way of doing things. There are people who are (intentionally or not) conflating all of these issues, advocating XHTML but then trying to slip validation, CSS, accessibility, and semantics in under the radar at the same time. Everything is loosely related anyway, and if you’re going to make a fresh start and make the leap to standards-based design (and it is quite the leap, if you’ve been doing Tag Soup design all your professional life), you may as well go whole hog.

And there’s nothing wrong with this argument, per se. HTML has been historically branded with the stigma of cowboy coders, anything-goes, forcing round pegs into square Netscape browsers, and just generally being a wild woolly mess. If XHTML can be branded so as to create the association with validation, separating structure and presentation, accessibility, and other techniques that have worth in their own right, then maybe they can all get traction together. But that’s not a technical argument, it’s a social one, and anyone who claims that these loosely coupled concepts are really tightly coupled is either misinformed or lying.

OK, so I guess this was one of those rants. I promise not to speak of such things for another six months. Save the gerbils.


31 comments

  1. And I thought I was the only one that thought the world was becoming quite so sickeningly focused on something that doesn’t require nor deserve quite so much focus.

    Well said Mark, quite well said.

    Comment by Noel D. Jackson — Friday, August 29, 2003 @ 2:52 am

  2. Interesting is also this rant at USS Clueless about being forced to use XHTML when updating one’s CMS.

    http://denbeste.nu/cd_log_entries/2003/08/Intrusivetools.shtml

    Comment by Scott — Friday, August 29, 2003 @ 3:24 am

  3. Trackback by Never Give Up
  4. NE China is being devastated by Giant Gerbils (http://news.bbc.co.uk/1/hi/world/asia-pacific/3162743.stm)

    If these things discover crack, (or semantic markup) they could take over the world.

    I for one, welcome our new masters.

    Comment by Julian Bond — Friday, August 29, 2003 @ 3:34 am

  5. Methinks this whole AmISemanticOrNot discussion is a very healthy sign in the community and should be welcomed. It is the sign of ambitious young minds (and old farts too) taking pride in their job, striving to improve the quality of their work.

    I say, who cares if the discussion is a little at times, It’s the thought that really counts. I’d much rather want to work in a culture of quality-obsessed navel-gazing, than the who-cares-lets-make-tag-soup culture we’ve had so far. :-)

    P.S. Nice rant though, Mark.

    Comment by Mar — Friday, August 29, 2003 @ 6:02 am

  6. Anyone who teaches you HTML tables to control design or layout is selling you snake oil. Anyone who also claims that it’s the only way and will work everywhere is selling you bad quality snake oil for three times the price.

    What we need to do is to teach people about <div>content</div>+CSS and not <table cellpadding="1" cellspacing="0" width="100%" height="100%" bgcolor="black"><tr><td><table cellpadding="4" cellspacing="0" width="100%" height="100%" bgcolor="white"><tr><td>content</td></tr></table></td></tr></table>.

    Comment by Jesper — Friday, August 29, 2003 @ 7:09 am

  7. Trackback by Links
  8. Trackback by Arthur is verweg!
  9. Trackback by Random Stuff
  10. > I do not, for example, care about the distinction between ACRONYM and ABBR. I do not care about the distinction between STRONG and B, or EM and I. I know of no tool that supports one and not the other

    http://harvest.sourceforge.net/
    http://www.thunderstone.com/site/texisman/
    http://www.namazu.org/

    All local search engines that rank emphasised/strongly emphasised text higher than text that is in italics or bold. Of course, they still “support” i and b elements, but they treat them differently (since they _are_ different).

    > Anyone who tells you that XHTML is easier to parse, consume, or validate because it’s XML is selling you snake oil (or isn’t using the right tools).

    Can you suggest the _right_ tool to replace this:

    http://www.throwingbeans.org/tech/postgresql_and_xml.html

    Part of the reason XML is easier to parse is because it is popular - people write tools for it when they might not when faced with SGML. Is there an SGML parser that can be embedded in postgresql?

    Comment by Jim Dabell — Friday, August 29, 2003 @ 7:49 am

  11. Trackback by Rodent Regatta
  12. To me, choosing HTML or XHTML depends on the requirements of the project. I wrote the Web Design Postcards in HTML 4 because it was (marginally) more lightweight, and I didn’t need any of the features that XHTML offered, like the ability to add RDF metadata, or do one of the small list of things you can do with XHTML and not with HTML.

    As far as I’m concerned, XHTML is a specialist tool that’s rarely necessary. I like to see governments adopting it and including metadata (the UK government’s e-GIF policy requires this), and there are a few other places it’s useful. It’s good for software that wants to spit out — or intermix with — web code (e.g., databases), but pointless for just about anything else.

    Comment by Matt Robinson — Friday, August 29, 2003 @ 9:06 am

  13. Mark, you don’t think you’re a “weird markup-obsessed Alpha Geek freak”?

    Comment by Anonymous — Friday, August 29, 2003 @ 9:58 am

  14. Trackback by Third Culture Design | Weblog
  15. Trackback by Brian's Life
  16. “I also don’t care about the distinction between UL and OL, since general use has contradicted the spec definition for so long that the distinction is meaningless. A UL is an ordered list with bullets, because that’s how everyone uses it.”

    A UL is an ordered list with bullets because that’s what the (original) spec actually says it is. HTML 4.0 apparently dropped all mention of “bullets”, but in HTML 2.0 and Tim BL’s original informal spec, the intention was clear.

    Comment by Joe English — Friday, August 29, 2003 @ 11:18 am

  17. ABBR vs ACRONYM:

    In an aural browser, the latter is meant to be READ, whereas the former is meant to be SPELLED-OUT. JAWS 4.51 is purported to support these tags properly now, so someone who cares about such things would want to distinguish between them. (And let’s leave aside modern, Standards-based screenreaders, like EmacsSpeak.)

    B and I vs STRONG and EM:

    As Comment 10 points out, these, too, are distinguishable, both in aural browsers and at that blind behemoth, Google.

    HTML vs XML Parsing:

    Using off-the-shelf XML parsing tools makes things like Comment-Validation trivially easy (see http://www.agresticism.org/furrow/2003/8/24/valid_comments_script/ for a demo).

    If, however, Validation is bullshit, then Schema are irrelevant and Comment-Validation is a crock.

    Web authors should not waste their time with such irrelevancies. But they shouldn’t come whining to Pilgrim either ( http://diveintomark.org/archives/2003/05/05/why_we_wont_help_you ).

    Comment by Jacques Distler — Friday, August 29, 2003 @ 11:53 am

  18. Jacques may actually have a good point, underneath all that condescension. While you can transform/convert/do cool things to XHTML with XML tools and HTML with SGML tools, *more people* are doing interesting things with XML tools. If the development community that can write such add-on tools all decides that XML is the baseline, then you’ll be forced into using XHTML in order to take advantage of those tools.

    In other words, you can use HTML, but don’t come whining to Jacques.

    Comment by Mark — Friday, August 29, 2003 @ 12:09 pm

  19. re: “And let’s leave aside modern, Standards-based screenreaders, like EmacsSpeak”

    User base: 2.

    Yes, let’s.

    Comment by Mark — Friday, August 29, 2003 @ 12:15 pm

  20. On the if-you-can’t-say-something-cogent-say-something-trivial principle, I agree with Kottke that most people are writing weblogs, abbreviated blog, and not web logs. On the other pseudopod, I believe http://www.dm.net/~lnh/Log/log.html really is a web log and not a weblog (and contra the navigation don’t get to call it a blog).

    —L.

    Comment by lnh — Friday, August 29, 2003 @ 12:32 pm

  21. Let me take a crack at semantic pragmatism: when it comes to parsing, what cool things can you do with XHTML that you /can’t/ do with HTML?

    Comment by Ken Walker — Friday, August 29, 2003 @ 12:38 pm

  22. Ken,

    There are plenty of cool things you can do with XHTML that you cannot do with HTML. XHTML+MATHML+SVG, for example.

    Semantics has nothing to do with parsing though. That was one of Mark’s points.

    Comment by Jim Dabell — Friday, August 29, 2003 @ 1:44 pm

  23. Jim, I’m referring to Mark’s statement in comment 18:

    “While you can transform/convert/do cool things to XHTML with XML tools and HTML with SGML tools, *more people* are doing interesting things with XML tools.”

    I’d really like to see some concrete examples of what cool stuff can be done with SGML tools vs. what cool stuff can be done with XML tools. Use cases would interest me most. When I read “XHTML+MATHML+SVG,” it brings to mind vague impressions of dynamically graphing 3D calculus–pretty nebulous and unexciting for me.

    My apologies for mixing concepts that are pretty, well, you know…

    Comment by Ken Walker — Friday, August 29, 2003 @ 2:37 pm

  24. Trackback by Lethargic Ramblings
  25. Comment 5 sums it up for me. Attempting to code with valid XHTML and CSS is an effort to try and add a touch of class and mark of quality to my work. I want to produce something that I personally can be proud of. When I do a single-pixel hack I feel dirty.

    I tend to agree with Mark that the issues above are indeed separate, but they are not mutually exclusive, which is why people might confuse them.

    Comment by Phillip Harrington — Friday, August 29, 2003 @ 5:05 pm

  26. Well, you know, it was a pretty good rant, so I guess you’ll get more visits from the markup geeks indeed… But gerbil wannabes in search of a quick fix might want to get off your bandwidth and look at my extremely fresh (the pixels haven’t quite dried yet) rant on the Eolas/Microsoft suit.

    To cut a long story short, Eolas sued Microsoft over plugins and they might have to remove (or change) plugin support on IE. Right now, it looks like as good an incentive as any to force people to adhere to proper web standards - too bad such a blunt weapon as the US patent legislation had to be used…

    Comment by Rui Carmo — Friday, August 29, 2003 @ 6:45 pm

  27. Oops. Missed the rant link (sorry): http://mac.against.org/space/blog/2003-08-29.22%3A26

    Comment by Rui Carmo — Friday, August 29, 2003 @ 6:47 pm

  28. Good stuff Mark.

    The thing that fascinated me about this was how the discussion spread like forest fire to other blogs, most of which were basically repeating the same point(s) of view and then linking to each other.

    Comment by Paul Michael Smith — Friday, August 29, 2003 @ 6:51 pm

  29. Trackback by Quarter Life Crisis
  30. “spread like forest fire to other blogs, most of which were basically repeating the same point(s) of view and then linking to each other.”

    Sounds about right. That’s blogging for you.

    Comment by MikeyC — Friday, August 29, 2003 @ 10:07 pm

  31. Trackback by Leaves Rustle

© 2001-8 Mark Pilgrim