Mark Pilgrim: The Road to XHTML 2.0: MIME Types. First in a series.

Here’s a dirty little secret: browsers aren’t actually treating your XHTML as XML. Your validated, correctly DOCTYPE’d, completely standards compliant XHTML markup is being treated as if it were still HTML with a few weird slashes in places they don’t belong. Why? The answer is MIME types.

I’m positioning myself to be Mr. XHTML 2.0. Because if there’s one thing the world really needs right now, it’s more irony.

§

Forty-seven comments here (latest comments)

  1. Heh; the answer is *always* “MIME types”.

    — Kendall Clark #

  2. “Most current browsers don’t handle the application/xhtml+xml MIME type correctly, so you’ll need to make provisions for serving up your XHTML the old-fashioned way (as text/html) to these browsers.”

    After toying around with the OBJECT tag for a few minutes I discovered a way to get around Internet Explorer’s “prompt to download” behaviour when encountering XHTML files sent out with the application/xhtml+xml MIME type.

    Here’s what you do: use “application/xml” or “text/xml” for the TYPE attribute in the OBJECT tag. Tested in IE6, you will not be prompted to download the file. This way you can send your files out with the correct MIME type to Internet Explorer (and possibly other browsers).

    Unfortunately, your browser does not support XHTML 2.0!

    You can make the OBJECT tag fill the entire screen so the user will not realize they are viewing the page through an OBJECT tag.

    I am half asleep right now, so if any of the above is dumb, then that’s my excuse.

    — MikeyC #

  3. My mark-up was stripped out of the last message; let’s try it again…

    example:

    <object data="http://www.w3.org/People/mimasa/test/xhtml/media-types/test3.xhtml" type="application/xml" width="100%" height="100%">your browser doesn’t support XHTML 2.0!</object>

    — MikeyC #

  4. Unfortunately, IE object support is broken enough to not support this.

    — Anonymous #

  5. The article says “Mozilla and its derivatives are the only major browsers that can handle the XHTML MIME type”.

    I don’t know about your definition of “handle” or “major browser”, but Opera will also parse pages differently for MIME types of application/xml or application/xhtml+xml, stopping when it comes to malformed syntax with an error message.

    See http://www.opera.com/docs/specs/#html

    — Kevin W #

  6. Opera does application/xhtml+xml - but keep in mind that the script element isn’t supported in this mode.

    — Arve #

  7. “Unfortunately, IE object support is broken enough to not support this.”

    It seems to work in IE6. Are you talking about IE4/IE5?

    — MikeyC #

  8. I base that statement on these results:

    http://www.w3.org/People/mimasa/test/xhtml/media-types/results

    — Mark #

  9. Mark, there’s a quite serious bug in the code snippets and discussion of the http accept header. There is no discussion or handling of the weighting values that may be part of the accept header.

    For example, there are a few user agents (such as Pocket IE) that display XML processing instructions as part of the content. I know of at least one person who sends an Accept header that explicitly says he does not want XHTML, which would show up in your code snippets as one of the cases where XHTML would be sent.

    How about a short one-liner saying “watch out for…”?

    — Jim #
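
The weighting values raised in comment 9 can be honored with a small amount of parsing. What follows is a minimal Python sketch, not the article's own snippet. The function names (`parse_accept`, `quality`, `choose_content_type`) are invented for illustration, and the policy of requiring an explicit `application/xhtml+xml` entry (ignoring wildcards for XHTML) is an assumption, made because a browser may send `*/*` and still be unable to render that type.

```python
def parse_accept(header):
    """Return a list of (media_range, q) pairs from an Accept header value."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: treat an unparsable weight as "not acceptable"
        prefs.append((media_range, q))
    return prefs


def quality(prefs, media_type):
    """Weight for media_type, taken from the most specific matching range."""
    main_type = media_type.split("/")[0]
    best_rank, best_q = -1, 0.0
    for media_range, q in prefs:
        if media_range == media_type:
            rank = 2
        elif media_range == main_type + "/*":
            rank = 1
        elif media_range == "*/*":
            rank = 0
        else:
            continue
        if rank > best_rank:
            best_rank, best_q = rank, q
    return best_q


def choose_content_type(accept_header):
    """Send XHTML only if it is listed explicitly, with q > 0 and q >= text/html."""
    prefs = parse_accept(accept_header or "")
    # Wildcards are deliberately ignored for XHTML (assumption, see lead-in).
    xhtml_q = dict(prefs).get("application/xhtml+xml", 0.0)
    html_q = quality(prefs, "text/html")
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"
```

With this policy, a client that sends `application/xhtml+xml;q=0` gets text/html, which is the case the comment warns about.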

  10. I believe the mod_rewrite rule takes this into account, but if you’re correct, then the PHP and Python snippets should be updated.

    — Mark #

  11. “Unfortunately, IE object support is broken enough to not support this.”

    Follow-up: Just tested it in IE5 (Win) and it works! On IE5 (Mac) you get the disclaimer “Your browser does not support XHTML 2.0”. In both cases, this means that the Object tag is working well enough in IE to accomplish the goal at hand.

    So the two browsers that make up about 95% of the market (IE5 and IE6 on Windows) display the page content instead of prompting a download box.

    So you can send out one MIME type to all browsers if you are willing to accept that older browsers will simply get the disclaimer to upgrade their browser.

    Extensive testing hasn’t been done so I might have missed a crucial detail, but I think everything is kosher.

    — MikeyC #

  12. Could you remind me why we should bother?

    It seems like things got ugly for you when you went from XHTML1.0 to 1.1.

    http://diveintomark.org/archives/2002/11/21/a_warning_to_others.html

    So why not just stick with 1.0 for now? (Having to do server tweaks to change mime-types based on browser seems like a significant barrier for lots of people.)

    — BillSeitz #

  13. Short answer: beats me. But I know a lot about it, so I may as well get paid for it.

    Long answer:

    I really don’t know why you should bother with the proper MIME types (in this specific case — of course in general they’re very useful). Serving XHTML as application/xhtml+xml is a lot of hassle for (as far as I can tell) no actual payoff. Which leads to the questions “will it ever get any easier?” and “will the payoff ever get any bigger?” If you answer both of these questions “no”, then the question becomes “why bother with XHTML 1.x at all?”

    I have heard a variety of answers to this: there are a few things that XHTML 1.x offers that HTML 4 does not (Ruby annotations, embedding MathML), but I’m not using any of those and have no plans to. Other answers like “well, it keeps me honest” are bogus; take your sado-markupchism elsewhere.

    The other answer that we (I) used to give is “XHTML 1.x provides a migration path to 2.0.” But we now all know that’s not true; XHTML 2.0 isn’t backwardly compatible with XHTML 1.x. (Admittedly, some people knew this sooner than others, but we all figured it out when the first proposal draft of XHTML 2.0 came out.) I guess it’s true to a limited extent, in the sense that you need to close your P and LI tags and so forth, but you *could* have been doing that in HTML 4 all along. (Maybe this is where the “well it keeps me honest” sentiment comes from.)

    The rest of the series will be about ways you can limit your markup and practices now (stop using IMG, for example) so you’ll have fewer headaches when the time comes to migrate your app/CMS/documents to XHTML 2.0. I still don’t know why anyone would *want* to do this (and the XHTML authors seem to think that no one will), but people spent a lot of time migrating systems from HTML 4 to XHTML 1, so I’m assuming there will be a similar demand once XHTML 2 goes final.

    Meanwhile, I’m sticking with HTML 4. Oh, the irony.

    — Mark #

  14. “MUST be served with a MIME-type of application/xhtml+xml.”

    Haha! That’s a funny one.

    I *do* use XHTML1.1+MathML2.0 on my blog. So, indeed, I *do* send application/xhtml+xml to Mozilla. But I don’t want to send it to other standards-compliant browsers (like Safari or, ironically, Camino) which can handle the application/xhtml+xml MIME-type, but which don’t understand MathML. They *must* get text/html.

    This “MUST” specification is simply divorced from reality.

    — Jacques Distler #

  15. Re: OBJECT.

    http://diveintomark.org/public/object-test.html

    Source code:

    http://diveintomark.org/public/object-test.txt

    Does not work in IE6SP1 on WinXP (I get “your browser does not support” message).

    Works on Mozilla (makes sense) but URL in location bar never changes (also makes sense — this is like “framing” an entire web site, only with MIME types thrown in for added confusion).

    Bottom line: this is not a viable solution.

    — Mark #

  16. Re: MathML not supported by all browsers that support application/xhtml+xml. This is a known problem; once you start mixing and matching, Content-type is woefully inadequate. See:

    http://www.xml.com/pub/a/2002/01/23/tag.html

    Don’t know if there has been further discussion or a solution (the article is over a year old), but I’m sure an enterprising soul can point us in the right direction.

    — Mark #

  17. Test comment.

    — Mark #

  18. Test comment #2.

    — Mark #

  19. Interesting article. A fascinating explanation of why XML is a rat’s nest.

    But, in my situation, the problem isn’t about XML namespaces versus media-types, but about whether Camino (say) should be invoking its XML parser at all. I actually *want* it to treat my document as “tag soup” and *ignore* the tags from the MathML namespace.

    In Camino’s case, what I really want is for them to grab an updated gecko engine from the Mozilla trunk. Non-gecko-based browsers are unlikely to understand MathML in the foreseeable future. So I don’t think I will ever want them using their XML parser to handle my blog.

    — Jacques Distler #

  20. “Re: OBJECT.

    http://diveintomark.org/public/object-test.html

    Source code:

    http://diveintomark.org/public/object-test.txt

    Does not work in IE6SP1 on WinXP (I get “your browser does not support” message).”

    Mark, in my example I wrote: either use application/xml or text/xml, not application/xhtml+xml for the type attribute. It works in my example.

    — Anonymous #

  21. Another problem with your example is that the page you have set in the data attribute is being sent out with a text/html MIME type.

    <object data="http://diveintoaccessibility.org/" type="application/xml" width="100%" height="100%">your browser doesn’t support application/xhtml+xml MIME type</object>

    Try my example exactly as I have it in #3.

    — MikeyC #

  22. On my fledgeling blog I currently have Movable Type generating neither HTML nor XHTML, but rather an XML “blog” vocabulary of my design. I then use AxKit (http://www.axkit.org/ ) and a few XSLT stylesheets to turn those XML files into XHTML 1.1.

    Why? Well, as a sandbox for playing with the tech mostly. The XHTML 1.1 compliance was just another thing to try to hit, and it wasn’t really that hard. It’s easier to generate XML from XSLT than non-XML, so it was simpler to create some form of XHTML than it was to create non-validating HTML4.

    I do, however, send everything as text/html. The MIME type is set as a function of the XSLT stylesheet, set in the tag. Eventually I may work up an AxKit output filter to do the correct type.

    The only problem I had with XHTML 1.1 is that it doesn’t like many things as container tags. Any li or form tags need to be wrapped in their own div or p.

    The description of my setup is at http://cthompson.com/entries/2003/03/15/ and if you wanted to see the raw html of that file with no stylesheet applied, it’s http://cthompson.com/entries/2003/03/15/?passthru=1 though you’ll probably have to view source to see it.

    — Chris Thompson #

  23. For the MathML problem: According to http://www.ietf.org/rfc/rfc3236.txt (not a standard) you could use
    application/xhtml+xml;profile="http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd"

    — Sjoerd Visscher #

  24. Chris, I don’t understand quite what you mean about container tags; the rules for block-level elements in XHTML are well-defined, and haven’t changed since HTML 4. Also, that page you linked doesn’t validate. It appears to be well-formed XML (sent as text/html), but it’s not valid XHTML 1.1.

    No wonder writing browsers is hard.

    — Mark #

  25. I think XHTML has been adopted as a publishing format well before it was sensible to do so. Why?

    All the cool kids are doing it. That’s all I can figure. Imagine everyone replacing all their JPGs and GIFs with SVG and PNG -now-, and you’d get about the same mess. Well– except that there are reasonable tools for converting amongst -those- formats.

    The platform wasn’t ready, and people started hacking about. Calamity ensued.

    Content negotiation is a key to the long-term health of the web, but it gets no attention. It sits uncomfortably in that space only readily visible to publishers, but too technical and abstract for J. Random HTML-slinger to “get”, much less do anything about.

    Web server architecture doesn’t support conneg well at all. Yes, I’m aware Apache does conneg. But not well enough.

    MIME is insufficient in namespaced XML. Conneg has provisions for “feature”-based negotiation, but features have never been implemented anywhere.

    This is likely because the weight of fine-grained conneg based on UA feature capabilities in every request would overwhelm HTTP. Imagine a token for every little “recommendation” * version * module supported being included in each request. That’s what you need to properly do server-side conneg, and that’s a -lot- of data up to the server.

    My feeling is the following is needed to unravel this mess:

    1) Decoupling of URIs and File-paths. It’s a clever hack, but there’s no law stating that a URI=a file.

    2) Widespread implementation of feature-based conneg. Client-side would be better, but server-side is more technically possible (that is, there’s less impetus on the server).

    3) Better web server infrastructure to identify and serve alternates, plus web development tools to render the same content in various negotiated formats.

    …Or we could just stick with HTML. Really.

    There’s sort of an obsession with making the Semantic Web sit atop rather than alongside the Dumb Web. I fear this approach will diminish the value of both. Time will tell. :)

    — Jeremy Dunck #

  26. Jeremy, I agree with you. However, you win the modified Godwin’s Law award for being the first to mention the Semantic Web, which means that rational conversation will now cease.

    — Mark #

  27. Re: #3. I tried it exactly as you typed it, and I get a big old blank OBJECT in IE6/SP1/WinXP. That is, I see the border and the scrollbar of where the OBJECT should be, but there’s nothing inside it.

    http://diveintomark.org/public/object-test-3.html
    http://diveintomark.org/public/object-test-3.txt

    Further investigation by someone else may determine the difference between your test and mine.

    Also, this doesn’t solve the problem of the location bar never changing (making it impossible to bookmark). And what happens if someone right-clicks a site link and opens it in a new window? If the file itself is being served as application/xhtml+xml, then wouldn’t non-aware browsers offer to download it?

    In other words, you’ve reduced your entire site to a single unbookmarkable window. If I wanted to do that, I’d just use Flash.

    — Mark #

  28. <object data="http://diveintoaccessibility.org/" type="application/xml" width="100%" height="100%">your browser doesn’t support application/xhtml+xml MIME type</object>

    and

    <object data="http://www.w3.org/People/mimasa/test/xhtml/media-types/test3.xhtml" type="application/xml" width="100%" height="100%">your browser doesn’t support XHTML 2.0!</object>

    never load anything (data or error message). When the mime types are changed to application/xhtml+xml I get the error messages.

    IE6 WIN2K

    — gilli #

  29. “Re: #3. I tried it exactly as you typed it, and I get a big old blank OBJECT in IE6/SP1/WinXP. That is, I see the border and the scrollbar of where the OBJECT should be, but there’s nothing inside it.”

    Occasionally I’ve noticed this does happen (like one out of ten times). But when it does, I reload and it works once again.

    I’ve tried this at home and at work (3 separate computers in total) and it’s worked every time for me. Something special about Toronto (besides being the home of Joe Clark)?

    It’s not really a solution to anything, but I just thought that it was kind of interesting that I was able to get IE to display a file with an application/xhtml+xml MIME type instead of prompting me to download it, and it may have provided a stepping stone towards an actual solution.

    — MikeyC #

  30. Jeremy, I thought a few comments were in order:

    "Imagine a token for every little ‘recommendation’ * version * module supported being included in each request."

    Obviously what’s needed here is a little indirection. a UA string can be associated on the server with a profile of features, and only if the UA wanted to override the defaults for the UA would additional information need to be supplied. UA strings can be extended with a sub-profile identifier for a little more flexibility, as well.

    "Decoupling of URIs and File-paths. It’s a clever hack, but there’s no law stating that a URI=a file."

    Some server technologies already do this. Zope, my personal favorite, generally maps path elements to objects-contained-within-objects, although you can override the path traversal code to interpret some path elements as arguments instead. This lets you do some pretty cool (and occasionally strange) things when developing web applications, while still keeping the URLs ‘normal’ looking.

    "…Or we could just stick with HTML. Really.

    There’s sort of an obsession with making the Semantic Web sit atop rather than alongside the Dumb Web. I fear this approach will diminish the value of both."

    I’m pretty sure that both semantic and dumb web philosophies will continue to be used. As long as the HTTP protocol itself isn’t *improved* to make it smarter, smart and dumb clients will be able to interact with smart and dumb servers, in any combination.

    — Michael Bernstein #
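
The indirection proposed in comment 30 (a UA string mapped on the server to a default feature profile, with per-request overrides) might look roughly like the Python sketch below. Everything in it is hypothetical: the UA strings, the feature names, and the `features_for` and `supports` helpers are made up for illustration, and no existing server or HTTP header works this way as far as the thread establishes.

```python
# Hypothetical server-side table: User-Agent string -> default feature profile.
DEFAULT_PROFILES = {
    "ExampleBrowser/1.0": {"xhtml+xml": True, "mathml": False},
    "ExampleBrowser/2.0": {"xhtml+xml": True, "mathml": True},
}


def features_for(user_agent, overrides=None):
    """Start from the stored profile for this UA, then apply per-request overrides."""
    # An unknown UA starts with an empty profile (no known features).
    profile = dict(DEFAULT_PROFILES.get(user_agent, {}))
    profile.update(overrides or {})
    return profile


def supports(user_agent, feature, overrides=None):
    """True only if the merged profile explicitly enables the feature."""
    return features_for(user_agent, overrides).get(feature, False)
```

Note that an unrecognized UA string gets an empty profile here, so the table has to be kept current for every browser release.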

  31. I used to be very much into the whole XHTML bandwagon. It’s clean - I can be purist - someone 8000 miles away may think I’m cool. But in all honesty, what is the real point?

    1. It’s accepted that serving up XHTML with an incorrect MIME type is wrong. I’d prefer to wait 10 years until browsers actually understand it before going back to it.

    2. XHTML surely should be kept to being the product of transforming XML documents when development necessitates this and not as a web community ‘status’ thing.

    3. Accessibility (text and braille browsers) is a good thing, but this can be achieved with well structured HTML and skilled use of CSS. Tag Soup is optional with HTML which people seem to forget.

    — Marz #

  32. If we’re going to have a long-winded conversation on the topic, perhaps email would be better?
    Then again, if Mark doesn’t mind hosting here, it might be more generally useful.

    ===
    Michael,
    I disagree with the indirection idea.  I mean, yes you get away from the big-header problem, but you introduce another one.  Namely, versioning of UAs, and propagation of server-side knowledge of the universe of UAs.

    Only the UA really knows what the UA can do.  It must have the choice over alternates, either by passing request information to the server and letting the server decide, or by asking the server for just any old thing, and the server returning a list of alternates, allowing the UA to decide which alternate it can handle or is best for that UA.

    Also, in my dreamworld, open-source browsers near perfect implementations of standards (by selfish fixing of bugs, rather than authoring workaround hacks), meaning that browsers would (hopefully) have a new, better version every day.

    Regarding the meshing of SW and DW, yes, you’re right, as long as the underlying infrastructure stays the same, it’s possible that both will flourish.  Dandy.

    My issue here is the attempt to combine the two ideas.  That is, all this talk of semantic markup (XHTML, don’t use tables for layout, etc etc), in terms of “XHTML is more semantic than HTML” is often a bunch of hopeless angst and gnashing of teeth, because it’s all not terribly semantic in useful ways.

    All the geeks out in hand-coding land devote an awful lot of energy “doing the right thing”, and it gets in the way of people saying what they need to say.

    And it turns out not to be the right thing.  It’s terribly frustrating.

    There’s an awful lot of assumptions made by a lot of people about the “most semantic” way to do something, and there’s been a lot of really smart people displaying a complete lack of understanding in the process.

    Some people will want to annotate their stuff w/ RDF and be quite semantic.  Some people want text on the world’s screens, yesterday, and RDF doesn’t serve their needs at all.

    As in nearly all important things, there’s no easy answer.  Do what’s right for you.

    The tool should serve the people… not the other way around.

    Gah.. I sound like Mark.  I gotta get outta here.  ;)

    — Jeremy Dunck #

  33. Mark,

    Sorry about bringing up SW.  I did, in fact, mean to be helpful.  :)

    — Jeremy Dunck #

  34. Just a few comments on the MIME-types:

    For the moment, I see no reason why the mimetype for xhtml is ‘application/xhtml+xml’. Actually, because of xml’s lack of script inclusion mechanism (except the controversial PIs), ‘text/html’ is more likely to be an application.

    And, xhtml is human readable to at least as high level as html. There’s user agents that handle mimetypes by what the first part of the mime-type is. ‘application’ gets sent to the default handler for that format. ‘image’ get rendered, or sent to the default handler, depending on whether the user agent is able to render it. ‘text’ gets parsed if the user agent has a parser for it, otherwise it’s simply displayed.

    So, if your xhtml document is actually a document to be read, not an application to run, using ‘application/xhtml+xml’ is fooling the user agent. It would be better if there existed a ‘text/xhtml+xml’, but for documents, I feel there’s good reason to avoid ‘application/xhtml+xml’. Perhaps ‘text/xml’ might be a more appropriate alternative to ‘text/html’, though.

    (Just a side note, in IE, try sending the document with ‘application/xhtml+xml’ with a multi-line comment on the line after the xml declaration but before the doctype.)

    — liorean #
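
The behavior described in comment 34, where a user agent decides what to do from the first part of the MIME type, can be sketched in Python as follows. This is only an illustration of that description: the `dispatch` function, its parameters, and the returned strings are all hypothetical and do not model any particular browser.

```python
def dispatch(content_type, text_parsers, renderable_images):
    """Return the action a user agent of this kind would take for a Content-Type.

    text_parsers and renderable_images are sets of full media types
    (hypothetical inputs describing what this user agent can handle).
    """
    # Drop parameters such as "; charset=utf-8" before looking at the type.
    media_type = content_type.split(";")[0].strip().lower()
    top_level = media_type.partition("/")[0]
    if top_level == "text":
        # Parsed when a parser exists, otherwise simply displayed.
        return "parse" if media_type in text_parsers else "display as plain text"
    if top_level == "image":
        return "render" if media_type in renderable_images else "hand to default handler"
    # "application" and anything else goes to the default handler for that format.
    return "hand to default handler"
```

Under this model a document sent as `application/xhtml+xml` never reaches the HTML parser, which is the objection the comment raises.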

  35. phoukka's thoughts (trackback)
  36. Mark, just goes to show that writing validating pages is not a one shot deal, but an ongoing process. The pages HAD validated, and I went and broke them by adding Javascript to the page in a way that 1.1 didn’t like. The page validates now.

    As to container tags, perhaps it has been that way since HTML4 and I just didn’t know as I was never validating my ugly pages.

    What I was referring to was the apparent requirements of nesting. For example, the validator groaned at me when I tried to do <form>…</form> at the same level as my <p>…</p> blocks. The only way it was happy was with <p><form>…</form></p>. (Or with <div> instead of <p>).

    If that is, in fact, a “must” in HTML 4 and not a “should”, then I was completely unaware of that. I had assumed that was a new feature of XHTML 1.x.

    — Chris Thompson #

  37. Chris,
    I’m gonna teach you to fish, rather than giving you a fish.

    Understanding DTD definitions is (sadly) still important to the author wishing to produce valid markup.

    The [HTML DTD] describes what elements can be directly contained within a given element in HTML.

    Specifically:
    <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->

    This says, among other things, that the body element can contain elements described by %block;, or a SCRIPT element.

    So you go look at the definition of the %block; entity:

    <!ENTITY % block
    "P | %heading; | %list; | %preformatted; | DL | DIV | NOSCRIPT |
    BLOCKQUOTE | FORM | HR | TABLE | FIELDSET | ADDRESS">

    This means that %block; expands to one of the listed elements.

    Notice that there’s three more entities in the list. Namely, %heading;, %list;, and %preformatted;. That means we need to go see what these mean, before we can really understand what %block; means, and therefore, before we can understand what the body element can contain.

    If you’re not lost yet, you should see the pattern: when you see an entity, you can look up that entity to expand it into hard references.

    You can think of entities in DTDs as roughly analogous to function calls; that is, they are a concise way of describing (and reusing) something more verbose.

    Going through the needed expansions, then, you should be able to determine all of the tags that the body element can contain.

    If you want validation that you’ve done it correctly (and learned how to read at least part of DTDs), post your answer here. :)

    [HTML DTD]
    http://www.w3.org/TR/html401/sgml/dtd.html#body

    — Jeremy Dunck #
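
The expansion procedure walked through in comment 37 can also be done mechanically. Below is a minimal Python sketch under stated assumptions: the `%heading;`, `%list;`, and `%preformatted;` definitions are copied from the HTML 4.01 Strict DTD linked in the comment, the `expand` and `element_names` helpers are invented for illustration, and the `+(INS|DEL)` inclusion exception on BODY is not handled.

```python
import re

# Parameter entities as declared in the HTML 4.01 Strict DTD.
ENTITIES = {
    "heading": "H1|H2|H3|H4|H5|H6",
    "list": "UL | OL",
    "preformatted": "PRE",
    "block": "P | %heading; | %list; | %preformatted; | DL | DIV | NOSCRIPT | "
             "BLOCKQUOTE | FORM | HR | TABLE | FIELDSET | ADDRESS",
}


def expand(text, entities):
    """Recursively replace every %name; reference with its definition."""
    def substitute(match):
        return expand(entities[match.group(1)], entities)
    return re.sub(r"%([A-Za-z][A-Za-z0-9.-]*);", substitute, text)


def element_names(content_model, entities):
    """Sorted, de-duplicated element names appearing in the expanded content model."""
    expanded = expand(content_model, entities)
    return sorted(set(re.findall(r"[A-Za-z][A-Za-z0-9]*", expanded)))


# Content model of BODY, without the +(INS|DEL) inclusion exception.
body_children = element_names("(%block;|SCRIPT)+", ENTITIES)
print(body_children)
```

FORM appears in the resulting list, which matches the point of the exercise: a form is allowed directly inside body.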

  38. Jim Mangan's Weblog (trackback)
  39. I don’t know why the MIME type is “application/xhtml+xml”, rather than “text/xhtml+xml”, but there it is.

    It seems to me that (since the core of XHTML 1.x *is* HTML 4) the only reason to use this MIME type in preference to “text/html” is if you really are making use of the “+xml” part. That is, if you are using extensions which go beyond HTML.

    Are these extensions (MathML, SVG, …) supported by user-agents? And if a user-agent doesn’t support the extension in question, what is it supposed to do? Give up? Ignore those tags and render the rest of the document (a la ‘tag soup’)?

    About the only extension which has any significant support in current user-agents is MathML (and that support is pretty pitiful).

    I’m *stuck* using “application/xhtml+xml”. Most everyone else (IMHO) should just relax and send out “text/html” because that’s
    a)easier
    b)a more accurate description of their actual content.

    — Jacques Distler #

  40. slouching toward bethlehem (trackback)
  41. Mark, fair enough. Simply a bad assumption on my part. While in the past I attempted to write reasonable HTML (Not doing things like skipping trailing </td> or </li> as some people do, and using <p> along with </p> as a container, not as a standalone vertical spacer) I never tried to validate. So when I actually did validate I was attributing the error to big bad XHTML being too strict. Mea culpa.

    I have just checked and all of my pages now validate. My skeleton templates were fine, it was mostly content in a few of the stories that were causing problems.

    — Chris Thompson #

  42. I just tested the MikeyC object code in Mozilla and IE6 on Windows XP and it worked for me in both. But why is IE6 rejecting half the formatting? Forms aren’t shown as forms but text for instance. And simple links, like the email address at the bottom, aren’t clickable!

    Also, not only do you not see the page address in the address bar, but the browser still shows the title of the previous page you visited.

    Worse, try clicking on the first “Option Selector” form drop-down menu. It flashes the screen black and white in a most horrid way every time you click. Although this might be due to a black background applied to the main body.

    Doing anything with an XML declaration is difficult anyway at this moment in time because IE6 requires the DOCTYPE to be on the first line to produce XHTML! Otherwise it acts as if the document is HTML (by serving it up in “Quirks Mode”). It’s a known bug.

    — Chris Hester #

  43. “I just tested the MikeyC object code in Mozilla and IE6 on Windows XP and it worked for me in both.”

    Thanks for at least proving I’m not crazy and that the example (while admittedly useless) does indeed work.

    — MikeyC #

  44. - “why bother with XHTML 1.x at all?”

    Because it is valid XML.

    This is especially useful when all my HTML is built using XSL stylesheets. XHTML provides me with a namespace that not only lets the code I produce be valid, but the code that generates it as well.

    — Peter #

  45. The XSL argument is bogus. This is standard XSL:

    <xsl:output method="html" version="4.0"/>

    http://www.w3.org/TR/xslt#section-HTML-Output-Method

    — Mark #

  46. Padawan.info (trackback)
  47. Nice article, but I think you really should be less cavalier about throwing away the qvalues. An example of why, and code that handles them properly, is at: http://www.klio.org/marks/2003_04_archive.html#entry-40

    — Mike Kozlowski #

§


© 2001–8 Mark Pilgrim