Even the experts can’t get it right 100% of the time.

[screenshot of xml error on intertwingly.net]

screenshot taken at 10:29 PM on March 8, 2008.

For the record, my site is valid HTML 5, except the parts that aren’t. My therapist says I shouldn’t rely so much on external validation.

§

Fifty three comments here (latest comments)

  1. Oh, Mark, but it’s better than that.

    The offending <meta> tag, which made the page ill-formed?

    <meta http-equiv="x-ua-compatible" content="ie=7">

    Microsoft: the root of all evil.

    (Or maybe it’s Sam’s new Mac-mini.)

    — Jacques Distler #

  2. Surely you mean “the love of Microsoft is the root of all evil.” :)

    — Mark #

  3. I was about to write a comment saying “I wonder how it’s possible that you caught that, I can’t believe Sam’s site would be broken for more than 10 seconds…”—only to visit intertwingly.net and discover it still broken, as of now.

    — Justin Watt #

  4. It’s just data.

    — Arien #

  5. Today’s selected reads « Dustman’s Weblog (pingback)
  6. The great thing about XML’s well-formedness requirements is that this kind of thing can’t happen, because the author would catch this kind of error straight away. With HTML, and its lax parsing rules, this kind of error isn’t caught unless the author runs a conformance checker, which is why the Web is in such a horrendous state (93% syntactically non-conforming content, according to my multibillion file study).

    — Ian Hickson #

  7. Heh. That reminded me of a similarly ironic screen shot I made a few years ago (not from a celebrity’s web site, but I still find it very funny): http://boinkor.net/misc/terrible-xml-error.png

    — Andreas Fuchs #

  8. @Arien: Wow, that really made me laugh.

    — Noah Slater #

  9. Ian Hickson, surely you jest?

    http://diveintomark.org/archives/2004/01/14/thought_experiment

    — Noah #

  10. lol, that closing line is epic.

    — Firas #

  11. Typical Hixie… :) I mean, it’s never happened, ever.

    — Geoffrey Sneddon #

  12. Snideness aside, what Ian says is exactly true… if IE handled XHTML, that is.

    — Jeff Schiller #

  13. Heh. Mark, I’ll change camps if anyone ever catches me with a well-formedness error. (Not having implemented comments yet makes this easier for me to say, admittedly, but the way my weblog works does completely rule out mistakes of the sort that Sam made in this instance. And I fully intend to preserve that property across any features I add.)

    As an aside, the way that Gecko deals with XML parsing errors is unnecessarily user hostile. Other browsers demonstrate that it’s very possible to bow out just a little more gracefully. Of course, the kind of screenshot they yield is lousy for arguing a point…

    — Aristotle Pagaltzis #

  14. > Of course, the kind of screenshot they yield is lousy for arguing a point…

    In this case, wouldn’t it have been the same result? The error was within the head element, so they wouldn’t render any of the body.

    That said, Mozilla’s error handling of this case is like a vestige of another age, when some True Believer thought exactly what Ian said above (but without the sarcasm) and intentionally made the error display as unfriendly as possible. “Fail early and ugly,” and all that.

    — Mark #

  15. Maybe someone can put some focus on Bug 418305

    — Jeff Schiller #

  16. Hey, at least in the previous age, when Mozilla failed on bad XML they would fail ugly.

    In the current era, if your live bookmark points at an invalid feed, you get neither data nor an error message.

    — Kevin H #

  17. > I’ll change camps if anyone ever catches me with a well-formedness error

    Really?

    — Philip Taylor #

  18. Bb's RealTech | Run for the Web (pingback)
  19. Philip:

    Yeah, the error pages are completely outside the code that can prevent malformed content from getting out. I should probably serve them as text/html as they’re just SSIs. I could get mod_include to URI-encode the variable instead of entity-encoding it, but that would just let you trip it with bare ampersands et al; I see no way to make mod_include encode both ways. Pointers welcome. Whenever I implement comments I’ll have the necessary code on the server and I’ll take care of it properly then.

    (There are analogous known bugs at other strata too; f.ex. you can try to bring up /log/999999/ and it will return an empty page with status 200 OK instead of throwing a 404 back at you.)

    Mark:

    > In this case, wouldn’t it have been the same result?

    True.

    — Aristotle Pagaltzis #

  20. @Aristotle: so, you’ll change camps if anyone ever catches you with a well-formedness error on the pages that you already know are well-formed? That doesn’t seem as impressive as your original claim.

    — Mark #

  21. If you limit it to static content, then clearly it’s trivial to ensure well-formedness – just generate all the pages offline, and only switch them into the live system after testing with a validator.

    Most people want to generate content dynamically, and then it gets harder to ensure well-formedness in all cases.

    Looking at 130K randomly selected pages from dmoz.org, 39 are served as application/xhtml+xml. Looking at the first few of those, every single site gets it wrong.

    Some of these sites just need to remember to escape any user-input strings, the same as in text/html. The rest could be fixed by updating their HTML escaping function to do s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/\x{FFFD}/g, but that appears to be too much for anyone to manage.

    — Philip Taylor #
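
An editorial sketch of the substitution Philip describes, written in Python rather than Perl. It maps every character outside the XML 1.0 Char production to U+FFFD and then escapes the markup characters. The function names are hypothetical, and this is only one way such an escaping function might look, not code from any of the sites discussed.

```python
import re
from xml.sax.saxutils import escape

# XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid_xml_chars(text: str) -> str:
    """Replace every character that XML 1.0 forbids with U+FFFD."""
    return _INVALID_XML_CHARS.sub("\uFFFD", text)


def escape_for_xml(text: str) -> str:
    """Hypothetical escaping helper: drop illegal characters, then escape markup."""
    cleaned = replace_invalid_xml_chars(text)
    # saxutils.escape handles &, < and >; the extra mapping covers attribute quotes.
    return escape(cleaned, {'"': "&quot;"})
```

Replacing the illegal characters before escaping means the output cannot contain a raw control character no matter what the user typed.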

  22. Thanks Philip.

    Yes, it’s true that my WP blog does not yet support invalid XML characters in comments or search terms. I hope that either I will have time to patch my WP install or that it will be fixed in a near-future version of the WP product. Until that time, I’ll just continue to manually monitor my blog comments as I do on a day-to-day basis.

    I believe such incidents have only been people trying to break my blog and not people really trying to get access to the content. Could someone describe a scenario where a normal user might encounter such an error? Are there scenarios where these search queries might be used maliciously? Thanks.

    — Jeff Schiller #

  23. Mark:

    Yeah, anything I say now is not going to sound very convincing. The site is a hodgepodge; the weblog part is XML-based, but the rest is old crud that I never got bothered enough to integrate (I’d have to shave a Matryoshka yak to bring it all together). I didn’t even think of those vestiges when I wrote the previous comment. Sucks.

    Let’s put it this way: I am confident that I’ll never have a well-formedness error in the part of the site that I constantly change.

    And I continue to be confident about getting the parts right that take user input; whenever I write them…

    Philip:

    It’s not really any different from static content when you accept user input: make sure the user input passes muster before accepting it. If it doesn’t, ask the user to fix it while they’re still around to do so. To be less hostile, grind it through HTMLTidy/TagSoup/whatever and ask them if the result looks correct. Someone will have to error-correct it; best do that as far upstream as possible, ideally where the source has a chance to either consent to that interpretation or provide the right fix.

    Seems obvious to me… we’ll see if I can walk the walk.

    — Aristotle Pagaltzis #
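
A minimal sketch of the "check it while the user is still around" idea, assuming comments are accepted as XML fragments. The helper name and the wrapping `<div>` are assumptions for illustration; a real system would also need a tag whitelist and a decision about named entities such as `&nbsp;`, which a bare XML parser rejects.

```python
import xml.etree.ElementTree as ET


def check_fragment(fragment: str):
    """Return None if the fragment is well-formed XML, else the parser's message."""
    try:
        # Wrap the fragment so plain text and sibling elements have a single root.
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError as exc:
        return str(exc)
    return None
```

A non-None result can be shown next to the comment form so the author can fix the markup before it is stored.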

  24. > Could someone describe a scenario where a normal user might encounter such an error?

    When submitting feedback? Easily: Trackbacks have no defined encoding. If you sent an ISO-8859-1 trackback which happens to contain what would be a non-character if it were treated as UTF-8, then you could hit it then. That is: “?>. Why might someone write this? Well, let’s have a completely valid line of PHP: .

    — Geoffrey Sneddon #

  25. Thanks, I currently hide my trackback and xmlrpc PHP files. Let me modify the question:

    Could someone describe a scenario where a normal user might encounter such an error when entering a comment on the website or search query? Is there any way to maliciously inject PHP via the comment mechanism within WordPress?

    Sorry, I’m still very new to this encoding stuff… and sorry to be hi-jacking Mark’s blog for WordPress-specific questions.

    — Jeff Schiller #

  26. Yeah, SixApart really screwed the pooch on Trackback. Modern implementations send encoding information in HTTP headers (what a concept!), but if it’s missing or wrong, you’re still screwed. http://chardet.feedparser.org/ may help, but it’s best at Eastern European and CJK languages, not distinguishing win-1252 from iso-8859-1.

    @Jeff: don’t worry, you’re totally on-topic. Aristotle said his system is based on XML; dunno if that means actually serializing a DOM though. WordPress and MT (and Sam’s site, for that matter) are based on string templates, escaping functions, and faith. Pit that against XML’s error handling, and XML wins every time.

    — Mark #

  27. Venus is based on serializing a DOM. Even with that approach, you can have well-formedness errors. Invalid characters in a string, double dashes in a comment, element or attribute names with spaces in them…

    P.S. I would not say faith. I have a very effective feedback loop mechanism in place. :-)

    — Sam Ruby #
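
To illustrate Sam's point with a different toolkit than Venus uses: the sketch below builds trees with Python's `xml.etree.ElementTree`, serializes them, and checks whether the output parses again. The helper name `round_trips` is hypothetical. The serializer escapes `&` and `<`, but it does not reject illegal characters or illegal names.

```python
import xml.etree.ElementTree as ET


def round_trips(element: ET.Element) -> bool:
    """Serialize an element and report whether the result parses again."""
    serialized = ET.tostring(element, encoding="unicode")
    try:
        ET.fromstring(serialized)
    except ET.ParseError:
        return False
    return True


good = ET.Element("p")
good.text = "fish & chips <tasty>"      # markup characters are escaped on output

bad_char = ET.Element("p")
bad_char.text = "vertical tab: \x0b"    # U+000B is not a legal XML 1.0 character

bad_name = ET.Element("not a name")     # element names cannot contain spaces

for label, element in [("good", good), ("bad_char", bad_char), ("bad_name", bad_name)]:
    print(label, round_trips(element))
```

So building a tree instead of concatenating strings removes one class of mistakes, but the content still has to be validated somewhere.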

  28. Lucumr Cogitations » Blog Archive (pingback)
  29. Mark:

    > dunno if that means actually serializing a DOM though.

    It does. The weblog resides in a master Atom feed that’s ground through XSLT to generate various views. Anything I build on top of that will work similarly.

    As for how to handle Trackbacks, I think the correct answer is mu.

    — Aristotle Pagaltzis #

  30. Philip said:

    Most people want to generate content dynamically, and then it gets harder to ensure well-formedness in all cases.

    Boooooring!

    A two line fix in MT::Util->encode_html .

    Surely, one can find a more challenging example of ill-formedness.

    — Jacques Distler #

  31. Labnotes » Rounded Corners - 194 (If it helps you feel better …) (pingback)
  32. Dagnabit, I look away from the trainwreck for four years and absolutely nothing has changed. Why the hell are people[*] in 2008 *still* trying to deliver XHTML to the client?

    Nurse, where are my pills?

    [*] people who are not physicists

    — Evan Goer #

  33. I wrote

    Boooooring!
    A two line fix in MT::Util->encode_html .

    OK. I partially retract that.

    My dividing line between boring and interesting is whether only the miscreant in question is exposed to a YSoD (in this case, an ill-formed error message page), or others are exposed to a potential YSoD.

    In this case, search queries are recorded in the MT logs, and are viewable by someone with Admin access.

    So this passes the “not boring” test, and requires an additional 1-line fix in MT::App::Search->_straight_search .

    My apologies.

    Evan wrote:

    Dagnabit …!

    Well, one thing that’s different now is that there’s extensive client support for inline SVG. Which is to say, there are — at least in principle — non-physicists who might pass your “Why are you doing this?” test.

    Other than that, nothing has changed …

    — Jacques Distler #

  34. My dividing line between boring and interesting is whether only the miscreant in question is exposed to a YSoD (in this case, an ill-formed error message page), or others are exposed to a potential YSoD.

    Does the “Testing &?” link here count as exposing others to a potential YSoD? (Incidentally, that suggests an XSS hole, though I’m not sure how to exploit it except in text/html browsers – XHTML wins on security.)

    — Philip Taylor #

  35. > though I’m not sure how to exploit it

    Add quotes around the attribute value and a matching iframe end tag.

    — Mark #

  36. Add quotes around the attribute value and a matching iframe end tag.

    That would be the quite obvious thing to try, but the difficulty is that an end tag typically requires a slash, and adding slashes to that URL causes a 404, and I can’t find any way around either of those issues. But it looks like the site is now fixed anyway, so I must find an alternative mechanism for any nefarious schemes.

    — Philip Taylor #

  37. Now that was good!

    I’m not 100% happy with my 3-line fix (see Revision 226), but that was a real issue (both were instances of the same issue).

    Thanks for pointing it out.

    — Jacques Distler #

  38. By the way, I’m unsure that the XSS issue was exploitable, even in text/html. What does an <iframe> inside the <head> do?

    — Jacques Distler #

  39. An <iframe> inside the <head> simply gets moved to the body, and then rendered.

    Ian Hickson’s Live DOM Viewer is helpful for visualizing what transformations your browser’s internal representation of any given HTML would be.

    — Sam Ruby #

  40. And of course, an unescaped <iframe> inside a comment on this weblog simply gets eaten by a grue.

    — Sam Ruby #

  41. An <iframe> inside the <head> simply gets moved to the body, and then rendered.

    That’s true if it’s a child of <head>. I should have specified that I was asking about a child of <title> (which is, in turn, a child of <head>)

    Ian Hickson’s Live DOM Viewer is helpful for visualizing what transformations your browser’s internal representation of any given HTML would be.

    According to that tool, such an <iframe> does not get moved.

    But one would not want to rely on the crazy behaviour of browsers in this regard.

    — Anonymous #

  42. <title> can’t contain tags (< gets treated as a text character, though with extra magic when it’s followed by “!--”), and nothing can close a <title> except for the string “</title”, so I think that part is reasonably safe. But that page printed the user-input string in a heading too, and IE6 successfully executed the injected script there.

    Of course there are problems like this too.

    — Philip Taylor #

  43. But that page printed the user-input string in a heading too, and IE6 successfully executed the injected script there.

    Which is to say that you successfully executed the script against yourself.

    On a /show/ page, the string in question appears in the <title> element.

    No matter. It needs to be escaped, and now it is.

    Of course there are problems like this too.

    Very impressive!

    How did you insert alphanumeric characters into request.remote_ip ? I didn’t think that was possible. Evidently, it is, which explains Revision 228.

    — Jacques Distler #

  44. By the way, Sam, I expect that your Weblog (which displays remote commenters’ IP addresses) may be vulnerable to the same attack.

    — Jacques Distler #

  45. By the way, Sam, I expect that your Weblog (which displays remote commenters’ IP addresses) may be vulnerable to the same attack.

    I don’t believe so, but both you and Philip are welcome to try. I don’t consider my weblog mission critical. Breaking a page on my weblog is cause for a good laugh, and interesting discussion.

    In any case, the reason I don’t think so is that I take all the user contributed content and pass it through a single function that deals with things like XML’s limitations and Python’s idiosyncrasies in dealing with Unicode. If you find a way that breaks my code based on the REMOTE_ADDR, it will probably break my code in the same way when put in the content itself.

    — Sam Ruby #

  46. you successfully executed the script against yourself.

    and against anyone who innocently clicked on a link I posted anywhere on the web, letting me execute my code on their computer with the security privileges of your domain (hence letting me access their session cookies and transmit them back to me, or letting me send HTTP requests (e.g. to admin pages) with their authentication status, if they are already logged in to your site). And it’s usually not too hard to make someone just click on a link.

    How did you (…removed word to make mod_security happy…) alphanumeric characters into request.remote_ip ?

    Rails’ remote_ip returns @env['HTTP_CLIENT_IP'], and the HTTP Client-IP header can contain anything. (A comment in that function says “Security note: do not use if IP spoofing is a concern for your application.”, but doesn’t make it clear that it might not return an IP address at all.)

    Revision 228

    Getting closer. (Oh, looks like you fixed it before I finished writing this comment.)

    (Of course I hope you’re not unfairly discriminating against people with IPv6 addresses…)

    — Philip Taylor #
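
A sketch of the defensive step this implies, in Python rather than Rails: treat a client-supplied address header as untrusted text and display it only if it parses as an IPv4 or IPv6 address. The function name and the fallback string are assumptions for illustration. The standard `ipaddress` module does the parsing, so no hand-written regexp is involved.

```python
import ipaddress


def displayable_ip(header_value: str, fallback: str = "bogus address") -> str:
    """Return a canonical IP string, or the fallback if the value is not an IP."""
    try:
        return str(ipaddress.ip_address(header_value.strip()))
    except ValueError:
        # Anything that is not a literal IPv4/IPv6 address is never echoed back.
        return fallback
```

Because the returned string is re-serialized from the parsed address, it can only contain the characters that appear in an address, whatever the header held.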

  47. Rails’ remote_ip returns @env['HTTP_CLIENT_IP'], and the HTTP Client-IP header can contain anything. (A comment in that function says “Security note: do not use if IP spoofing is a concern for your application.”, but doesn’t make it clear that it might not return an IP address at all.)

    Ah!

    Most interesting to know.

    (Of course I hope you’re not unfairly discriminating against people with IPv6 addresses…)

    By “discriminate”, you mean “not display their IPv6 address next to their name”? Alas, my hastily-written fix does discriminate against them in that fashion.

    I shall have to remedy that … (it’s easy enough).

    — Jacques Distler #

  48. What’s a regexp that matches an IPv6 address (in any of its glorious variations)?

    — Jacques Distler #

  49. I think (but haven’t tested) that the ABNF in RFC 2373 corresponds to ^[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*|[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*::([0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*)?|::([0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*)?(:[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})?$ but I’m not sure anyone would complain if you used ^[0-9a-fA-F.:<]+$ instead.

    — Philip Taylor #

  50. PerlMonks to the rescue: http://www.perlmonks.org/?node_id=324473

    which leads to http://search.cpan.org/src/TMONROE/Net-IPv6Addr-0.2/IPv6Addr.pm

    which scares the living shit out of me.

    — Mark #

  51. No need to reinvent the wheel.

    require 'resolv'
    ip.gsub!(Regexp.union(Resolv::IPv4::Regex, Resolv::IPv6::Regex), '') || 'bogus address'

    seems to work quite well.

    — Jacques Distler #

  52. Ha ha ha.

    Not only does WordPress swallow unpredictable bits of markup, it also swallows backslashes.

    Backslashes!

    Teh funny. You’ll just have to figure out where the ‘backslash-0’ goes in the above expression.

    — Jacques Distler #

  53. I really don’t get the amount of HTML info listed. So many computer thingies. How is someone going to get back a file that was lost? I don’t really know English well, so please help, and it’s hard to see my past files. ??? Am really confused. I tried but can’t get the past file back.

    — tom faced #

§


© 2001–present Mark Pilgrim