Even the experts can’t get it right 100% of the time.
![[screenshot of xml error on intertwingly.net]](/public/2008/03/draconian-640.png)
screenshot taken at 10:29 PM on March 8, 2008.
For the record, my site is valid HTML 5, except the parts that aren’t. My therapist says I shouldn’t rely so much on external validation.
You are here: dive into mark → Archives → March 2008 → Draconian error handling: still the worst idea ever
Oh, Mark, but it’s better than that.
The offending <meta> tag, which made the page ill-formed?
<meta http-equiv="x-ua-compatible" content="ie=7">
Microsoft: the root of all evil.
(Or maybe it’s Sam’s new Mac-mini.)
Comment by Jacques Distler — Sunday, March 9, 2008 @ 11:42 pm
Surely you mean “the love of Microsoft is the root of all evil.” :)
Comment by Mark — Sunday, March 9, 2008 @ 11:53 pm
I was about to write a comment saying “I wonder how it’s possible that you caught that, I can’t believe Sam’s site would be broken for more than 10 seconds…”—only to visit intertwingly.net and discover it still broken, as of now.
Comment by Justin Watt — Monday, March 10, 2008 @ 1:53 am
It’s just data.
Comment by Arien — Monday, March 10, 2008 @ 3:32 am
The great thing about XML’s well-formedness requirements is that this kind of thing can’t happen, because the author would catch this kind of error straight away. With HTML, and its lax parsing rules, this kind of error isn’t caught unless the author runs a conformance checker, which is why the Web is in such a horrendous state (93% syntactically non-conforming content, according to my multibillion file study).
Comment by Ian Hickson — Monday, March 10, 2008 @ 4:42 am
Heh. That reminded me of a similarly ironic screen shot I made a few years ago (not from a celebrity’s web site, but I still find it very funny): http://boinkor.net/misc/terrible-xml-error.png
Comment by Andreas Fuchs — Monday, March 10, 2008 @ 5:07 am
@Arien: Wow, that really made me laugh.
Comment by Noah Slater — Monday, March 10, 2008 @ 5:16 am
Ian Hickson, surely you jest?
http://diveintomark.org/archives/2004/01/14/thought_experiment
Comment by Noah — Monday, March 10, 2008 @ 5:19 am
lol, that closing line is epic.
Comment by Firas — Monday, March 10, 2008 @ 7:15 am
Typical Hixie… :) I mean, it’s never happened, ever.
Comment by Geoffrey Sneddon — Monday, March 10, 2008 @ 8:03 am
Snideness aside, what Ian says is exactly true… if IE handled XHTML, that is.
Comment by Jeff Schiller — Monday, March 10, 2008 @ 8:51 am
Heh. Mark, I’ll change camps if anyone ever catches me with a well-formedness error. (Not having implemented comments yet makes this easier for me to say, admittedly, but the way my weblog works does completely rule out mistakes of the sort that Sam made in this instance. And I fully intend to preserve that property across any features I add.)
As an aside, the way that Gecko deals with XML parsing errors is unnecessarily user hostile. Other browsers demonstrate that it’s very possible to bow out just a little more gracefully. Of course, the kind of screenshot they yield is lousy for arguing a point…
Comment by Aristotle Pagaltzis — Monday, March 10, 2008 @ 10:01 am
> Of course, the kind of screenshot they yield is lousy for arguing a point…
In this case, wouldn’t it have been the same result? The error was within the head element, so they wouldn’t render any of the body.
That said, Mozilla’s error handling of this case is like a vestige of another age, when some True Believer thought exactly what Ian said above (but without the sarcasm) and intentionally made the error display as unfriendly as possible. “Fail early and ugly,” and all that.
Comment by Mark — Monday, March 10, 2008 @ 10:14 am
Maybe someone can put some focus on Bug 418305…
Comment by Jeff Schiller — Monday, March 10, 2008 @ 10:36 am
Hey, at least in the previous age, when Mozilla failed on bad XML they would fail ugly.
In the current era, if your live bookmark points at an invalid feed, you get neither data nor an error message.
Comment by Kevin H — Monday, March 10, 2008 @ 10:41 am
> I’ll change camps if anyone ever catches me with a well-formedness error
Really?
Comment by Philip Taylor — Monday, March 10, 2008 @ 5:40 pm
Philip:
Yeah, the error pages are completely outside the code that can prevent malformed content from getting out. I should probably serve them as `text/html` as they’re just SSIs. I could get mod_include to URI-encode the variable instead of entity-encoding it, but that would just let you trip it with bare ampersands et al; I see no way to make mod_include encode both ways. Pointers welcome. Whenever I implement comments I’ll have the necessary code on the server and I’ll take care of it properly then.
(There are analogous known bugs at other strata too; f.ex. you can try to bring up `/log/999999/` and it will return an empty page with status 200 OK instead of throwing a 404 back at you.)
Mark:
> In this case, wouldn’t it have been the same result?
True.
Comment by Aristotle Pagaltzis — Tuesday, March 11, 2008 @ 1:17 am
@Aristotle: so, you’ll change camps if anyone ever catches you with a well-formedness error on the pages that you already know are well-formed? That doesn’t seem as impressive as your original claim.
Comment by Mark — Tuesday, March 11, 2008 @ 4:44 am
If you limit it to static content, then clearly it’s trivial to ensure well-formedness - just generate all the pages offline, and only switch them into the live system after testing with a validator.
Most people want to generate content dynamically, and then it gets harder to ensure well-formedness in all cases.
Looking at 130K randomly selected pages from dmoz.org, 39 are served as application/xhtml+xml. Looking at the first few of those, every single site gets it wrong.
Some of these sites just need to remember to escape any user-input strings, the same as in text/html. The rest could be fixed by updating their HTML escaping function to do `s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/\x{FFFD}/g`, but that appears to be too much for anyone to manage.
Comment by Philip Taylor — Tuesday, March 11, 2008 @ 8:37 am
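For readers who want to try the substitution quoted above, here is a minimal Python sketch of the same idea. The function name `xml_safe` is made up for illustration, and pairing the character filter with `xml.sax.saxutils.escape` is an assumption about how one might wire it into an escaping function, not a description of any particular weblog package.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 Char production, mirroring the character
# class used in the Perl-style substitution quoted in the comment.
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_safe(text: str) -> str:
    """Replace characters XML 1.0 forbids with U+FFFD, then escape markup.

    `xml_safe` is a hypothetical helper name, not part of any library.
    """
    cleaned = _INVALID_XML_CHARS.sub("\uFFFD", text)
    # escape() handles &, < and > (it does not touch quote characters).
    return escape(cleaned)
```

The order matters only slightly here: filtering first means the escaping step never sees a character the parser would reject. Attribute values would additionally need quote escaping, which this sketch leaves out.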
Thanks Philip.
Yes, it’s true that my WP blog does not yet support invalid XML characters in comments or search terms. I hope that either I will have time to patch my WP install or that it will be fixed in a near-future version of the WP product. Until that time, I’ll just continue to manually monitor my blog comments as I do on a day-to-day basis.
I believe such incidents have only been people trying to break my blog and not people really trying to get access to the content. Could someone describe a scenario where a normal user might encounter such an error? Are there scenarios where these search queries might be used maliciously? Thanks.
Comment by Jeff Schiller — Tuesday, March 11, 2008 @ 11:51 am
Mark:
Yeah, anything I say now is not going to sound very convincing. The site is a hodgepodge; the weblog part is XML-based, but the rest is old crud that I never got bothered enough to integrate (I’d have to shave a Matryoshka yak to bring it all together). I didn’t even think of those vestiges when I wrote the previous comment. Sucks.
Let’s put it this way: I am confident that I’ll never have a well-formedness error in the part of the site that I constantly change.
And I continue to be confident about getting the parts right that take user input; whenever I write them…
Philip:
It’s not really any different from static content when you accept user input: make sure the user input passes muster before accepting it. If it doesn’t, ask the user to fix it while they’re still around to do so. To be less hostile, grind it through HTMLTidy/TagSoup/whatever and ask them if the result looks correct. Someone will have to error-correct it; best do that as far upstream as possible, ideally where the source has a chance to either consent to that interpretation or provide the right fix.
Seems obvious to me… we’ll see if I can walk the walk.
Comment by Aristotle Pagaltzis — Tuesday, March 11, 2008 @ 11:51 am
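The "check it upstream, while the user is still around" approach might look roughly like the following Python sketch. The function `check_fragment` and the wrapping `<div>` are assumptions made for illustration; a real comment form would also have to decide which elements to allow, which this does not attempt.

```python
import xml.etree.ElementTree as ET


def check_fragment(user_markup: str):
    """Return (ok, message) for a user-submitted XHTML fragment.

    The fragment is wrapped in a hypothetical <div> so that bare text and
    several sibling elements still parse as a single document.
    """
    wrapped = "<div>" + user_markup + "</div>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as err:
        # Positions in the message are relative to the wrapped string.
        return False, f"not well-formed: {err}"
    return True, "ok"
```

When the check fails, the message can be shown next to the form so the commenter can fix the input or approve a cleaned-up version. Note that named HTML entities such as `&nbsp;` are rejected by this parser, since no DTD is loaded.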
> Could someone describe a scenario where a normal user might encounter such an error?
When submitting feedback? Easily: Trackbacks have no defined encoding. If you sent an ISO-8859-1 trackback which happens to contain what would be a non-character if it were treated as UTF-8, then you could hit it then. That is: “?>. Why might someone write this? Well, let’s have a completely valid line of PHP: .
Comment by Geoffrey Sneddon — Tuesday, March 11, 2008 @ 12:26 pm
Thanks, I currently hide my trackback and xmlrpc PHP files. Let me modify the question:
Could someone describe a scenario where a normal user might encounter such an error when entering a comment on the website or search query? Is there any way to maliciously inject PHP via the comment mechanism within WordPress?
Sorry, I’m still very new to this encoding stuff… and sorry to be hi-jacking Mark’s blog for WordPress-specific questions.
Comment by Jeff Schiller — Tuesday, March 11, 2008 @ 12:35 pm
Yeah, SixApart really screwed the pooch on Trackback. Modern implementations send encoding information in HTTP headers (what a concept!), but if it’s missing or wrong, you’re still screwed. http://chardet.feedparser.org/ may help, but it’s best at Eastern European and CJK languages, not distinguishing win-1252 from iso-8859-1.
@Jeff: don’t worry, you’re totally on-topic. Aristotle said his system is based on XML; dunno if that means actually serializing a DOM though. Wordpress and MT (and Sam’s site, for that matter) are based on string templates, escaping functions, and faith. Pit that against XML’s error handling, and XML wins every time.
Comment by Mark — Tuesday, March 11, 2008 @ 1:25 pm
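As a rough illustration of the "declared encoding is missing or wrong" problem, the Python sketch below tries a declared charset first and then falls back to guesses. The function name `decode_trackback` and the fallback order are assumptions, not how any Trackback implementation actually behaves, and it does not use chardet.

```python
def decode_trackback(body: bytes, declared=None) -> str:
    """Decode bytes with the declared charset if it works, else guess.

    `decode_trackback` and its fallback order are illustrative only.
    """
    candidates = [declared] if declared else []
    candidates += ["utf-8", "windows-1252"]
    for name in candidates:
        try:
            return body.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes that are invalid in it.
            continue
    # Last resort: ISO-8859-1 maps every byte, so this cannot fail.
    return body.decode("iso-8859-1")
```

A wrong declaration that still happens to decode will produce mojibake rather than an error, so this only narrows the problem. The result can also contain control characters, which would still need filtering before going into XML.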
Venus is based on serializing a DOM. Even with that approach, you can have well-formedness errors. Invalid characters in a string, double dashes in a comment, element or attribute names with spaces in them…
P.S. I would not say faith. I have a very effective feedback loop mechanism in place. :-)
Comment by Sam Ruby — Tuesday, March 11, 2008 @ 3:43 pm
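The three failure modes listed in that comment can be reproduced with Python's standard `xml.dom.minidom`, used here purely as a convenient stand-in (it is not what Venus uses, as far as this thread says). The helper `round_trips` and the builder functions are hypothetical names; the sketch builds a small DOM, serializes it, and checks whether the output parses again.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def round_trips(build) -> bool:
    """Build a document, serialize it, and report whether it parses again."""
    doc = minidom.Document()
    root = doc.createElement("root")
    doc.appendChild(root)
    build(doc, root)
    try:
        minidom.parseString(doc.toxml())
    except (ExpatError, ValueError):
        # ExpatError: the serialized output was not well-formed.
        # ValueError: the serializer itself refused the content.
        return False
    return True


def control_char(doc, root):
    # A control character that XML 1.0 does not allow in text.
    root.appendChild(doc.createTextNode("bell: \x07"))


def spaced_name(doc, root):
    # An element name containing a space.
    root.appendChild(doc.createElement("bad name"))


def double_dash(doc, root):
    # A comment containing a double dash.
    root.appendChild(doc.createComment("a -- b"))


def plain(doc, root):
    # Ordinary text with markup characters, which the serializer escapes.
    root.appendChild(doc.createTextNode("AT&T <ok>"))
```

Only `plain` survives the round trip; the DOM API accepts the other three inputs at construction time without complaint.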
Mark:
> dunno if that means actually serializing a DOM though.
It does. The weblog resides in a master Atom feed that’s ground through XSLT to generate various views. Anything I build on top of that will work similarly.
As for how to handle Trackbacks, I think the correct answer is mu.
Comment by Aristotle Pagaltzis — Tuesday, March 11, 2008 @ 4:16 pm
Philip said:
Boooooring!
A two line fix in MT::Util->encode_html .
Surely, one can find a more challenging example of ill-formedness.
Comment by Jacques Distler — Tuesday, March 11, 2008 @ 5:57 pm
Dagnabit, I look away from the trainwreck for four years and absolutely nothing has changed. Why the hell are people[*] in 2008 *still* trying to deliver XHTML to the client?
Nurse, where are my pills?
[*] people who are not physicists
Comment by Evan Goer — Wednesday, March 12, 2008 @ 2:45 am
I wrote
OK. I partially retract that.
My dividing line between boring and interesting is whether only the miscreant in question is exposed to a YSoD (in this case, an ill-formed error message page), or others are exposed to a potential YSoD.
In this case, search queries are recorded in the MT logs, and are viewable by someone with Admin access.
So this passes the “not boring” test, and requires an additional 1-line fix in MT::App::Search->_straight_search .
My apologies.
Evan wrote:
Well, one thing that’s different now is that there’s extensive client support for inline SVG. Which is to say, there are — at least in principle — non-physicists who might pass your “Why are you doing this?” test.
Other than that, nothing has changed …
Comment by Jacques Distler — Thursday, March 13, 2008 @ 12:39 pm
Does the “Testing &?” link here count as exposing others to a potential YSoD? (Incidentally, that suggests an XSS hole, though I’m not sure how to exploit it except in text/html browsers - XHTML wins on security.)
Comment by Philip Taylor — Thursday, March 13, 2008 @ 10:48 pm
> though I’m not sure how to exploit it
Add quotes around the attribute value and a matching iframe end tag.
Comment by Mark — Thursday, March 13, 2008 @ 11:08 pm
That would be the quite obvious thing to try, but the difficulty is that an end tag typically requires a slash, and adding slashes to that URL causes a 404, and I can’t find any way around either of those issues. But it looks like the site is now fixed anyway, so I must find an alternative mechanism for any nefarious schemes.
Comment by Philip Taylor — Thursday, March 13, 2008 @ 11:40 pm
Now that was good!
I’m not 100% happy with my 3-line fix (see Revision 226), but that was a real issue (both were instances of the same issue).
Thanks for pointing it out.
Comment by Jacques Distler — Friday, March 14, 2008 @ 1:38 am
By the way, I’m unsure that the XSS issue was exploitable, even in text/html. What does an <iframe> inside the <head> do?
Comment by Jacques Distler — Friday, March 14, 2008 @ 1:52 am
An inside the simply gets moved to the body, and then rendered.
Ian Hickson’s Live DOM Viewer is helpful for visualizing what transformations your browser’s internal representation of any given HTML would be.
Comment by Sam Ruby — Friday, March 14, 2008 @ 3:55 am
And of course, an unescaped <iframe> inside a comment on this weblog simply gets eaten by a grue.
Comment by Sam Ruby — Friday, March 14, 2008 @ 3:58 am
That’s true if it’s a child of <head>. I should have specified that I was asking about a child of <title> (which is, in turn, a child of <head>)
According to that tool, such an <iframe> does not get moved.
But one would not want to rely on the crazy behaviour of browsers in this regard.
Comment by Anonymous — Friday, March 14, 2008 @ 8:56 am
<title> can’t contain tags (< gets treated as a text character, though with extra magic when it’s followed by “!--”), and nothing can close a <title> except for the string “</title”, so I think that part is reasonably safe. But that page printed the user-input string in a heading too, and IE6 successfully executed the injected script there.
Of course there are problems like this too.
Comment by Philip Taylor — Friday, March 14, 2008 @ 10:06 am
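The fix for the heading case is ordinary output escaping. A minimal Python sketch, assuming a hypothetical `render_search_heading` function and page layout (the site under discussion is not written in Python):

```python
from html import escape


def render_search_heading(query: str) -> str:
    """Build a heading for a hypothetical search results page."""
    # quote=True also escapes " and ', which matters if the same string is
    # ever placed inside an attribute value.
    return "<h1>Results for " + escape(query, quote=True) + "</h1>"
```

For example, `render_search_heading("Testing &?")` yields a heading containing `Testing &amp;?`, and an injected `<script>` tag comes out as inert text.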
Which is to say that you successfully executed the script against yourself.
On a /show/ page, the string in question appears in the <title> element.
No matter. It needs to be escaped, and now it is.
Very impressive!
How did you insert alphanumeric characters into `request.remote_ip`? I didn’t think that was possible. Evidently, it is, which explains Revision 228.
Comment by Jacques Distler — Friday, March 14, 2008 @ 11:59 am
By the way, Sam, I expect that your Weblog (which displays remote commenters’ IP addresses) may be vulnerable to the same attack.
Comment by Jacques Distler — Friday, March 14, 2008 @ 12:05 pm
I don’t believe so, but both you and Philip are welcome to try. I don’t consider my weblog mission critical. Breaking a page on my weblog is cause for a good laugh, and interesting discussion.
In any case, the reason I don’t think so is that I take all the user contributed content and pass it through a single function that deals with things like XML’s limitations and Python’s idiosyncrasies in dealing with Unicode. If you find a way that breaks my code based on the REMOTE_ADDR, it will probably break my code in the same way when put in the content itself.
Comment by Sam Ruby — Friday, March 14, 2008 @ 12:51 pm
and against anyone who innocently clicked on a link I posted anywhere on the web, letting me execute my code on their computer with the security privileges of your domain (hence letting me access their session cookies and transmit them back to me, or letting me send HTTP requests (e.g. to admin pages) with their authentication status, if they are already logged in to your site). And it’s usually not too hard to make someone just click on a link.
Rails’ `remote_ip` returns `@env['HTTP_CLIENT_IP']`, and the HTTP Client-IP header can contain anything. (A comment in that function says “Security note: do not use if IP spoofing is a concern for your application.”, but doesn’t make it clear that it might not return an IP address at all.)
Getting closer. (Oh, looks like you fixed it before I finished writing this comment.)
(Of course I hope you’re not unfairly discriminating against people with IPv6 addresses…)
Comment by Philip Taylor — Friday, March 14, 2008 @ 1:26 pm
Ah!
Most interesting to know.
By “discriminate”, you mean “not display their IPv6 address next to their name”? Alas, my hastily-written fix does discriminate against them in that fashion.
I shall have to remedy that … (it’s easy enough).
Comment by Jacques Distler — Friday, March 14, 2008 @ 1:37 pm
What’s a regexp that matches an IPv6 address (in any of its glorious variations)?
Comment by Jacques Distler — Friday, March 14, 2008 @ 2:00 pm
I think (but haven’t tested) that the ABNF in RFC 2373 corresponds to `^[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*|[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*::([0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*)?|::([0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4})*)?(:[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})?$` but I’m not sure anyone would complain if you used `^[0-9a-fA-F.:<]+$` instead.
Comment by Philip Taylor — Friday, March 14, 2008 @ 2:23 pm
PerlMonks to the rescue: http://www.perlmonks.org/?node_id=324473
which leads to http://search.cpan.org/src/TMONROE/Net-IPv6Addr-0.2/IPv6Addr.pm
which scares the living shit out of me.
Comment by Mark — Friday, March 14, 2008 @ 4:49 pm
No need to reinvent the wheel.
require 'resolv'
ip.gsub!(Regexp.union(Resolv::IPv4::Regex, Resolv::IPv6::Regex), '') || 'bogus address'
seems to work quite well.
Comment by Jacques Distler — Friday, March 14, 2008 @ 6:31 pm
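The same "let the standard library decide" idea can be sketched in Python with the `ipaddress` module. The function name `display_ip` is hypothetical, and rejecting anything containing `%` is a deliberate assumption of this sketch: recent Python versions accept an IPv6 scope ID after `%`, and that suffix may contain arbitrary characters.

```python
import ipaddress


def display_ip(raw: str) -> str:
    """Return a normalized IP address, or a placeholder if `raw` is not one.

    `display_ip` is an illustrative helper, not part of any framework.
    """
    candidate = raw.strip()
    # Refuse scoped addresses ("fe80::1%eth0"): the scope ID is free-form.
    if "%" in candidate:
        return "bogus address"
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return "bogus address"
```

Even with this check in place, the returned string should still go through the normal output escaping, so that a validation gap does not turn into a markup injection.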
Ha ha ha.
Not only does WordPress swallow unpredictable bits of markup, it also swallows backslashes.
Backslashes!
Teh funny. You’ll just have to figure out where the ‘backslash-0’ goes in the above expression.
Comment by Jacques Distler — Friday, March 14, 2008 @ 6:41 pm
I really don’t get the amount of HTML info listed. So many computer thingies. How is someone not going to get a file back that was lost? I don’t really know English well, so help; it’s hard to see my past files. ??? Am really confused. I tried but can’t get the past file back.
Comment by tom faced — Sunday, March 16, 2008 @ 6:47 am