Jon Udell: Oops, broke the News page. Steven Vore pointed out that the previous item [on Jon's weblog] breaks the News page [for Radio users]. I removed the angle brackets surrounding the word “title” as a workaround. If you’ve already received the feed containing that item, though, you should probably unsub/resub. And, I guess, avoid using amp-lt-semi until this gets sorted out.

This is the same bug I inadvertently triggered yesterday. Readers were confused then for two reasons. First, I triggered it in a post that was, coincidentally, about a similar (but different) HTML escaping bug in Manila (not Radio). Second, my post actually redirected readers from their News page to somewhere else: redirection was what the sample SCRIPT in my post demonstrated, and the Radio bug caused that SCRIPT tag to be executed instead of displayed.

Garth Kidd tried to reproduce the problem but failed, and for a moment, I was questioning whether it really was a Radio bug. But now that Jon Udell is reporting it on a completely unrelated post, I’m convinced. It’s not just SCRIPT tags; it can happen with any post that includes properly encoded tags: the tags are supposed to be displayed, but Radio double-decodes them so they end up getting interpreted instead.

For the 99% of you who have no idea what’s going on, here’s a short primer on HTML entity encoding:

HTML is structured markup, where text is delimited by tags in angle brackets (< and >). Like any sort of delimited data, you inevitably run into the situation where you need to display the delimiter itself. Comma-separated data has the same problem (what if the data contains a comma?), as does tab-separated data, and so forth. HTML is no different, and the solution is both simple and relatively elegant.

To represent a left angle bracket or right angle bracket, you use what’s called an entity: you spell out the character you want, and your browser is smart enough to display the character it represents. So a left angle bracket (“less than”) that is meant to be displayed is written as “&-l-t-;” (without the quotes or dashes, that’s just to show what’s going on). And a right angle bracket (“greater than”) is written as “&-g-t-;”. Simple, no?

Well, no. Because this brings up another problem: we’re using an ampersand to delimit the start of an entity, so now we need another entity to represent an ampersand that is actually supposed to be displayed (instead of being part of an entity). So we spell that out too: “&-a-m-p-;”.
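One practical consequence of this is that the order of replacement matters when you encode by hand. The following is a minimal sketch, not any particular tool's implementation; the function name `encode_entities` is made up for illustration.

```python
def encode_entities(text: str) -> str:
    """Hand-rolled entity encoding for the three characters discussed above.

    The ampersand must be replaced first. If it were replaced last, the "&"
    at the start of each freshly written entity would itself be re-encoded.
    """
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(encode_entities("a < b & c > d"))  # a &lt; b &amp; c &gt; d
```

Swapping the order (brackets first, ampersand last) would turn a left angle bracket into a double-encoded entity, which is the mirror image of the decoding bug described later in this post.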

The process of converting angle brackets and ampersands to their entity representations is called “HTML entity encoding”. Every web developer quickly learns how to do this in their chosen web development language, and virtually every web-friendly scripting language has a function to do this in one line. PHP has the htmlentities function; Python has cgi.escape; VBScript has Server.HTMLEncode; and so on.

Especially important to web developers is how to display text that comes from untrusted sources. Form input is the most common example; every message board system in the world takes user input in an HTML form and displays it on screen. To make sure that savvy users don’t insert nasty SCRIPT tags or other unwanted tags and hijack the message board, the web developer needs to properly encode the input. (It’s more complicated than this when the board allows you to use some HTML tags but not others, but never mind that.)

Once text is properly HTML encoded, it can safely be concatenated with an HTML template and displayed as part of a web page. There’s no chance of the user sabotaging the page, because all of their input has been “cleansed”: even if they included HTML tags in their input, these tags have been converted to their entity representations, so the tags will be displayed, not interpreted.
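Here is a short sketch of that message board scenario in Python. It uses html.escape, the current standard library function for this job (cgi.escape was removed in Python 3.8). The `render_comment` function and the template around it are hypothetical, made up purely for illustration.

```python
import html


def render_comment(user_text: str) -> str:
    """Encode untrusted text, then concatenate it into a (hypothetical) template."""
    # quote=False encodes only &, < and >, the three characters discussed above.
    safe = html.escape(user_text, quote=False)
    return '<p class="comment">' + safe + "</p>"


page = render_comment("<script>alert('hi')</script> & more")
print(page)
# <p class="comment">&lt;script&gt;alert('hi')&lt;/script&gt; &amp; more</p>
```

The SCRIPT tag survives only as text: a browser decodes the entities once and shows the angle brackets on screen instead of running anything.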

(Incidentally, this is the bug I discovered in Manila: it didn’t HTML-encode the referer strings it displayed on the /stats/referers page, so any malicious client could send HTML tags in the referer string. These tags would be interpreted directly by the browsers of readers who went to the referers page, thus allowing any malicious agent to hijack the page (for example, by inserting SCRIPT tags to redirect it). Ironically, it was my example of how to trigger this HTML encoding bug in Manila that inadvertently triggered the HTML decoding bug in Radio.)

The reverse process is called “HTML entity decoding”, and it’s what your browser does for you. When the web developer encodes tags into entities and sends them to your browser, your browser is smart enough to decode them back to angle brackets and ampersands to display them. Unfortunately, Radio’s HTML entity decoding function is broken, which causes it to decode HTML entities twice: once from “&-a-m-p-;-l-t-;” to “&-l-t-;”, then again from “&-l-t-;” to “<”. This is very, very bad: a left angle bracket that ought to be displayed as a left angle bracket is instead incorrectly decoded a second time and turned into a real left angle bracket, which your browser then (quite naturally) treats as if it were the start of an actual HTML tag.
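The two decoding steps can be reproduced in a few lines of Python. This is only an illustration of the effect using html.unescape from the standard library; it is not Radio's code, and the variable names are invented for the example.

```python
import html

# Text as it might arrive in a feed: the tag name "title" wrapped in
# angle brackets that were encoded, and then encoded once more for transport.
feed_text = "&amp;lt;title&amp;gt;"

# One decode is correct. The result is markup that a browser will
# display as literal angle brackets.
once = html.unescape(feed_text)
print(once)   # &lt;title&gt;

# A second decode (the bug) produces a real tag, which a browser
# will interpret instead of display.
twice = html.unescape(once)
print(twice)  # <title>
```

With “title” the damage is cosmetic. With a SCRIPT tag, the second decode hands the browser something it will execute.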

Unfortunately, there is no way for bloggers to work around this problem (other than to never talk about HTML tags). In theory, we could double-encode our HTML entities, but then they would look stupid to users using normal web browsers, which act correctly and only decode the entities once. The fix has to come from UserLand, the makers of Radio; until then, more and more bloggers will inadvertently trigger this bug and cause havoc on Radio users’ News aggregation page.
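To see why the extra layer of encoding is not a real workaround, here is a rough Python model of both kinds of client. Everything in it is an assumption for illustration: `shown_by_browser` is a stand-in for a browser rendering tag-free markup, and the “buggy” and “correct” values simply decode twice and once respectively.

```python
import html


def shown_by_browser(markup: str) -> str:
    """Stand-in for a browser rendering tag-free markup: decode entities once."""
    return html.unescape(markup)


normal_feed = "&amp;lt;title&amp;gt;"
# The hypothetical workaround: add one extra layer of encoding.
workaround_feed = html.escape(normal_feed, quote=False)

buggy = html.unescape(html.unescape(workaround_feed))  # client that decodes twice
correct = html.unescape(workaround_feed)               # client that decodes once

print(shown_by_browser(buggy))    # <title>
print(shown_by_browser(correct))  # &lt;title&gt;
```

The double-decoding client now shows what the author intended, but every correct client shows raw entity text instead of angle brackets.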


© 2001–9 Mark Pilgrim