Tim Bray is learning Python and using my feed parser to parse the feeds at Planet Sun. I am suitably flattered, and I sincerely hope that one of the 57 lines in Tim’s first Python program checks the bozo bit so Tim can ignore the 13 Planet Sun feeds which are not well-formed XML.
One is served as text/plain, which means it can never be well-formed.
Two (a, b) contain invalid XML characters.
Ten (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) are served as text/xml with no charset parameter. Clients are required to parse such feeds as us-ascii, but the feeds contain non-ASCII characters and are therefore not well-formed XML.
On a positive note, it’s nice to see that Norman Walsh has an Atom feed (#10 in that list). Pity it’s not well-formed. I’m sure he’ll fix that in short order. He’s no bozo.
You know what I want for Christmas? Markup Barbie. You pull a string and she says “XML is tough.”


ROTFC&GWPM
I have to have Markup Barbie. Off to google for “hacking talking barbie”…
Comment by Phil Ringnalda — Wednesday, July 7, 2004 @ 3:05 am
IMO, the text/xml content type is one of the most unfortunate things that have happened to XML. Using it instead of application/xml almost warrants raising the bozo bit pre-emptively.
Comment by Henri Sivonen — Wednesday, July 7, 2004 @ 3:24 am
For those wondering what the hell Phil’s unmarked abbreviation means:
http://philringnalda.com/blog/2004/04/meet_the_new_boss.php
Comment by Matt — Wednesday, July 7, 2004 @ 3:56 am
Sounds like the only bad xml are the two with invalid characters. The rest of the problems have nothing to do with XML or its well-formedness. Transport of the XML is the issue (which is the same problem people have been screwing up since the first heterogeneous network).
XML itself is pretty damn trivial if you use the right tools to generate and parse, rather than think “I’ll just use print”.
Comment by Stuart — Wednesday, July 7, 2004 @ 5:18 am
> The rest of the problems have nothing to do with XML or its well-formedness.
You must be new here.
Comment by Mark — Wednesday, July 7, 2004 @ 8:15 am
For reference, RFC 3023 and the author of RFC 3023 on the well-formedness of XML served as text/plain.
Comment by Mark — Wednesday, July 7, 2004 @ 8:24 am
Mark, I think you might have misinterpreted that email judging by what you wrote above. It states:
“So, I think that MIME entities labelled as text/plain are not well-formed XML documents.”
I don’t think he was drawing a distinction between well-formed and malformed XML documents. I think he was just saying “nope, it’s not XML”. If the documents were to be treated as XML, they could indeed be well-formed, it’s just that proper use of HTTP means that they shouldn’t be treated as XML.
So, at least in this case, I agree with Stuart, this is an HTTP issue, not an XML issue. The same applies to anything served via HTTP with the wrong Content-Type header - “HTML” served as text/plain? Not HTML. “JPEG” served as application/xhtml+xml? Not JPEG. And so on. It’s not an XML issue.
Comment by Jim — Wednesday, July 7, 2004 @ 8:39 am
On the atom-syntax mailing list, I asked a specific question: how I should handle XML served as text/plain. The answer came back quite clearly, and from a source I would treat as authoritative: anything served as text/plain is not a well-formed XML document. At that point, the provisions of the XML specification kick in and demand immediate rejection.
Well-formedness is more than bytes; it’s also the transport. The XML specification itself references RFC 3023 to determine the character encoding of a document served over HTTP. And it makes no sense to say that an XML document is well-formed “except for that pesky encoding issue.” Determining the character encoding is a prerequisite for parsing, and therefore well-formedness.
We’ve been grappling with this issue for some time now on atom-syntax. There’s no way around RFC 3023, and believe me we’ve tried. If you’ve been downloading XML and throwing it blindly into your XML parser without looking at the Content-type header, *you are doing it wrong*. There’s no two ways to look at it. You’re just wrong. Stop doing that.
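As a rough illustration of the rule being argued over here, the following Python sketch works out which character encoding a client has to assume from the Content-Type header alone. It covers only the two cases discussed in this thread (text/xml and application/xml). The function names are made up for this example and are not part of any library.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (media_type, charset or None) from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


def charset_for_http_xml(header_value):
    """Charset the transport dictates, or None if the XML body decides."""
    media_type, charset = split_content_type(header_value)
    if charset:
        # An explicit charset parameter is authoritative.
        return charset
    if media_type == "text/xml":
        # RFC 3023: text/xml with no charset parameter means us-ascii.
        return "us-ascii"
    # e.g. application/xml: fall back to the BOM / encoding declaration.
    return None


def violates_transport_charset(body, header_value):
    """True if the bytes cannot be decoded in the charset the header implies."""
    charset = charset_for_http_xml(header_value)
    if charset is None:
        return False  # nothing to check at the transport level
    try:
        body.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return True
    return False
```

Under these assumptions, a feed containing "é" and served as plain `text/xml` is flagged, while the same bytes served as `text/xml; charset=utf-8` are not.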
Comment by Mark — Wednesday, July 7, 2004 @ 9:08 am
Actually, the problem with http://blogs.sun.com/roller/rss/rgwk isn’t that it contains illegal characters - it doesn’t.
The problem is that the document appears chopped off in the middle - the last line is:
<guid isPermaLink="true">http://blogs.sun.com/roller
Comment by Daniel Martin — Wednesday, July 7, 2004 @ 9:46 am
I stand corrected. I had assumed that that was just the parser’s way of telling me there was an invalid character, and refusing to show me any further characters.
Comment by Mark — Wednesday, July 7, 2004 @ 11:29 am
> The answer came back quite clearly, and from a source I would treat as authoritative: anything served as text/plain is not a well-formed XML document.
So say that instead of “which means it can never be well-formed.” Once more, the issue is that it isn’t an XML document, not that there’s a well-formedness error.
> At that point, the provisions of the XML specification kick in and demand immediate rejection.
No, the rejection happens before then. The XML specification doesn’t even apply. From RFC 2616, section 7.2.1:
“If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource.”
The resource should be rejected before it is even parsed. It isn’t XML of any kind, well-formed, malformed, anything. It is plain text. The HTTP specification says so clearly. I’ll take the unambiguous word of a specification over an ambiguous statement by a specification author any day.
As for the character encoding issue, I never argued against your point there. I’m arguing solely that “Content-Type: text/plain” is an HTTP complication and not an XML complication.
> If you’ve been downloading XML and throwing it blindly into your XML parser without looking at the Content-type header, *you are doing it wrong*. There’s no two ways to look at it. You’re just wrong. Stop doing that.
I’m arguing that it’s an HTTP requirement, not that the requirement doesn’t exist.
Comment by Jim — Wednesday, July 7, 2004 @ 12:01 pm
Jim, I think we’re saying the same thing. A document served as “text/plain” can never be considered well-formed XML, because it must never be treated as XML in the first place. If you get as far as handing it to your XML parser to determine its well-formedness, you have already made an error.
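A minimal Python sketch of that "check before you parse" step might look like the following. It treats text/xml, application/xml, and any type with a +xml suffix as XML, which is a simplification of RFC 3023, and the names `is_xml_media_type` and `require_xml` are hypothetical.

```python
# Media types treated as XML in this sketch (a simplified reading of RFC 3023).
XML_MEDIA_TYPES = {"text/xml", "application/xml"}


def is_xml_media_type(content_type_header):
    """True if the Content-Type header names an XML media type."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in XML_MEDIA_TYPES or media_type.endswith("+xml")


def require_xml(content_type_header):
    """Raise before any XML parser is involved if the server said 'not XML'."""
    if not is_xml_media_type(content_type_header):
        raise ValueError(
            "refusing to parse as XML: served as %r" % content_type_header
        )
```

The point of raising in `require_xml` is that the rejection happens on the HTTP side; the body never reaches an XML parser, so the question of well-formedness never comes up.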
Comment by Mark — Wednesday, July 7, 2004 @ 3:48 pm
I agree with that, I just take exception to you holding it up as evidence that “XML is tough” :)
Comment by Jim — Wednesday, July 7, 2004 @ 5:47 pm
HTTP is tough :/
Comment by Robert Sayre — Wednesday, July 7, 2004 @ 6:17 pm
What if you save the text/xml document to a file, and then read that file into an XML parser? Does it then magically become well-formed, because different rules apply?
Comment by Charles Miller — Wednesday, July 7, 2004 @ 10:18 pm
I believe the program doing the downloading is expected to munge the XML declaration before saving locally, but I could be totally making that up. I would certainly expect that. Consider the case of a charset parameter given in the Content-type HTTP header. Without munging the XML declaration, that information would be lost.
Vaguely related, or perhaps not, but I just thought of it: when IE 6 retrieves a resource (like a JPEG) served as “image/jpeg”, it will change the file extension to “.jpg” before saving it in the local cache. This is to prevent malicious servers from serving files with executable file extensions (like “.exe”) as innocuous media types (like “image/jpeg”) in order to get executables on the file system for later abuse.
In general, it is up to the caching application to preserve the fidelity of the original file. The original media type and character encoding information is a big part of that (especially in the case of XML), so I would expect programs to preserve it within the file itself when saving locally.
Comment by Mark — Wednesday, July 7, 2004 @ 10:38 pm
I really don’t understand why or how anyone can think XML is so tough. It’s hands down easier than HTML for me, and much less of a headache. Always has been.
Comment by Devon — Wednesday, July 7, 2004 @ 10:47 pm
According to RFC 3023, “Unless the charset is UTF-8 or UTF-16, the recipient SHOULD also persistently store information about the charset, perhaps by embedding a correct XML encoding declaration within the XML MIME entity.” I haven’t been able to find an HTTP UA which actually does, yet, in the simple case of text/xml without a charset in the Content-type header, but they SHOULD.
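One way a client could follow that advice, sketched in Python: decode the body with the charset taken from the transport, then save it as UTF-8 with a matching encoding declaration. This is a variation on what the RFC suggests (it transcodes rather than keeping the original bytes), it hard-codes `version="1.0"`, and it drops any `standalone` attribute. The function name is invented for the example.

```python
import re

# Matches a leading XML declaration, but not e.g. <?xml-stylesheet ...?>.
_XML_DECL = re.compile(r"^<\?xml\s[^>]*\?>\s*")


def to_storable_utf8(body, transport_charset):
    """Return UTF-8 bytes whose XML declaration matches their real encoding."""
    text = body.decode(transport_charset)
    text = text.lstrip("\ufeff")             # drop a byte order mark, if any
    text = _XML_DECL.sub("", text, count=1)  # remove the old declaration
    return ('<?xml version="1.0" encoding="utf-8"?>\n' + text).encode("utf-8")
```

After this step the file on disk carries its own encoding information, so nothing is lost when the HTTP headers are gone.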
Comment by Phil Ringnalda — Thursday, July 8, 2004 @ 12:32 am
> What if you save the text/xml document to a file, and then read that file into an XML parser? Does it then magically become well-formed, because different rules apply?
When you are loading a resource from an HTTP source, HTTP rules apply when deciding whether to treat it as XML or not. If the Content-Type header is present, this should be the deciding factor.
When you are loading a resource from disk, normal OS rules apply when deciding whether to treat it as XML or not. In many operating systems, the filename extension is the deciding factor. This isn’t standardised behaviour, though; for instance, with a filesystem like Reiser4, the Content-Type for files downloaded through HTTP could be saved as metadata.
Whether your HTTP client automatically saves filenames with the correct extension or otherwise preserves Content-Type information, and whether it allows you to override it is an implementation issue and is not standardised.
> I haven’t been able to find an HTTP UA which actually does, yet, in the simple case of text/xml without a charset in the Content-type header, but they SHOULD.
Please remember that not all HTTP UAs are XML UAs. It would be a nightmare if every HTTP UA had to implement every media type RFC.
Comment by Jim — Thursday, July 8, 2004 @ 7:44 am
Mark:
>In general, it is up to the caching application to preserve the fidelity of the original file. The original media type and character encoding information is a big part of that (especially in the case of XML), so I would expect programs to preserve it within the file itself when saving locally.
Jim:
>Please remember that not all HTTP UAs are XML UAs. It would be a nightmare if every HTTP UA had to implement every media type RFC.
–
There’s the rub, no? XML requires data inside the envelope, but not all UAs know that data in the envelope should be updated.
Comment by Jeremy Dunck — Thursday, July 8, 2004 @ 1:21 pm
In the meantime try this Barbie: http://www.gophergas.com/funstuff/t-barbie.htm
Comment by Jens — Thursday, July 8, 2004 @ 11:21 pm
I want a t-shirt with this saying:
“You’ve obviously mistaken me for someone who understands XML.”
Comment by jcwinnie — Friday, July 9, 2004 @ 7:48 am
Some of us continue to believe that the condition where a bag of bits which is intended to be an XML document isn’t because it has fractured syntax can usefully be distinguished from the condition where a bag of bits which in fact is an XML document is delivered with incorrect headers. In particular since the second case is distressingly common and in at least some cases apparently can’t be fixed by the person providing said bag of bits.
Comment by Tim Bray — Friday, July 9, 2004 @ 10:43 pm
> can usefully be distinguished from the condition where a bag of bits which in fact is an XML document is delivered with incorrect headers
I agree that the situations can be distinguished; the second case, in addition to being non-well-formed XML, presents a security risk and goes against the TAG finding on authoritative metadata:
http://www.w3.org/2001/tag/doc/mime-respect.html#override-risks
“A user agent that does not respect protocol specifications can violate user privacy, produce security holes, and otherwise create confusion. For example, a user agent can create a security problem by ignoring a “Content-Type” header with value text/plain”
The “classic” example of such a security risk is misinterpreting a plain text document as a shell script, but misinterpreting it as a feed is just as dangerous, since feeds can contain embedded or inline (X)HTML markup, which can in turn contain dangerous script.
- How to consume RSS safely
- Why sanitize
- Sanitize unit tests
UFP 3.3 will flag text/plain feeds as bozo=1, bozo_exception=NonXMLContentType. You are free to ignore such errors, but you are disrespecting multiple MIME, HTTP, and XML RFCs, as well as a TAG finding.
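For anyone wondering what checking the bozo bit looks like in practice, here is a small Python sketch. It assumes only that each parse result is dict-like with a `bozo` key and, when set, a `bozo_exception` key, as described above. The helper names are invented for illustration, and the feed parser itself is not imported, so the helpers also work on plain dicts.

```python
def is_bozo(result):
    """True if the parser flagged this feed as not well-formed."""
    return bool(result.get("bozo"))


def bozo_reason(result):
    """Name of the exception class behind the bozo flag, or None."""
    exc = result.get("bozo_exception")
    return type(exc).__name__ if exc is not None else None


def usable_feeds(results):
    """Keep only the results whose bozo bit is not set."""
    return [r for r in results if not is_bozo(r)]


# Typical use (requires the third-party feedparser module):
#   import feedparser
#   results = [feedparser.parse(url) for url in urls]
#   for r in usable_feeds(results):
#       ...
```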
Comment by Mark — Saturday, July 10, 2004 @ 9:03 am
hehe… ‘hacking vocabulary barbie’ or better yet… ‘L33t Speak barbie’ with phrases like “1 w1ll pwn j00r b0x h4w!” batteries not included!
Comment by ITIL Consultant — Monday, July 12, 2004 @ 8:52 am
Could someone point me toward some information on how to take user input and make sure it is in some particular charset? Preferably a PHP example? I’m dealing with users who write entries in Word and WordPerfect, on Macs or PCs, then copy and paste the text to their web browsers, then input that, and then I have to take the input and make valid XML. But I can’t know the charset of the input ahead of time.
Comment by Lawrence Krubner — Monday, July 12, 2004 @ 11:29 am
> Could someone point me toward some information on how to take user input and make sure it is in some particular charset? Preferably a PHP example?
This UTF-8 to code point array converter can be used for checking if the input is UTF-8 (without any optional PHP extensions).
Recognizing other encodings is hard and requires statistical methods that analyze the spectrum of scalar byte values and guess a plausible encoding.
> I’m dealing with users who write entries in Word and WordPerfect, on Macs or PCs, then copy and paste the text to their web browsers, then input that, and then I have to take the input and make valid XML. But I can’t know the charset of the input ahead of time.
If you serve the form HTML as UTF-8, contemporary browsers submit UTF-8 back.
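The same idea can be sketched in Python rather than PHP: first check that the submitted bytes really are UTF-8, then check that the decoded text contains only characters XML 1.0 allows. The character ranges come from the `Char` production in the XML 1.0 specification. The function names are hypothetical.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_utf8_or_none(data):
    """Return the decoded text if data is valid UTF-8, else None."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def first_illegal_xml_char(text):
    """Index of the first character XML 1.0 forbids, or None if all are legal."""
    match = _NOT_XML_CHAR.search(text)
    return match.start() if match else None
```

Input that fails the first check was probably submitted in some other encoding. Input that passes the first check but fails the second usually contains stray control characters from the paste.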
Comment by Henri Sivonen — Monday, July 12, 2004 @ 2:43 pm