RFC 3023 (XML Media Types) defines the interaction between XML and HTTP as it relates to character encoding. You see, both XML and HTTP have different ways of specifying character encoding and different defaults in case no encoding is specified, and determining which value takes precedence depends on a variety of factors.
This is insanely complicated, but I believe I finally have this right. Corrections welcome.
In XML, the character encoding is optional and can be given in the XML declaration in the first line of the document, like this:
<?xml version="1.0" encoding="iso-8859-1"?>
If no encoding is given and no Byte Order Mark is present (don’t ask), XML defaults to utf-8.
(For those of you smart enough to realize that this is a Catch-22, that an XML processor can’t possibly read the XML declaration to determine the document’s character encoding without already knowing the document’s character encoding, please read Appendix F of the XML specification and bow in awe at the intricate care with which this issue was thought out.)
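For the curious, here is a minimal sketch of the Byte Order Mark half of that autodetection, in Python. The name sniff_bom is made up for illustration, and this is not the full procedure: when no mark is present, a processor also has to look at how the bytes of "<?xml" themselves are laid out, which this sketch does not attempt.

```python
import codecs

# Longer marks are checked first: the UTF-32 little-endian mark begins with
# the same two bytes as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(body):
    """Return the encoding implied by a leading Byte Order Mark, or None.

    body is the raw document as bytes. (sniff_bom is a hypothetical helper.)
    """
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None
```

Once the mark (or the byte pattern) has narrowed down the encoding family, the processor can decode just enough of the document to read the encoding attribute in the XML declaration.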
HTTP uses MIME to define a method of specifying the character encoding, as part of the Content-Type HTTP header, which looks like this:
Content-Type: text/html; charset="utf-8"
If no charset is specified, HTTP defaults to iso-8859-1, but only for text/* media types. (Thanks, Ian.) For other media types, the default encoding is undefined, which is where RFC 3023 comes in.
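As a rough illustration, the media type and charset can be pulled out of a Content-Type header value with Python's standard email package, which understands MIME parameters. The function name parse_content_type is invented for this sketch, and it deliberately does not apply any default: a missing charset comes back as None so the caller can decide what the default should be.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (media_type, charset).

    charset is None when the header has no charset parameter.
    (parse_content_type is a hypothetical helper, not a library function.)
    """
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()    # lowercased "type/subtype"
    charset = msg.get_content_charset()    # lowercased, surrounding quotes removed
    return media_type, charset
```

For example, parse_content_type('text/html; charset="utf-8"') gives the media type text/html and the charset utf-8, while a bare application/atom+xml gives a charset of None.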
According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the encoding is

1. the charset parameter of the Content-Type HTTP header, or
2. the encoding attribute of the XML declaration within the document, or
3. utf-8.

On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

1. the charset parameter of the Content-Type HTTP header, or
2. us-ascii.

I bring this up to make two points. First, that Really Simple Syndication is really only simple if you ignore a bunch of stuff that you really should be paying attention to. And second, that beta 17 of my feed parser now supports the above-mentioned logic for determining the character encoding of a feed, and it has 37 test cases to back it up. Which is not to say that my code is right, but only to say that it does what I think it does.
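The two precedence orders described above can be sketched in a few lines of Python. To be clear, this is not the feed parser's actual code: the names rfc3023_encoding and declared_xml_encoding are invented here, the regular expression is a simplification that assumes an ASCII-compatible document with no Byte Order Mark, and returning None for media types RFC 3023 does not cover is my own choice.

```python
import re

# Simplified: finds encoding="..." in an XML declaration at the very start of
# the document. Assumes an ASCII-compatible encoding and no Byte Order Mark.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

APPLICATION_TYPES = {
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
}
TEXT_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def declared_xml_encoding(body):
    """Return the encoding named in the XML declaration of body (bytes), or None."""
    match = XML_DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def rfc3023_encoding(media_type, http_charset, body):
    """Pick the character encoding for an XML document served over HTTP.

    media_type and http_charset come from the Content-Type header
    (http_charset is None when no charset parameter was sent); body is the
    raw document as bytes. Returns None for non-XML media types.
    """
    media_type = media_type.lower()
    if media_type in APPLICATION_TYPES or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    ):
        # HTTP charset, then the XML declaration, then utf-8.
        return (http_charset or declared_xml_encoding(body) or "utf-8").lower()
    if media_type in TEXT_TYPES or (
        media_type.startswith("text/") and media_type.endswith("+xml")
    ):
        # The XML declaration is ignored completely for text/* types.
        return (http_charset or "us-ascii").lower()
    return None
```

The same document can therefore come out with different encodings depending only on how it was served: a feed that declares iso-8859-1 in its XML declaration is iso-8859-1 when sent as application/atom+xml with no charset, but us-ascii when sent as text/xml with no charset.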
§
© 2001–9 Mark Pilgrim