Universal Feed Parser 3.2 is out. You can download it at SourceForge.

The main new feature in version 3.2 is completely revamped handling of character encoding. Previous versions relied on an odd combination of “do it in Python” and “let the XML parser handle it.” This version does everything in Python, then converts the feed to UTF-8 before handing it off to the XML parser. Every XML parser on Earth supports UTF-8.
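To make the "decode in Python, hand UTF-8 to the parser" step concrete, here is a minimal sketch. It assumes modern Python 3 and an encoding that has already been worked out; the function name `to_utf8` is hypothetical and this is not Universal Feed Parser's actual code.

```python
import re

# Matches an XML declaration at the very start of the document.
XML_DECL = re.compile(r"^<\?xml[^>]*\?>")


def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode raw feed bytes with a known encoding and re-encode as UTF-8.

    Hypothetical helper: `encoding` must already have been determined.
    """
    text = raw.decode(encoding)
    # Some codecs (e.g. "utf-16-le") leave the byte order mark in the text.
    text = text.lstrip("\ufeff")
    # Rewrite (or add) the declaration so it agrees with the new bytes.
    declaration = "<?xml version='1.0' encoding='utf-8'?>"
    if XML_DECL.match(text):
        text = XML_DECL.sub(declaration, text, count=1)
    else:
        text = declaration + "\n" + text
    return text.encode("utf-8")
```

The declaration has to be rewritten along with the bytes; otherwise the XML parser would be told one encoding and handed another.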

When I say “do it in Python,” I don’t mean actual Python code. Python has a surprisingly sane API for handling the insanity that is character encoding, and this makes it easy for third-party libraries to extend Python’s built-in encodings module to support additional encodings. One such module, CJKCodecs, adds support for Chinese, Japanese, and Korean encodings. CJKCodecs will be part of Python 2.4, but it is also downloadable for Python 2.1 and above. Another module, iconv_codec, is a Python wrapper for the marvelous libiconv, which supports several hundred encodings. Both are highly recommended, and Universal Feed Parser will use both if available.
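As a rough illustration of that extension mechanism, the sketch below uses the standard `codecs` module to ask whether an encoding is available and to register an extra codec name. It assumes Python 3; the helper names and the alias `xfeedlatin1` are made up for the example and have nothing to do with CJKCodecs or iconv_codec.

```python
import codecs


def encoding_is_supported(name: str) -> bool:
    """Return True if Python can find a codec for this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def _search(name):
    """Codec search function; third-party codec packages hook in like this.

    "xfeedlatin1" is a hypothetical alias that simply reuses ISO-8859-1.
    """
    if name == "xfeedlatin1":
        return codecs.lookup("iso-8859-1")
    return None


# After registration, the new name works everywhere a codec name is accepted.
codecs.register(_search)
```

Once a search function is registered, `bytes.decode` and `str.encode` accept the new name without any further changes to the calling code, which is why a parser can pick up extra codecs just by having them installed.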

Of course, nothing is ever as simple as it sounds. In rare cases, the character encoding of the feed is explicitly specified in the charset parameter of the Content-type HTTP header. But in most cases, you need to look at the encoding attribute in the XML declaration in the first line of the feed.
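A small sketch of that two-step lookup follows. It assumes Python 3 and borrows the standard library's `email.message.Message` to parse the header; the function names, the order of precedence, and the UTF-8 fallback are assumptions made for the example, not a description of what Universal Feed Parser does internally.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def choose_encoding(content_type, declared_encoding, fallback="utf-8"):
    """Prefer the HTTP charset, then the XML declaration, then a fallback.

    Both the precedence and the fallback are assumptions for this sketch.
    """
    return charset_from_content_type(content_type) or declared_encoding or fallback
```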

Previous versions of Universal Feed Parser naively used a regular expression on the raw byte stream to find the encoding attribute. This works most of the time, since many character encodings are compatible with the ASCII encoding for ASCII characters. (All the non-ASCII characters are encoded in the upper 128 values of a byte, or in multi-byte sequences.) However, this assumption fails for encodings that use more than one byte even for ASCII characters, such as UTF-16 and UTF-32. It also fails for non-ASCII-compatible single-byte encodings, such as EBCDIC.
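The failure is easy to reproduce. The sketch below, assuming Python 3, runs a naive byte-level regular expression (an illustrative pattern, not the one earlier versions actually used) over the same declaration in three encodings. Python's `cp037` codec stands in for EBCDIC here.

```python
import re

# Naive approach: look for the encoding attribute as ASCII bytes.
ENCODING_RE = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def naive_declared_encoding(raw: bytes):
    """Return the declared encoding found by a byte-level regex, or None."""
    match = ENCODING_RE.search(raw)
    return match.group(1).decode("ascii") if match else None


DOC = '<?xml version="1.0" encoding="{0}"?><feed/>'

latin1_bytes = DOC.format("iso-8859-1").encode("iso-8859-1")
utf16_bytes = DOC.format("utf-16").encode("utf-16")
ebcdic_bytes = DOC.format("cp037").encode("cp037")

print(naive_declared_encoding(latin1_bytes))  # ASCII-compatible: found
print(naive_declared_encoding(utf16_bytes))   # interleaved zero bytes: missed
print(naive_declared_encoding(ebcdic_bytes))  # different byte values: missed
```

In the UTF-16 bytes every ASCII character is padded with a zero byte, and in EBCDIC the characters `<?xml` map to entirely different byte values, so the pattern never matches in either case.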

Appendix F of the XML specification provides a heuristic for determining whether an XML document is in a non-ASCII-compatible encoding, and which one. The heuristic is actually divided into two parts, because all XML documents are allowed to start with something called a Byte Order Mark (BOM), which is a specific Unicode character (U+FEFF) that looks different depending on the encoding and the byte order used in the document. (BOM FAQ) So one part of the heuristic deals with XML documents with a BOM, and the other part deals with XML without a BOM, but with an XML declaration. It turns out that the first 4 characters <?xm look different in every character encoding too.
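Both parts of the heuristic can be sketched in a few lines. This is a simplified reading of the byte patterns listed in the XML specification, assuming Python 3; the function names are hypothetical, the table omits the unusual byte orders, and `cp037` is only one EBCDIC flavor chosen for the example.

```python
import codecs
import re

# Part 1: byte order marks. UTF-32 must be tested before UTF-16, because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Part 2: no BOM, so look at how the first bytes of "<?xm" are encoded.
PREFIXES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "utf-8"),   # any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),   # EBCDIC; the exact flavor is a guess
]

DECL_RE = re.compile(r'^<\?xml[^>]*encoding=["\']([^"\']+)["\']')


def sniff_encoding(raw: bytes):
    """Return (encoding family, BOM length), or (None, 0) if unrecognized."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    for prefix, name in PREFIXES:
        if raw.startswith(prefix):
            return name, 0
    return None, 0


def declared_encoding(raw: bytes):
    """Decode just well enough to read the encoding attribute, if any."""
    family, bom_length = sniff_encoding(raw)
    if family is None:
        return None
    text = raw[bom_length:].decode(family, errors="replace")
    match = DECL_RE.match(text)
    return match.group(1).lower() if match else None
```

The sniffed family only needs to be accurate enough to read the XML declaration; the encoding attribute found there then names the precise encoding to use for the whole document.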

I am pleased to announce that Universal Feed Parser now supports both parts of this heuristic. It can reliably detect and parse any feed encoded as UTF-32BE, UTF-32BE+BOM, UTF-32LE, UTF-32LE+BOM, UTF-16BE, UTF-16BE+BOM, UTF-16LE, UTF-16LE+BOM, UTF-8+BOM, or UTF-8. There are several new tests to confirm this.

Also EBCDIC. Did I mention it now supports EBCDIC? I’ve totally sold out to the BigCos. As an adjunct to JWZ’s Law of Software Envelopment (“every program attempts to expand until it can read mail”), I declare that every aggregator attempts to expand until it can read EBCDIC. You can use this test case to track your aggregator’s progress.

As a bonus, since the entire character encoding determination is finished before the feed is handed off to a real XML parser, it works just as well for non-well-formed feeds. Have you ever wanted to parse an ill-formed CDF feed encoded as UTF-32 Little Endian with a Byte Order Mark? Universal Feed Parser can do that.

© 2001–9 Mark Pilgrim