I am working on a major upgrade to my feed parser, and now is as good a time as any for a public beta release.
Download Universal Feed Parser 3.0 beta 15 (2004-02-11).
Changes from 2.x:
- Uses a real XML parser, if available. This means that all data will now come back as Unicode strings, and the XML library does all the fancy footwork of dealing with character encodings. Most Python distributions ship with an XML library which will work fine. It also works with PyXML. If no XML libraries are available, it automatically falls back to the 2.x-style parser based on regular expressions.
- If XML parsing fails due to well-formedness errors in the feed, it will set the bozo bit in the result, and store the first XML parsing error in bozo_exception. Then it will automatically fall back to the 2.x-style parser based on regular expressions. Some people seem to be laboring under the misapprehensions that (a) well-formedness is an indication of data quality, and (b) the client is the correct place to enforce data quality. It has been my experience that well-formedness is not a strong predictor of data quality; most of the feeds that fail to validate are well-formed crap, and most of the rest would be valid if not for a single transient well-formedness error. But whatever, if you’re the sort of person who insists on punishing your own users for the mistakes of others, you may now do so by checking the bozo bit.
- Identification of feed type and version. Note that this is just the declared version, not the actual version; if the feed declares itself as Netscape RSS 0.91 but uses Userland RSS 0.91 syntax and semantics, it’s identified as Netscape RSS 0.91. As of beta 15, the feed parser should correctly identify every version of RSS and Atom.
- Unit tests. Run feedparsertest.py to test the feed parser on your system. It’s been tested under Windows, Mac OS X, and Debian Linux, on Python 2.1, 2.2, and 2.3. Previous versions were not well-tested on Python 2.1, much to the dismay of those running Debian stable. This version was tested very well, a process which shook out a surprising number of obscure bugs that probably never affected you.
- Complete support for the Atom content model, including rich content in title, tagline, summary, info, and copyright. Also support for base64-encoded binary data.
- Renamed to “Universal Feed Parser”, to emphasize the parser’s content normalization features and de-emphasize its “parse at all costs” nature.

It would be nice if 3.0-final had better documentation, especially on the content normalization features, so you could easily see which data from which feed types and versions ends up where.
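
The strict-then-fallback behavior described in the list above can be sketched roughly as follows. This is not the feed parser's actual code: `parse_titles` and `TITLE_RE` are hypothetical names, the sketch only extracts `<title>` text, and it is written for a current Python 3 interpreter rather than the Python 2.x versions listed above. Only the `bozo` and `bozo_exception` keys are taken from the description.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical regex used only by the fallback path (assumption, not the
# parser's real pattern).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL)


def parse_titles(data):
    """Try a real XML parser first; on a well-formedness error, flag it and
    fall back to regular expressions."""
    result = {"bozo": 0, "titles": []}
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        # Record that the feed was not well-formed, and keep the first error.
        result["bozo"] = 1
        result["bozo_exception"] = exc
        # Fall back to the forgiving regex-based extraction.
        result["titles"] = [m.strip() for m in TITLE_RE.findall(data)]
    else:
        result["titles"] = [(el.text or "").strip() for el in root.iter("title")]
    return result


if __name__ == "__main__":
    broken = "<rss version='2.0'><channel><title>Example</title></channel>"
    parsed = parse_titles(broken)
    if parsed["bozo"]:
        print("not well-formed:", parsed["bozo_exception"])
    print(parsed["titles"])
```

A client that wants to be strict can reject the result when `bozo` is set; one that wants to be forgiving can ignore the flag and use the data anyway.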

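To illustrate what "declared version" means in the list above, here is a minimal sketch that looks only at what the feed says about itself (root element, `version` attribute, namespace, DOCTYPE) and never at the syntax actually used. The function name `declared_version` and the returned labels are assumptions made for this example, not the feed parser's own API, and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS10_NS = "http://purl.org/rss/1.0/"
RSS090_NS = "http://my.netscape.com/rdf/simple/0.9/"
ATOM03_NS = "http://purl.org/atom/ns#"
# Public identifier of the DTD that Netscape RSS 0.91 feeds declare.
NETSCAPE_DTD = "-//Netscape Communications//DTD RSS 0.91"


def declared_version(data):
    """Return a label for the version the feed declares (hypothetical helper)."""
    root = ET.fromstring(data)
    if root.tag == "rss":
        version = root.get("version", "")
        if version == "0.91":
            # Only the declaration counts: a Netscape DOCTYPE means
            # "Netscape RSS 0.91" even if the body uses Userland elements.
            flavor = "Netscape" if NETSCAPE_DTD in data else "Userland"
            return f"{flavor} RSS 0.91"
        return f"RSS {version}" if version else "RSS (undeclared version)"
    if root.tag == f"{{{RDF_NS}}}RDF":
        if root.find(f"{{{RSS10_NS}}}channel") is not None:
            return "RSS 1.0"
        if root.find(f"{{{RSS090_NS}}}channel") is not None:
            return "RSS 0.90"
        return "RDF (unknown vocabulary)"
    if root.tag == f"{{{ATOM03_NS}}}feed":
        return f"Atom {root.get('version', '(undeclared version)')}"
    return "unknown"


if __name__ == "__main__":
    print(declared_version('<rss version="2.0"><channel/></rss>'))
```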

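For the Atom content model mentioned in the list above, a rough sketch of how a content element's `mode` attribute (`xml`, `escaped`, or `base64` in Atom 0.3, defaulting to `xml`) might be turned into a value. The helper name `content_value` is hypothetical, namespaces are left out to keep the example short, and this is Python 3 rather than the parser's own implementation.

```python
import base64
import xml.etree.ElementTree as ET


def content_value(element):
    """Decode one Atom 0.3 content construct (hypothetical helper)."""
    mode = element.get("mode", "xml")
    if mode == "base64":
        # Binary payload: return raw bytes.
        return base64.b64decode(element.text or "")
    if mode == "escaped":
        # The XML parser has already unescaped the text once, so it now
        # holds the markup as a plain string.
        return element.text or ""
    # mode == "xml": inline markup, so re-serialize the child elements.
    parts = [element.text or ""]
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


if __name__ == "__main__":
    summary = ET.fromstring('<summary mode="escaped">&lt;b&gt;Bold&lt;/b&gt;</summary>')
    print(content_value(summary))
```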