dive into mark


Monday, April 19, 2004

Universal Feed Parser 3.0 beta 22

3.0 beta 22 of my Universal Feed Parser is out. This release fixes all known bugs, and I hope it will be the last beta before 3.0 final. After all, this is getting a bit ridiculous.

This release makes a significant change: if XML parsing fails due to character encoding problems, the parser will attempt to auto-determine the character encoding and re-parse with a real XML parser. When this happens, it is noted in the results as results['bozo'] = 1 and results['bozo_exception'] = feedparser.CharacterEncodingOverride. results['encoding'] will contain the encoding that was actually used to parse the feed (not the original declared encoding).
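To make the fallback idea concrete, here is a minimal sketch using only the standard library. It is not the parser's actual code: parse_with_fallback, the candidate list, and the local CharacterEncodingOverride class are all stand-ins invented for illustration, and the real parser's detection logic is more involved.

```python
import xml.etree.ElementTree as ET


class CharacterEncodingOverride(Exception):
    """Local stand-in for the exception named in the post (hypothetical)."""


def parse_with_fallback(data, declared, candidates=("utf-8", "windows-1252")):
    """Parse XML bytes, retrying with other encodings if the declared one fails."""
    tried = []
    for encoding in (declared,) + tuple(candidates):
        if encoding in tried:
            continue
        tried.append(encoding)
        try:
            # Decode first, then hand the text to a real XML parser.
            root = ET.fromstring(data.decode(encoding))
        except (UnicodeDecodeError, LookupError, ET.ParseError):
            continue
        results = {"bozo": 0, "encoding": encoding, "root": root}
        if encoding != declared:
            # The declared encoding was wrong; flag it the way the post describes.
            results["bozo"] = 1
            results["bozo_exception"] = CharacterEncodingOverride(
                "declared %s, parsed as %s" % (declared, encoding))
        return results
    raise ValueError("no candidate encoding produced well-formed XML")


# A feed that claims UTF-8 but actually contains a windows-1252 byte.
raw = b"<feed><title>caf\xe9</title></feed>"
results = parse_with_fallback(raw, declared="utf-8")
```

Here results['encoding'] holds the encoding that actually worked, not the declared one. A sketch like this only notices a wrong encoding when decoding or parsing raises an error; a wrong guess that happens to decode cleanly would go undetected.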

This release makes another significant change: Unicode support for ill-formed feeds. All individual data values will be returned as Unicode strings if they can be converted using the document’s character encoding. I had a flash of insight and suddenly the entirety of Python’s Unicode support became clear to me. I coded madly for several hours until it faded. It’s entirely possible that that’s just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.
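The "if they can be converted" part can be sketched roughly like this. The to_unicode helper is hypothetical, written for modern Python, and only illustrates the rule that a value is left alone when the document's encoding cannot decode it.

```python
def to_unicode(value, encoding):
    """Return value as a Unicode string if possible, otherwise unchanged."""
    if isinstance(value, str):
        return value  # already Unicode
    try:
        return value.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown encoding name: keep the raw value.
        return value


title = to_unicode(b"caf\xc3\xa9", "utf-8")
broken = to_unicode(b"\xff\xfe", "utf-8")
```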

This release also makes significant changes to internal classes. If you were subclassing or accessing these classes, your code will likely break. If you were just using the public parse() function, you will not notice any change.

My change reporting history has been lax throughout the 3.0 beta process, so I went back and recreated it from file timestamps, comments, and judicious use of diff. Full user documentation is coming next.

3.0b3 - 1/23/2004 - MAP
  • parse entire feed with real XML parser (if available)
  • added several new supported namespaces
  • fixed bug tracking naked markup in description
  • added support for enclosure
  • added support for source
  • re-added support for cloud which got dropped somehow
  • added support for expirationDate
3.0b4 - 1/26/2004 - MAP
  • fixed xml:lang inheritance
  • fixed multiple bugs tracking xml:base URI, one for documents that don’t define one explicitly and one for documents that define an outer and an inner xml:base that goes out of scope before the end of the document
3.0b5 - 1/26/2004 - MAP
  • fixed bug parsing multiple links at feed level
3.0b6 - 1/27/2004 - MAP
  • added feed type and version detection, result["version"] will be one of SUPPORTED_VERSIONS.keys() or empty string if unrecognized
  • added support for creativeCommons:license and cc:license
  • added support for full Atom content model in title, tagline, info, copyright, summary
  • fixed bug with gzip encoding (not always telling server we support it when we do)
3.0b7 - 1/28/2004 - MAP
  • support Atom-style author element in author_detail (dictionary of “name”, “url”, “email”)
  • map author to author_detail if author contains name + email address
3.0b8 - 1/28/2004 - MAP
  • added support for contributor
3.0b9 - 1/29/2004 - MAP
  • fixed check for presence of dict function
  • added support for full Atom content model in summary
3.0b10 - 1/31/2004 - MAP
  • incorporated ISO-8601 date parsing routines from xml.util.iso8601
3.0b11 - 2/2/2004 - MAP
  • added ‘rights’ to list of elements that can contain dangerous markup
  • fiddled with decodeEntities (not right)
  • liberalized date parsing even further
3.0b12 - 2/6/2004 - MAP
  • fiddled with decodeEntities (still not right)
  • added support for Atom 0.2 subtitle
  • added support for Atom content model in copyright
  • better sanitizing of dangerous HTML elements with end tags (script, frameset)
3.0b13 - 2/8/2004 - MAP
  • better handling of empty HTML tags (br, hr, img, etc.) in embedded markup, in either HTML or XHTML form (<br>, <br/>, <br />)
3.0b14 - 2/8/2004 - MAP
  • fixed CDATA handling in non-wellformed feeds under Python 2.1
3.0b15 - 2/11/2004 - MAP
  • fixed bug resolving relative links in wfw:commentRSS
  • fixed bug capturing author and contributor URL
  • fixed bug resolving relative links in author and contributor URL
  • fixed bug resolving relative links in generator URL
  • added support for recognizing RSS 1.0 in results['version']
  • passed Simon Fell’s namespace tests, and included them permanently in the test suite with his permission
  • fixed namespace handling under Python 2.1
3.0b16 - 2/12/2004 - MAP
  • fixed support for RSS 0.90 (broken in b15)
3.0b17 - 2/13/2004 - MAP
  • determine character encoding as per RFC 3023
3.0b18 - 2/17/2004 - MAP
  • always map description to summary_detail (Andrei)
  • use libxml2 (if available)
3.0b19 - 3/15/2004 - MAP
  • fixed bug exploding author information when author name was in parentheses
  • removed ultra-problematic mxTidy support
  • patch to work around crash in PyXML/expat when encountering invalid entities (MarkMoraes)
  • support for textinput/textInput
3.0b20 - 4/7/2004 - MAP
  • added CDF support
3.0b21 - 4/14/2004 - MAP
  • added Hot RSS support
3.0b22 - 4/19/2004 - MAP
  • map ‘channel’ to ‘feed’, ‘items’ to ‘entries’ in results dict (old keys still work)
  • changed results dict to allow getting values with results.key as well as results[key]
  • work around embedded ill-formed HTML with half a DOCTYPE
  • work around malformed Content-Type header
  • if character encoding is wrong, try several common ones before falling back to regexes (if this works, bozo_exception is set to CharacterEncodingOverride)
  • fixed character encoding issues in BaseHTMLProcessor by tracking encoding and converting from Unicode to raw strings before feeding data to sgmllib.SGMLParser
  • convert each value in results to Unicode (if possible), even if using regex-based parsing
  • re-added mxTidy support, but off by default; install mxTidy and set feedparser.TIDY_MARKUP=1 to enable it
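
The two results dict changes in 3.0b22 (old keys still working, and results.key access) can be sketched as a small dict subclass. This is an illustration under my own assumptions, not the class the parser ships: ResultsDict and its _aliases table are hypothetical names.

```python
class ResultsDict(dict):
    """Dict that maps old key names to new ones and allows attribute access."""

    # Old key -> new key, per the 3.0b22 notes.
    _aliases = {"channel": "feed", "items": "entries"}

    def __getitem__(self, key):
        return dict.__getitem__(self, self._aliases.get(key, key))

    def __contains__(self, key):
        return dict.__contains__(self, self._aliases.get(key, key))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


results = ResultsDict(feed={"title": "dive into mark"}, entries=["first post"])
old_style = results["channel"]["title"]   # old key still works
new_style = results.feed["title"]         # attribute access
```

One caveat of this sketch: results.items is still the built-in dict method, so the old 'items' key is reachable only as results['items'], not as an attribute.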

© 2001-8 Mark Pilgrim