dive into mark


Saturday, March 15, 2008

A dead fish in yesterday’s newspaper

Time to resurface a few good comments I made at Tim’s place last year:

> if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there’s a problem with a missing end tag, you do not want the system guessing what the message meant

You [Tim] have used this example, or variations of it, since 1997. I think I can finally express why it irritates me so much: you are conflating “non-draconian error handling” with “non-deterministic error handling”. It is true that there are some non-draconian formats which do not define an error handling mechanism, and it is true that this leads to non-interoperable implementations, but it is not true that non-draconian error handling implies “the system has to guess.” It is possible to specify a deterministic algorithm for graceful (non-draconian) error handling; this is one of the primary things WHATWG is attempting to do for HTML 5.

If any format (including an as-yet-unspecified format named “XML 2.0”) allows the creation of a document that two clients can parse into incompatible representations, and both clients have an equal footing for claiming that their way is correct, then that format has a serious bug. Draconian error handling is one way to solve such a bug, but it is not the only way, and for 10 years you’ve been using an overly simplistic example that misleadingly claims otherwise.
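
To make the distinction concrete, here is a minimal sketch of deterministic, non-draconian recovery for a toy tag language. Everything in it is an assumption made up for illustration: the grammar, the `parse` function, and the three recovery rules are hypothetical, and they are neither the HTML 5 parsing algorithm nor anything XML specifies. The only point is that once the recovery rules are written down, every parser that follows them builds the same tree from the same malformed input, so nothing is guessed.

```python
import re

# Toy grammar (hypothetical): <name> opens, </name> closes, anything else is text.
# Characters that fit neither pattern (such as a stray "<") are skipped.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def parse(text):
    """Parse the toy language into nested lists, recovering deterministically.

    Recovery rules (illustrative only):
      1. An end tag that matches an open element closes that element and
         everything opened inside it.
      2. An end tag that matches no open element is ignored.
      3. At end of input, every element still open is closed.
    """
    root = ["#root"]
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            # Character data goes into the innermost open element.
            stack[-1].append(chars)
        elif not closing:
            # Start tag: create a child node and descend into it.
            node = [name]
            stack[-1].append(node)
            stack.append(node)
        elif name in [node[0] for node in stack[1:]]:
            # Rule 1: pop until the innermost matching element has been closed.
            while stack.pop()[0] != name:
                pass
        # Rule 2: an unmatched end tag falls through and is ignored.
    # Rule 3: returning the root implicitly closes whatever is still open.
    return root
```

Under these rules, `<order><amount>2000000</amount></order>`, `<order><amount>2000000</order>`, and the truncated `<order><amount>2000000` all produce the same tree. Whether a trading system should accept the recovered tree is a separate policy decision; the sketch only shows that recovery does not have to mean guessing.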

And, in the same thread but on a different note:

I would posit that, for the vast majority of feed producers, feedvalidator.org *is* RSS (and Atom). People only read the relevant specs when they want to argue that the validator has a false positive (which has happened, and results in a new test) or a false negative (which has also happened, and also results in a new test). Around the time that RFC 4287 was published, Sam rearranged the tests by spec section. This is why specs matter. The validator service lets morons be efficient morons, and the tests behind it let the assholes be efficient assholes. More on this in a minute.

> A simpler specification would require a smaller and finite amount of test cases.

The only thing with a “finite amount of test cases” is a dead fish wrapped in yesterday’s newspaper.

On October 2, 2002, the service that is now hosted at feedvalidator.org came bundled with 262 tests. Today it has 1707. That ain’t all Atom. To a large extent, the increase in tests parallels an increase in understanding of feed formats and feed delivery mechanisms. The world understands more about feeds in 2007 than it did in 2002, and much of that knowledge is embodied in the validator service.

If a group of people want to define an XML-ish format with robust, deterministic error handling, then they will charge ahead and do so. Some in that group will charge ahead to write tests and a validator, which (one would hope) will be available when the spec finally ships. And then they will spend the next 5-10 years refining the validator, and its tests, based on the world’s collective understanding. It will take this long to refine the tests into something bordering on comprehensive *regardless of how simple the spec is* in the first place.

In short, you’re asking the wrong question: “How can we reduce the number of tests that we would need to ship with the spec in order to feel like we had complete coverage?” That’s a pernicious form of premature optimization. The tests you will actually need (and, hopefully, will actually *have*, 5 years from now) bear no relationship to the tests you can dream up now. True “simplicity” emerges over time, as the world’s understanding grows and the format proves that it won’t drown you in “gotchas” and unexpected interactions. XML is over 10 years old now. How many XML parsers still don’t support RFC 3023? How many do support it if you only count the parts where XML is served as “application/xml”?

I was *really proud* of those 262 validator tests in 2002. But if you’d forked the validator on October 3rd, 2002, and never synced it, you’d have something less than worthless today. Did the tests rot? No; the world just got smarter.

On a somewhat related note, I’ve cobbled together a firehose which tracks comments (like these) that I make on other selected sites. Many thanks to Sam for teaching me about Venus filters, which make it all possible. If you’ve been thinking “Gee, I just can’t get enough of that Pilgrim guy, I wish there were a way that I could stalk him without being overly creepy about it,” then this firehose is for you.


22 comments

  1. Open sourced code for that firehose coming soon? It’s quite awesome.

    Comment by Tom — Saturday, March 15, 2008 @ 3:19 pm

  2. http://firehose.diveintomark.org/filters/

    Here’s how it works:

    Like http://feeds.diveintomark.org/ , it aggregates feeds. Some sites (like Reddit) provide per-user feeds, so that should be easy. However, I generally write a filter for these anyway so I can munge titles or other elements for consistency. In the case of Reddit, the content of each entry doesn’t have Markdown applied (it’s just whatever you typed in, despite the fact that Reddit applies Markdown to every comment). So I wrote a filter that applies Markdown to each entry’s content before outputting it.

    Other sites (like my blog) only have one “recent comments” feed that has comments from all users. So instead of showing you the whole comment feed, the firehose filters them to only output stuff I wrote. Filters are written in Python and act on a normalized Atom entry, which I load into libxml2 and query with XPath. And, if necessary, munge with libxml2 functions as well.

    Comment by Mark — Saturday, March 15, 2008 @ 4:28 pm

  3. Oh, and http://firehose.diveintomark.org/config.ini shows how you associate a filter with a specific feed.

    Comment by Mark — Saturday, March 15, 2008 @ 4:31 pm

  4. Hmmm … part of me wishes I was popular enough to merit a stalker; then again, seeing the effort and frustration involved, probably not.

    Good hack though - I’m sure I’ll find a use for it in some mutated variant when I get the time.

    Comment by Dean P. — Saturday, March 15, 2008 @ 4:39 pm

  5. Great, thanks Mark.

    Comment by Tom — Saturday, March 15, 2008 @ 4:43 pm

  6. Oh, and those filters are open source, MIT license. I’ll add appropriate headers when I get a chance.

    Comment by Mark — Saturday, March 15, 2008 @ 4:46 pm

  7. I’ll gladly add those filters, and any theme, and/or any associated documentation you want, in Venus itself.

    I also think it would be worth brainstorming to see if something like Apache’s config.d directory under Ubuntu would make sense here. If there were a directory with a file per subscription, and the filter source itself could be included inline, this may ease maintenance.

    Comment by Sam Ruby — Saturday, March 15, 2008 @ 5:07 pm

  8. I understood very little of what you just said, but I’d be happy to brainstorm it with you, either here or over IM or whatever. Time permitting, I’d also be willing to write a “case study” of using Venus + filters in this manner, which could become part of the official documentation.

    Comment by Mark — Saturday, March 15, 2008 @ 7:25 pm

  9. Apache 2 allows you to do things like Include conf.d/*.conf.

    Your filters require quite a bit of expertise to develop. Let’s refactor them. Suppose Venus could be told to look at all the files that match a glob and __import__ them.

    What’s the simplest syntax such files could have?

    Comment by Sam Ruby — Saturday, March 15, 2008 @ 8:16 pm

  10. > What’s the simplest syntax such files could have?

    RDF?

    Comment by Mark — Saturday, March 15, 2008 @ 8:34 pm

  11. I haven’t looked too deeply into Venus filters but it seems that Plagger has better identified the problem. Too bad it’s in Perl.

    http://plagger.org

    Comment by elided — Saturday, March 15, 2008 @ 9:13 pm

  12. More seriously, writing filters in Python would be a heck of a lot easier if you passed the relevant entry from the feedparser results dict instead of an XML serialization of same. No libxml2, no XPath, just some if statements and Python string manipulation.

    Comment by Mark — Saturday, March 15, 2008 @ 9:17 pm

  13. On a related note: Man, do I ever wish I could sanely and easily dump my feeds into a SQL database (and not lose any of the semantics of the feed). From there I could reprocess them, output them however I want (e-mail, my own web based feed reader), search them, etc. But it’s a serious pain to try and normalize RSS and Atom feeds and then jam them into a database in normal form.

    Comment by elided — Saturday, March 15, 2008 @ 9:20 pm

  14. Look in planet/spider. Search for “filters”. At that point, data is the entire feed, and entry is the specific entry.

    Comment by Sam Ruby — Saturday, March 15, 2008 @ 9:29 pm

  15. Hmm. Perhaps we could call XSLT filters with an XML serialization, and Python filters with a Python dict? Might get ugly if a feed had both XSLT and Python filters, though. Maybe we could just dictate that each feed can only have one kind of filter?

    Comment by Mark — Saturday, March 15, 2008 @ 10:25 pm

  16. Here is how I follow you in friendfeed http://friendfeed.com/users/ed196fd4-f306-11dc-97e6-003048343a40

    Comment by Shakeel Mahate — Saturday, March 15, 2008 @ 11:30 pm

  17. Plugins (.plugin) differ from filters (other extensions, like .py and .xslt) in that while filters are forked, plugins are imported.

    Immediately before “# apply any filters” could be an “# apply any plugins”. Both sections of code could iterate over config.filters(feed_uri), and both can check to see if filter.endswith('.plugin') and act appropriately.

    To date, I’ve only written, and am only aware of, template filters (executed by planet.splice.apply). This means that a change to spider, such as the one described above, would not likely affect anybody.

    Comment by Sam Ruby — Sunday, March 16, 2008 @ 6:13 am

  18. Actually, I was originally thinking about something more fundamental than this.

    With your firehose, adding a subscription requires you to update one file too many in my opinion. Ideally, the python logic would be in the configuration file itself, either that or Venus would be able to query the filter to get the uri, name, Hackergotchi or other subscription data.

    Comment by Sam Ruby — Sunday, March 16, 2008 @ 9:42 am

  19. Having worked on Wall Street, it scares the bejeesus out of me how these people communicate. Never mind the automated straight-through processing system talking to the other automated exchange system; the email that started all this probably reads something (vaguely) like

    “> FORD OK IF YOU COVER THE UNDER 22 ON THE 08/11 CTRCT.

    DONE”

    The net result of which is an order gets placed for a complex six million dollar trade in Ford debt on the Euro CDS market.

    I’m not kidding. Traders have very short attention spans; that’s how they communicate. As I say, it scares the crap out of me.

    Comment by Miles Thompson — Tuesday, March 18, 2008 @ 7:23 pm

  20. You have won me over to your side of thinking. I have changed my position on error handling because of your logic (not that I’m some XML guru, but changed minds are unusual for anyone).

    Comment by Tycho Martin Clendenny — Wednesday, March 19, 2008 @ 6:00 pm

  21. What exactly is draconian error handling? My (possibly incorrect) understanding of the term is: reject input that doesn’t follow the specification.

    If you have “a deterministic algorithm for graceful (non-draconian) error handling”, then it’s not really error handling in the traditional sense. You’ve changed the specification to allow, for example, omitting the end tag in certain situations. You still want consumers to follow the spec to the letter and reject input that doesn’t conform. One disadvantage is that you’ve now defined a more complicated spec, which will be harder to implement correctly (though maybe it’s worth it).

    And with these additional “normalizing” rules, it’s more likely that an input will be misinterpreted. I feel a little lame bringing up this example again, but let’s say a program intending to write “200000” gets its output truncated somehow, which I don’t think is far-fetched. I wouldn’t want a consumer to see “200” and apply the “implicit close tag” rule here, deterministic or not.

    Maybe implicit close tags are a bad example. Maybe explicit end-of-data delimiters are just one of those good ideas that you shouldn’t screw around with. Then again, it’s a common error recovery technique in HTML and I really don’t know the kinds of rules people have in mind for XML error recovery.

    Comment by Kannan Goundan — Monday, March 24, 2008 @ 12:54 am

  22. Pingback by links for 2008-03-24 at a convenient truth
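
The author-only filter described in comment 2 can be sketched roughly as follows. This is not the firehose's actual code: it swaps libxml2 and XPath for the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies, and the function names and the `"Mark"` author value are hypothetical. It also assumes, as I understand Venus's filter convention, that a filter receives one normalized Atom entry on standard input and that writing nothing to standard output drops the entry.

```python
import sys
import xml.etree.ElementTree as ET

# Namespace prefix for Atom elements in ElementTree's "{uri}tag" notation.
ATOM = "{http://www.w3.org/2005/Atom}"


def keep_if_author(entry_xml: bytes, wanted_name: str) -> bytes:
    """Return the entry unchanged if one of its authors is wanted_name, else b""."""
    entry = ET.fromstring(entry_xml)
    # Look only at atom:author children of the entry itself.
    names = [
        (el.text or "").strip()
        for el in entry.findall(f"{ATOM}author/{ATOM}name")
    ]
    return entry_xml if wanted_name in names else b""


def main() -> None:
    # Assumed convention: entry in on stdin, entry (or nothing) out on stdout.
    # "Mark" is a placeholder for whatever name the comment feed uses.
    sys.stdout.buffer.write(keep_if_author(sys.stdin.buffer.read(), "Mark"))

# A real filter script would end by calling main().
```

Passing the entry through byte-for-byte when it matches keeps the filter from re-serializing (and possibly altering) markup it has no reason to touch.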

© 2001-8 Mark Pilgrim