dive into mark


Wednesday, January 22, 2003

Parse at all costs

My latest article is up at XML.com: Parsing RSS at all costs.

The point of XML is that content producers are supposed to put up with the pain of XML formatting rules so that content consumers can do cool things with off-the-shelf tools. Well, guess what? It’s not happening. Judging by the sad state of affairs in the RSS world, content producers are either ignorant of the error of their ways, or too lazy to fix the errors, or too busy, or locked into inflexible tools whose vendors are too busy… Whatever the reasons, content consumers are rarely in a position to solve the problem. So we must work around it. We need a parse-at-all-costs RSS parser.

I know, I know, this is how HTML got to be tag soup: browsers that never complained. Now the same thing is happening in the RSS world because the same social dynamics apply. End users who can’t even spell XML certainly don’t care about silly little formatting rules; they just want to follow their favorite sites in their news aggregator. When 10% of the world’s RSS feeds are not well-formed — including some high-profile feeds that thousands of people want to read — the ability to parse ill-formed feeds becomes a competitive advantage. (And if you think the same thing won’t happen when RDF and the Semantic Web go mainstream, you’re deluding yourself. The same social dynamics apply. Boy, is that going to be messy.)

As the article notes, there is also a social solution to this problem. You can register at Syndic8 as a fixer and volunteer your time contacting vendors and individual content producers to get them to fix their RSS feeds and the software that produces them. (You can use the RSS validator to help make your case.) It’s a strictly volunteer effort, and I’m sure they would appreciate your help.

But realistically, there will always be invalid feeds (one curly apostrophe is all it takes), and there will always be apathetic end users. Which means that for vendors of RSS-consuming software, the ability to parse tag soup will always be a competitive advantage. You can sit around wishing it were different all you like, but there it is.
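To make the "one curly apostrophe" point concrete, here is a minimal sketch in Python. The feed fragment is made up for illustration. It assumes the common failure mode where a Windows-1252 curly apostrophe (byte 0x92) lands in a feed that has no encoding declaration, so a conforming parser has to read it as UTF-8 and reject it.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment. The title contains a Windows-1252 curly
# apostrophe (byte 0x92), but nothing declares that encoding, so the
# parser treats the bytes as UTF-8, where a lone 0x92 is invalid.
BROKEN_FEED = b"<rss><channel><title>Mark\x92s feed</title></channel></rss>"

# The same fragment with a plain ASCII apostrophe.
CLEAN_FEED = b"<rss><channel><title>Mark's feed</title></channel></rss>"


def is_well_formed(data: bytes) -> bool:
    """Return True if a strict XML parser accepts the document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(CLEAN_FEED))
print(is_well_formed(BROKEN_FEED))
```

A strict parser has no partial credit: the second feed is rejected whole, even though every tag in it is balanced.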

I do, however, acknowledge the irony. It takes balls to write an article for XML.com demonstrating how to parse an XML-based format without an XML parser.
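For readers who want a feel for what "parse at all costs" can mean in practice, here is a rough sketch. It is not the code from the XML.com article; the function name `parse_items`, the regular expressions, and the choice to guess Windows-1252 for undecodable bytes are all assumptions made for illustration. The idea is simply to try a real XML parser first and fall back to pattern matching when the feed is not well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Deliberately loose patterns for the fallback path (illustrative only).
ITEM_RE = re.compile(r"<item\b[^>]*>(.*?)</item\s*>", re.I | re.S)
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.I | re.S)
LINK_RE = re.compile(r"<link\b[^>]*>(.*?)</link\s*>", re.I | re.S)


def _first(pattern, text):
    """Return the first captured group, stripped, or an empty string."""
    match = pattern.search(text)
    return match.group(1).strip() if match else ""


def parse_items(data: bytes):
    """Return a list of {"title", "link"} dicts from an RSS document."""
    # First choice: a strict XML parser. This handles un-namespaced
    # <item> elements (RSS 0.9x/2.0 style); namespaced formats such as
    # RSS 1.0 would need namespace-aware lookups.
    try:
        root = ET.fromstring(data)
        return [
            {
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
            }
            for item in root.iter("item")
        ]
    except ET.ParseError:
        pass

    # Fallback: guess an encoding, then pull items out with regexes.
    # Entities are NOT unescaped on this path.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode("cp1252", errors="replace")
    return [
        {"title": _first(TITLE_RE, body), "link": _first(LINK_RE, body)}
        for body in ITEM_RE.findall(text)
    ]
```

Called on a feed with a stray curly apostrophe and an unescaped ampersand in a link, the strict path fails and the fallback still returns the item. The trade-off is that the fallback path is approximate: it ignores CDATA sections, comments, and nested markup, which is part of the irony.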

On a related note, I have accepted an invitation to join the Web Standards Project, where I will advocate to developers the benefits of adhering to web standards, write tutorials on how to use standards properly, and work directly with vendors to ensure that their products are as standards compliant as possible. This leaves unanswered the obvious question of how many people turned down the invitation before they got to me, but it’s an honor nonetheless, and I’m thrilled to accept it.

Update: some reactions to the XML.com article:

Re: Benefits and harms are not evenly distributed

But if a public newsreader did not parse the RSS instead returning a broken message to the clients of said feed then would this not create direct and immediate pressures on feed authors and sites to produce valid xml.

My response:

End-user perspective

No. You are punishing the wrong people. You are still operating under the mistaken impression that XML, in and of itself, is important. It is not. It is a means to an end. End users don’t care. And they shouldn’t have to care.

Look, I was in this position: I tried several first-generation news aggregators that only used real XML parsers. Feeds would go unreadable for days at a time, and by the time they came back I had missed dozens of articles. I tried to switch to another aggregator that would let me follow the sites I wanted to follow, but none satisfied me, so I ended up writing the parse-at-all-costs RSS parser and building a homegrown aggregator around it for my own use.

And I’m technically inclined. I care about XML. Imagine the reaction of an end user who isn’t, and doesn’t. They bought (downloaded/whatever) a program that purports to help them read all the news and follow all the sites that they care about. They like this idea. Then they find out that sometimes it doesn’t work, sometimes sites that worked yesterday don’t work today, and some sites don’t work at all, because of something called XML. They don’t know from XML, they’ve never seen XML, they don’t care about XML, but this stupid POS program is complaining and saying there’s nothing it can do about this XML problem and suggesting, in its infinite wisdom, that the end user should take it upon themselves to work around this problem by sending an email to the site owner and waiting an indeterminate length of time before they can read the news they care about, if ever.

You’re kidding, right?

Then the user hears about another aggregator, a direct competitor, which claims to be able to let them follow all the sites they care about. It doesn’t complain; it doesn’t whine; it doesn’t suggest that they work around the developer’s laziness by firing off emails to random people they’ve never met. It just works.

Which would you choose?

Content producers have a natural monopoly on their own content. If I want to read The Register in an aggregator, there’s only one legitimate source. News aggregators, on the other hand, are close to commodities. They compete on features. Feature #1 is being able to download and parse content. If you can’t parse the content the user wants to read, features 2, 3, and 4 don’t make any difference.



© 2001-8 Mark Pilgrim