On the road to an RFC for Atom aggregator behavior. The first part (HTTP level) is relatively easy to define since it’s not format-specific. In fact, it’s not even syndication-specific. A news aggregator is fundamentally an HTTP client, and there’s a lot of prior art on how HTTP clients are supposed to work. (Building a news aggregator but don’t want to read all that? Tough! You’re building an HTTP client; you need to read all that. At least read the section on handling HTTP response codes.)

The catch is that aggregators, by design, retrieve the same resources over and over without user intervention, so any errors in HTTP implementation tend to be magnified.
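Since the same feed gets fetched over and over, the cheapest request is the one that returns no body at all. Here is a minimal sketch of a conditional GET using Python's standard library. The names `build_request`, `fetch`, and the `ExampleAgg/0.1` User-Agent string are made up for illustration, and the sketch assumes the aggregator has stored the `ETag` and `Last-Modified` values from its previous successful fetch. It is not a complete client.

```python
import urllib.error
import urllib.request

# Hypothetical program name/version and home page, for illustration only.
USER_AGENT = "ExampleAgg/0.1 (+http://example.org/agg)"


def build_request(url, etag=None, last_modified=None):
    """Build a GET request that sends back the validators we saved last time."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch(url, etag=None, last_modified=None, opener=urllib.request.urlopen):
    """Fetch a feed; on 304, report 'not modified' and keep the old validators."""
    req = build_request(url, etag, last_modified)
    try:
        with opener(req) as resp:
            return {
                "status": resp.status,
                "body": resp.read(),
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; for an aggregator it is the
        # normal "nothing new" case, not a failure.
        if err.code == 304:
            return {
                "status": 304,
                "body": None,
                "etag": etag,
                "last_modified": last_modified,
            }
        raise
```

The `opener` parameter exists only so the function can be exercised without touching the network. A real aggregator would also need to persist the returned validators between runs.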

MUSTs:

SHOULDs:

MAYs:

Next up: the XML syntax level. Atom feeds are XML. Do you handle XML properly? Are you sure?

§

Sixty comments here

  1. 410.

    Ahh, it seems there is a use for the “deactivated” feature we added to RSS Bandit a while ago but never enabled in the UI. Thanks for the pointer.

    As for libraries that stupidly treat the less popular HTTP response codes as errors, I have a rant about that at http://www.kuro5hin.org/story/2003/5/3/23422/87859

    — Dare Obasanjo #

  2. A quibble: HTTP is not really a transport level protocol – TCP is the transport protocol. HTTP is the application protocol, or, if you really want to sound cool, the coordination language.

    Aside from that, thanks for encouraging good web citizenship.

    More recommended reading: Common User Agent Problems from the W3C (http://www.w3.org/TR/cuap)

    — matt #

  3. That all seems reasonable to me. Any special behaviour for 402s, or will that have to wait for Amazon’s new micropayment service? :)

    — Jim Dabell #

  4. anil dash's daily links (trackback)
  5. This is great info… can you consider this a feature request for your own feedparser.py: It would be damned handy if feedparser.parse() returned the http response code. If it is returned at the same level as ‘channel’, ‘etag’, ‘items’, etc, it shouldn’t break anything.

    — Ross #

  6. Great stuff. One addition for aggregator writers. It’s not enough to just handle these codes. For all the codes that are an error, please provide some obvious feedback to the user as well.

    — Julian Bond #

  7. Nice post. How about exponential backoff for when the server looks in trouble? On a 500 and 503 at minimum. May be overkill if the poll interval is long, like an hour, but if for some awful reason your aggregator is polling once a minute it may be worthwhile.

    — Anonymous #

  8. techno weenie (trackback)
  9. I don’t think HTTP is really a good transport for RSS/Atom items, for a few reasons. Basically, the thought of a thousand, or a million, aggregator clients all pounding the same HTTP server looking for new data every hour just seems wrong.

    POP3 and NNTP are better models for this sort of data than HTTP. Especially relevant is the fact that both those protocols forward data to where it eventually lives on a server close to you.

    Personally I think a new “Syndication Transport Protocol”, loosely based on NNTP, would be an excellent solution to a number of problems. For one, it would remove the bandwidth burden from the site publisher, spreading it fairly evenly around the Internet.

    With NNTP, I can post a news article using a very small amount of my own bandwidth, that a million people can read all using a relatively small amount of their own bandwidth. The content gets mirrored automatically so reachability isn’t a problem.

    You could do this on top of Atom or RSS now by simply creating a gateway and a new protocol. I’ve been experimenting with doing this with RSS; my own NNTP server is here: news://food.dhs.org (copy and paste that into the address bar; it works in Outlook Express, though I don’t fully support NNTP yet).

    Project page and source is here: http://www.gotdotnet.com/Community/Workspaces/Workspace.aspx?id=937b67a8-123d-4e11-ae71-a9d67aec8447

    I plan to put together a real RFC type proposal for something like this at some point, but for now it’s just mostly tossing ideas around. Still, I think it solves a lot of the problems that the current “one source of data, a million polling aggregators” model has.

    — Steve Tibbett #

  10. If an aggregator encounters a 404 or 410, it should probably try to load the root URL and then use RSS discovery to see if the file has changed name, and then update the feed URL in its config file automatically to point to the new URL (if it finds one).

    To give an example, say I have a blog at http://myblog.com/ and a feed at index.rdf. Then I change package and now have a feed at index.xml, with the old feed URL (index.rdf) going 404. The aggregator should fetch http://myblog.com/ and then discover the new feed at index.xml, and then update itself so that it uses the new URL.

    Does that make sense?

    — Neil T. #

  11. Our server at work seems to use 300 if you type in something *similar* to an existing page, and there are several similar pages (say, one with an .xml extension, one with a .php3 extension, and you typed in an .html extension, but that page doesn’t exist).

    — Elaine Nelson #

  12. I’d support persistent connections as well.
    Wouldn’t add much overhead for sites with a single rss file, but would make sense with mega hosting sites like radio.weblogs.com, livejournal.com, or sites with multiple feeds (nytimes?).

    — epc #

  13. Steve: have you looked at http://www.methodize.org/nntprss/ ?

    — Manuzhai #

  14. Re: NNTP-like protocol described by Steve Tibbett.

    I’m actually against that kind of idea, simply based on Usenet experiences. The issue I have is that those kinds of local storage techniques on a changing document tend to go out of date, or if there’s a relay mechanism, the update time can keep some users from getting things in a timely manner.

    If their source for something is constantly out of date, they’ll move up to a source closer to the original; eventually, they’ll be right back to pinging the original and the whole chain was a moot point.

    Yes, pinging on the hour is a pain in terms of repeated/unnecessary traffic (another suggestion not by me was a broadcast message when the feed changed instead), but pinging for an RSS/(not)Atom feed is certainly gentler on a server than pinging the web page itself (with all the full text, the graphics, etc) every hour looking for a change.

    — Joe Shelby #

  15. Mark: outstanding and timely post yet again. This has made me change my perspective about RSS clients and web integration. Thank goodness there are people thinking ahead of the curve; I’m learning more and more each day.

    Steve

    — Steve Kirks #

  16. Re: syndication over NNTP (or similar protocols). This has been discussed repeatedly, for instance here:

    http://www.advogato.org/article/651.html

    — Mark #

  17. Neil,
    Aggregators shouldn’t be automatically spidering sites and being abusive bandwidth hogs simply because of a 404 or 410. If the feed moves, then the site owner can use a 301 to redirect the feed, or the user can specifically request that the aggregator or some other software perform the link autodiscovery.

    — Dare Obasanjo #

  18. Dare,

    That assumes that all site owners have access to and/or the knowledge of how to use a 301 etc.

    Everyone can change a file and change the link text. I don’t have access to do more than that on my host. I don’t know how to if I did. Although I could learn. I suspect it would be beyond many people who just want to add some sort of syndication to their blog.

    — Adrian Sevitz #

  19. warwick@typepad (trackback)
  20. As a user, I don’t want a network of mirrors and servers delivering content. What happens when they lose syncronization (bad spelling)?

    As a web developer, I want a master document set that contains content destined for all browsers, whether that means desktop apps, PDA/handheld devices, phones or syndicated content readers.

    As an app developer, I want one set of web APIs that I can use to write apps for each platform, ensuring that the rendering experience is consistent across devices and platforms.

    ..ranting and rambling…

    — Steve Kirks #

  21. Interesting coincidence that the NNTP discussion is happening here today, ’cause it’s only been a few days since: http://www.flutterby.com/archives/comments/6357.html

    — Dan Lyke #

  22. As far as I know 500 is temporary & 404 is permanent. At least this is how I’ve seen it done with some eprocurement stuff bots.

    — another Mark #

  23. There is no need for “as far as I know” when a definitive resource exists.

    http://www.ietf.org/rfc/rfc2616.txt

    — Anonymous #

  24. epc said: “I’d support persistent connections as well.”

    No no no! The last thing a webserver needs is a ton of connections stuck in TIME_WAIT (or even worse, FIN_WAIT) state.

    I would say “support HTTP pipelining” but maybe that’s what this commenter meant.

    — Other Mark #

  25. “If an aggregator encounters a 404 or 410, it should probably try to load the root URL and then use RSS discovery to see if the file has changed name, and then update the feed URL in its config file automatically to point to the new URL (if it finds one).”

    Maybe, and only maybe, should it try this with a 404, as I can see people having the location changed and not knowing how to redirect. But it should never do this with a 410. 410 means gone. The feed is no longer in existence. This is not a default state, it must be deliberately set.

    — Lach #

  26. I added 2 notes, one giving possible solutions for 404, and added 403. RFC 2616 says this about 403: “Authorization will not help and the request SHOULD NOT be repeated.” That seems clear enough. (And why would a feed be forbidden? Well, Slashdot bans aggressive aggregators that poll more than twice an hour.)

    404 is more problematic; I’d say the client MAY back off the polling for a while (widen the polling interval), MAY unsubscribe immediately, and SHOULD unsubscribe after a certain timeframe with no change in status (but I don’t know how long that timeframe should be).

    — Mark #

  27. RFC 2616 has no opinion on whether 404 is temporary or permanent.

    — Mark #

  28. I’m all for spidering sites looking for feeds that suddenly go 404, as long as (you know this is coming)… the aggregator supports robots.txt when it does it. Hitting one URL over and over again that a user specifically requested is fine; auto-spidering in search of new content sounds to me like it ought to be subject to robots.txt rules. (A prerequisite of this is that the client sets the User-Agent properly.)

    I don’t think it rises to the level of a SHOULD. It’s a nice MAY though.

    — Mark #

  29. Also added a note about 307, which (as far as I can tell) should be treated exactly like 302.

    Anyone who has ever used HTTP code 307, raise your hand…

    — Mark #

  30. I think it is very important to emphasize that, as more and more spiders start crawling, 304 should be respected and conditional GETs implemented by every spider. For more info check out:

    http://fishbowl.pastiche.org/archives/001132.html

    This is very important for sites that host multiple blogs, and I have been thinking of creating a blacklist for those spiders that don’t honor conditional GET requests and crawl too much or too often.

    This issue crosses over into aggregators as the number of people using aggregators increases. Imagine several thousand users using aggregators that hit a site that hosts thousands of blogs.

    — Carl Garland #

  31. Perhaps support of the Expires or Cache-Control directives should also be required. This would allow an aggregator to determine the next time a feed could be polled for updates and could conceivably give the publisher easy control over how often their site is polled. (See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html for more information on these headers.) The use of the Expires and Cache-Control headers would seem to be in the same spirit as implementing the etag and if-modified-since headers. Expires is a nice header to use because it is simple and backward compatible with HTTP/1.0; cache-control is nice because it has more features.

    — Sam Greenfield #

  32. Off-topic I know but the revision history doesn’t show up the change when the <DEL> tags are added around ‘content-type’. The correction shows up as added, but the new strike-through formatting doesn’t show as changed.

    Just me being picky I suppose.

    — Felix #

  33. HTTP/1.1 is about four years older than Atom, so I don’t think “it was in HTTP/1.0” is something anybody should be interested in. I suppose Expires might be useful, as Phil Ringnalda put it a while ago, for “those rare sites that are updated by the clock, rather than by when content is available”.

    http://philringnalda.com/blog/2002/11/lies_damn_lies_and_mod_syndication.php

    — Daryl Oidy #

  34. Mark,

    I’ve experimented with 302/303/307s. There are some clients floating about that don’t understand 303/307 (I think Netscape 4.x is one of them), so they aren’t really used on websites.

    There’s no reason not to for Atom. The HTTP RFC states that a note with accompanying link SHOULD be provided for 307s, but makes it clear that this is due to legacy clients. I don’t think Atom needs to be worrying about legacy clients, and so I feel we would be justified in ignoring the SHOULD.

    I’ll also point out that 303 and 307 are better choices for redirects than 302, as they are unambiguous in the face of a changing method, and 302 is not.

    I would say that by serving Atom, you SHOULD provide temporary redirects with 303 or 307, but MAY provide temporary redirects with 302.

    Are there any HTTP implementations that make this difficult? I wouldn’t have thought so, but you never know…

    — Jim Dabell #

  35. Sorry for the phrase “As far as I know” – the rfc may be definitive but it does leave the interpretation of whether 404 & 500 are permanent or temporary up to the user agent.

    What I was trying to say is that I have seen implementations treating 404s as immediate failures and retrying 500s once an hour for 12 hours before giving up permanently. This makes sense to me, as does the reduced polling frequency on a 500.

    — another Mark #

  36. The Expires and Cache-Control headers could be very useful for aggregators. Most people are aware of them only as a way to PREVENT caching of pages, whereas with feed aggregators we want to PROMOTE caching, without compromising freshness, and this is what these headers were really intended for.

    They tell the reader when the resource in question should be rechecked for freshness (that is, a minimum time before bothering to send an If-Modified-Since or If-None-Match request). They do not guarantee that the resource WILL change at that time. (Neither do they forbid you from rechecking anyway, as when the user does Shift+Refresh or whatever.)

    If you post daily on average then you might as well set the Expires header on the feed to, say, 4 hours from your last post. If you post hourly during part of the day, sufficiently wily authoring software could calculate the Expires header automatically to express a shorter freshness interval during the active part of the day.

    If aggregators take note of Expires headers, then this allows for most of the features of the ‘poll-interval’ or ‘syndication info’ modules people have for RSS 1.0, and the skipHours or whatever it is in RSS 2.0.

    — Damian Cugley #

  37. “Transport level”? Heh. Please Do Not Throw Sausage Pizza Away.

    — Jesper #

  38. One question about the User Agent string: “…MAY include URL of program home page.”

    How about also “MAY include URL of program *operator’s* home page”? That is, I’d include the URL to my blog in my user agent in order to leave footprints in the logs of feed owners? Better than the previous hackish practice of dumping that URL into the referer.

    Though, I suppose putting it in the User Agent is pretty hackish as well.

    — l.m. orchard #

  39. according to:

    http://www.w3.org/Protocols/HTTP/HTRESP.html

    302 is actually “Found 302”. 307 is an under-used way to say “this page will redirect temporarily”. one would use 302 for something like /log -> /log/2003/07, which changes to /log/2003/08 when it’s august, but will always be /log. you’d use 307, for example, if you’re temporarily redirecting /index.html to /gone.html or something similar.

    that said, it’d be nice to see people using 3xx redirects per the spec, so my thanks for including them all here.

    — louis bennett #

  40. I meant HTTP/1.1 pipelining/persistency, not the HTTP/1.0 hack.

    I also support cache-control and expires as “shoulds”.

    — epc #

  41. l.m.: Using it in the Referer [sic] is totally going against its uses. User-Agent, though, is MADE to hold information about the program, app or script that’s fetching the information, so that’s a bit more semantically clean.

    — Jesper #

  42. Program name + version number SHOULD go in User-Agent, in the form “ProgramName/version”, e.g. “FooAgg/2.1”. This is straight from RFC 2616.

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.43

    Anything beyond that (unique subscriber ID, program home page) is optional. I mention program home page only because many anal-retentive overworked sysadmins will ban unknown user-agents first and ask questions later; giving the aggregator’s home page at least gives them a chance to identify it. Don’t assume everyone in the world (even the online world) knows what a news aggregator is, or recognizes your little program by name.

    Stuffing other information in User-Agent (such as subscriber home page or a unique per-subscriber ID) is beyond the scope of this RFC. There are privacy implications, and aggregator vendors and customers can work that out on their own.

    I’m mainly interested in stopping referrer abuse.

    — Mark #

  43. Louis,

    From that page you link to:

    “This is a historic document and is not accurate anymore. For up-to-date details on the HTTP specification, see the latest HTTP/1.1 drafts”

    I’m not even sure what distinction you are trying to make between the two codes. They are both temporary redirects. Directly from the RFC:

    “10.3.3 302 Found

    The requested resource resides temporarily under a different URI.”

    “10.3.8 307 Temporary Redirect

    The requested resource resides temporarily under a different URI.”

    — Jim Dabell #

  44. Mark:

    I’m using your feedparser module. How can I access the HTTP response code when retrieving a feed?

    I’m curious, because I would really want to implement what you suggest here

    — Ricardo Reyes #

  45. Ricardo: currently you can’t. :( But I’m working on version 3.0, which will expose the full set of HTTP headers (among other improvements).

    — Mark #

  46. Privacy issues aside (for the moment), there is a request header called “FROM”, RFC 2616 s14.22 describes it.

    Now, it does say it should, if given, contain an Internet e-mail address for the human user who controls the requesting user agent. SHOULD isn’t MUST though, so what about putting the user’s homepage there?

    It also says “In particular, robot agents SHOULD include this header so that the person responsible for running the robot can be contacted if problems occur on the receiving end.”

    — Eric Scheid #

  47. Sam Ruby (trackback)
  48. Reposted from http://www.intertwingly.net/blog/1526.html

    Kinda ironic that a post on application-level semantics has the word transport-level in the title. T is for Transfer, see: http://lists.w3.org/Archives/Public/xml-dist-app/2002Mar/0322.html

    You’re half way there, Mark. :-)

    — Arien #

  49. Pedants…

    — Mark #

  50. I wish that someone had a sample implementation out there in the language of their choice to take a look see at, that would make life much easier.

    — John Beimler #

  51. Rinconcito Sudaca (trackback)
  52. re: 503 – how long to wait before re-polling? That’s what the Retry-After header is for…

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.37

    — Eric Scheid #

  53. Notes written on the world's last-standing wall... (trackback)
  54. Pedants?

    — Arien #

  55. > Pedants?

    As in “The Pirates of …”

    — Mark A. Hershberger #

  56. Thanks for the reference. I have been meaning to implement a better HTTP engine in my aggregator, and this is just the type of definition I needed. Awesome!

    MyHeadlines, http://www.jmagar.com

    — Mike Agar #

  57. mnot's weblog (trackback)
  58. A very handy summary of the issues. I’ve linked it from http://rdfweb.org/topic/ScutterSpec since many of the issues are similar for FOAF aggregators.

    — Dan Brickley #

  59. The 4xx base class should have a specified handling, IMHO. The spec says you SHOULD NOT repeat a request that got a 400 without modifying the request, which suggests that the best behaviour already exhibited above for other response codes is to unsubscribe. I’d be interested to hear other people’s views …

    (Also, is there any chance of this list making its way into a permanent quotable document, or something on the Atom wiki or similar?)

    — James Aylett #

  60. Rinconcito Sudaca (trackback)
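Several comments above propose concrete polling behavior: back off exponentially on 500 and 503, honor Retry-After, stop polling on 403 and 410, and widen the interval on 404. Below is a rough sketch of how those suggestions could be combined into one decision function. Everything here is an assumption drawn from the thread rather than a settled rule: the function names, the one-hour base interval, the one-day cap, and the action labels are all hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

BASE_INTERVAL = 3600      # assumed normal polling interval: one hour
MAX_INTERVAL = 24 * 3600  # assumed cap on any backoff: one day


def retry_after_seconds(value, now=None):
    """Parse a Retry-After value, which is either delta-seconds or an HTTP-date."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: ignore it
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))


def backoff_interval(failures, base=BASE_INTERVAL, cap=MAX_INTERVAL):
    """Double the polling interval for each consecutive failure, up to a cap."""
    return min(cap, base * (2 ** failures))


def plan(status, failures=0, retry_after=None):
    """Return (action, seconds_until_next_poll) for one response.

    `failures` is the number of consecutive failed polls before this one.
    """
    if status in (200, 304):
        return ("ok", BASE_INTERVAL)
    if status == 301:
        return ("update_url", BASE_INTERVAL)   # permanent move: store the new URL
    if status in (302, 307):
        return ("follow_once", BASE_INTERVAL)  # temporary: keep the old URL
    if status in (403, 410):
        return ("stop", None)                  # do not repeat the request
    if status == 404:
        return ("back_off", backoff_interval(failures + 1))
    if status in (500, 503):
        wait = backoff_interval(failures + 1)
        hinted = retry_after_seconds(retry_after)
        if hinted is not None:
            wait = max(wait, hinted)           # never poll sooner than the server asked
        return ("back_off", wait)
    return ("report", BASE_INTERVAL)           # anything else: tell the user


print(plan(503, failures=0, retry_after="10800"))
print(plan(410))
```

Whatever the action, the aggregator should still tell the user what happened, as one commenter points out. How long a 404 may persist before unsubscribing is left open here, as it is in the thread.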


§


© 2001–9 Mark Pilgrim