Continuing yesterday’s discussion, here are some tests for proper HTTP support in aggregators. Comments and bug reports welcome.
§
Semi-related: we’ve set up some example RSS feeds with the various combinations of SSL and HTTP Basic Authentication for anyone who’s testing support in aggregators: http://labs.silverorange.com/archives/2003/july/privaterss
Not really a bug and not relevant for testing, but the Reason Phrase in your Response Status-Line always reads ‘OK’ (did not check all tests).
It would be nice if it made more sense. For example, the Response Status-Line of the 410 test is now ‘HTTP/1.x 410 OK’ and could be changed to ‘HTTP/1.x 410 Gone’. However, aggregators should not do anything with the field anyway…
(you didn’t print the ‘OK’ on purpose because the tests *should* return these status codes, did you?)
— Martijn
Very nice. Thanks, Mark.
— jacob
Related question: What does supporting ETags AND Last Modified buy you over just supporting one?
My MT plugin only implements Last-Modified checks. It was never clear to me what I was missing by not supporting ETags.
Timothy – from an aggregator perspective, it is pretty important to support both Last-Modified and ETag. Some web servers or scripts generate only ETag or Last-Modified headers – not both. If you were to only support one within your HTTP client logic, there’s a great chance that you’ll operate unoptimized on a number of sites.
Some servers support Etag, some support Last-Modified. Apache seems to include both headers on my feeds, but many custom scripts (such as PHP feed-generating scripts) may only include one or the other. Clients should support both.
— Mark
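To illustrate what "support both" can look like on the client side, here is a minimal sketch in Python. It stores whichever validators the server sent and replays them on the next poll. The function names and the `saved` dictionary are hypothetical and not taken from any particular aggregator.

```python
def remember_validators(response_headers):
    """Keep whichever cache validators the server sent (either, both, or none)."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    saved = {}
    if "etag" in lowered:
        saved["etag"] = lowered["etag"]
    if "last-modified" in lowered:
        saved["last_modified"] = lowered["last-modified"]
    return saved


def conditional_headers(saved):
    """Build the request headers for the next poll from the saved validators."""
    headers = {}
    if "etag" in saved:
        headers["If-None-Match"] = saved["etag"]
    if "last_modified" in saved:
        headers["If-Modified-Since"] = saved["last_modified"]
    return headers
```

A client that only handled one of the two headers would send an unconditional request to every server that emits only the other one.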
I think the Status line thing is fixed now.
— Mark
For those of us who’ve home-brewed their blogs, is there a site where we can test if our feeds are properly tossing out the right headers, etc?
Ian: ah, you want the *server* aggregator HTTP tests. Those are coming up.
— Mark
Ian: In the meantime, you might be interested in looking at http://www.mnot.net/cacheability/
Ian: there’s also this tool for checking the headers of any url (although it doesn’t tell you the Etag or Last-Modified):
http://www.webmasterworld.com/stickymail.cgi?action=headers
How about 401 for Unauthorized. I have written an aggregator for a client that had feeds in protected directories, one for each department. I would have liked to use ASP.NET’s automatic Windows security to provide authentication, although the client preferred it this way.
Another online tool for checking HTTP headers is http://www.delorie.com/web/headers.html – which also gives you the ETag, if present.
Personally, I use wget for this: wget -S --delete-after http://www.example.com does the job nicely. If you haven’t got a version on your system already, see http://www.gnu.org/software/wget/wget.html
— Arve
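For anyone who would rather script the check than read the header dump by eye, here is a small sketch in Python. It scans a block of response headers, such as the ones a header-checking tool prints, and reports the two cache validators. The function name and the sample text are made up for illustration.

```python
def find_validators(header_dump):
    """Return the ETag and Last-Modified values found in a dump of headers."""
    found = {"ETag": None, "Last-Modified": None}
    for line in header_dump.splitlines():
        # Split on the first colon only; date values contain colons too.
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # status line or blank line, not a header
        for wanted in found:
            if name.strip().lower() == wanted.lower():
                found[wanted] = value.strip()
    return found
```

A value of `None` for both entries means the feed gives clients nothing to make a conditional request with.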
Thanks for the thoughtful, effective leadership!
I think there’s a danger here of simply putting an HTTP tutorial into the Atom RFC.
Shouldn’t this kind of stuff be factored out into a “best practices” document of some sort?
A few links to help you.
And the famous, but unfortunately not well supported, Content-Location
and
Yes, and an XML tutorial, and an HTML tutorial, and so on…
However, given the historical lack of compliance with even the most basic HTTP, XML, or HTML consumption practices, a few words of advice would not be out of line. And there’s nothing wrong in general with taking general rules (like RFC 2616) and rewriting them as specific rules that go into greater depth about how a specific class of applications should behave. And writing test cases that cater to that specific class.
Likely it would go in its own RFC. But it’s something that needs to be written, and it needs to be written specifically for aggregators, otherwise the entire class of applications will (based on previous experience) simply ignore the rules out of ignorance.
Simple question: how many aggregators pass all these tests today? Based on my logs, very very few. So there’s obviously a need here.
— Mark
NewsGator’s report card: http://www.rassoc.com/gregr/weblog/archive.aspx?post=629
Greg: Etag and Last-Modified tests are fixed. Thanks.
— Mark
Cool, working well now, as is your test feed which requires authentication.
Not sure about the language of some of these – I see a lot of MUSTs where SHOULD would be much more appropriate. Some of the musts are valid but you seem a bit keen on them. Ditto SHOULDs and MAYs.
Some feeds return 404 Not Found occasionally, including the BBC News, I guess while a static file is rebuilt, though it could be anything. Polling less often isn’t a great idea in these cases (if anything it should try again at least once a minute or two later).
I guess what I’m saying is, the server isn’t always right, and the instructions here don’t let clients make allowances for that.
Specific suggestions would be more helpful than general complaints. If anything, I would say that SHOULDs ought to be escalated to MUSTs because of the automated, repetitive nature of aggregators.
For example, polling 404s every minute just in case they come back is, without a doubt, the worst idea I’ve ever heard.
— Mark
Maybe it’s just me, but I read “try again … once a minute or two later” means “poll again _once_ after a minute or two have elapsed”, not “poll _every_ one or two minutes”.
Seems to me the real problem is that there isn’t a dedicated “temporarily absent” code. Maybe it’d be more appropriate for the BBC to issue a 500 in this case? Although ideally it shouldn’t be going absent at all…
(BTW, “the worst idea I’ve ever heard” seems a bit harsh to me.)
“if anything it should try again at least once a minute or two later”
At least once a minute or two later. Yeah. And there should be a record of which feeds are ‘dead’ too (polled at least five to ten times and not alive).
— Jesper
re: “worst idea i’ve ever heard”
You should see my logs. Full of crawlers that can’t take a hint. I said 403 and I meant it; no need to try again, folks! Full of aggregators requesting feeds over and over that “permanently” redirected months ago, or that went 404/410 months ago. Now on top of that I should deal with clients second-guessing me?
Clients *have* to assume that servers know what they’re doing. That’s what standards are for. 301 means permanent redirect. 410 means permanently gone. Now, the standard doesn’t say whether 404 is permanent, or how long it might last, but surely the answer on the client side is not *more frequent* polling.
If your ex-girlfriend won’t take your calls, the correct answer is *not* to call her every 2 minutes hoping she changes her mind.
— Mark
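The status-code rules argued for above can be sketched as a small lookup in Python. This is only an illustration of the position taken in this thread (301 is permanent, 410 is gone, 404 never justifies faster polling); the `Action` names are invented here, and any status not discussed is deliberately left unspecified.

```python
from enum import Enum


class Action(Enum):
    PARSE = "parse the feed"
    KEEP_CACHED = "nothing changed, reuse the cached copy"
    UPDATE_URL = "store the new URL and use it from now on"
    STOP_POLLING = "stop requesting this URL"
    NO_FASTER_POLLING = "keep the normal schedule or slow down"
    UNSPECIFIED = "not covered by this sketch"


def action_for_status(status):
    """Map an HTTP status code to what a polite aggregator does next."""
    if status == 200:
        return Action.PARSE
    if status == 304:
        return Action.KEEP_CACHED
    if status == 301:
        return Action.UPDATE_URL        # permanent redirect
    if status == 410:
        return Action.STOP_POLLING      # permanently gone
    if status == 404:
        return Action.NO_FASTER_POLLING  # duration unknown
    return Action.UNSPECIFIED
```

How the client presents any of this to the user is a separate question; the sketch only covers what happens on the wire.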
Use Mozilla to watch headers with the Live HTTP Headers tool. It’s a simple tool to use.
RE: “For example, polling 404s every minute just in case they come back is, without a doubt, the worst idea I’ve ever heard.”
Agreed. This is a very bad idea and is tantamount to building a Denial of Service client.
I’m stunned that anyone could see any way this could be considered a useful feature and not a glaring bug if an application misbehaved in such a manner.
In your analogy 404 is more like no one answering. You can’t assume she hates you when experience tells you people occasionally leave the house.
Like people have said for me, I’m only suggesting clients should check once or twice a minute or two later, not every minute until it comes back. It’s just in my experience, when a feed returns a 404, if you try again a moment later it’ll be back. What’s wrong with incorporating that experience into my client? Obviously the feed should never do this, but it does, and the client should know how best to deal with it. It’s called robustness.
A specific example of must/should etc:
The HTTP RFC says for 410: “Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval”. You say MUST unsubscribe, and don’t imply user approval. I’d say they MUST inform user asap, and MAY unsubscribe. Users stop using apps that don’t let them be in control.
Be in control? The feed is gone, and the server operator has taken the time to set the appropriate HTTP error code to indicate that it’s gone for good and it’s never coming back. What more do you want? Flowers?
I don’t care what the client actually does in GUI-space, or how it breaks the news to the end user. Just stop hitting my server.
— Mark
Graham,
So when you call someone’s house and they don’t answer do you keep calling back every minute perhaps with the help of a war dialer or some other automated mechanism?
What you describe is malicious behavior and I’d be sure to ban any client that attacked my server in such a manner.
“If your ex-girlfriend won’t take your calls, the correct answer is *not* to call her every 2 minutes hoping she changes her mind.”
I think the message meant “check one or two times at least”, which by no means, last time I checked, means “and repeat ad nauseam”. It *could* be a high-traffic site whose feed is being hit repeatedly and goes 404. I’ve seen it happen. I of course agree with you that there should be some sort of upper limit. Three tries should be more than enough.
— Jesper
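The bounded retry being discussed (try again a minute or two later, with a hard upper limit) might look roughly like this in Python. The 90-second delay and the default of three retries are illustrative assumptions drawn from the numbers mentioned in the comments, and `fetch` stands in for whatever function the client uses to request the feed.

```python
import time


def fetch_with_bounded_retries(fetch, max_retries=3, delay_seconds=90,
                               sleep=time.sleep):
    """Call fetch() once, then retry a 404 at most max_retries times.

    fetch is any callable returning an HTTP status code. Returns the last
    status and the number of requests made, which never exceeds
    1 + max_retries.
    """
    status = fetch()
    requests_made = 1
    while status == 404 and requests_made <= max_retries:
        sleep(delay_seconds)  # wait before bothering the server again
        status = fetch()
        requests_made += 1
    return status, requests_made
```

Once the limit is reached the client gives up until its next regularly scheduled poll, so a feed that stays 404 costs the server a fixed number of extra requests rather than an open-ended stream.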
Mark, I agree completely and I am setting my client to stop automatically on a 410, and I’d call it a must in dictionary-defined lowercase terms. But if you’re using RFC-defined uppercase terms, you need to be more careful. From RFC2119: “In particular, [these terms] MUST only be used where it is actually required for interoperation or to limit behavior which has potential for causing harm.” Unnecessary reloading counts as harm, but that’s covered by SHOULD’s “the full implications must be understood and carefully weighed before choosing a different course.” As long as it’s clear that still retrying after a 410 is a boneheaded thing to do, SHOULD is very much appropriate.
Dare, I’ve already said that’s not what I’m suggesting.
— Graham
Graham, I’m sorry I misinterpreted you earlier. You make a convincing argument for using SHOULD instead of MUST on 410, and I’ve updated the test suite accordingly.
— Mark
Also consider the edge case of it *not* being your server which is responding to the request … say your hosting service screws up and your entire site goes 404 for a while … once the screaming dies down you then have a huge bunch of subscriptions who have been unceremoniously dumped.
Hands up those ppl here that have ever had a hosting service screw up.
Eric– Mark’s not dictating that all record of there ever being a subscription on a URL that is now 404 be removed. He’s asking that agents not repeatedly bang the server.
The answer, in UI-space, then, might be to disable any feeds which are 404′ed, and allow the user to reactivate them. Or somesuch. You could make up complicated rules on the client to make a nice experience without punishing the server.
Why isn’t respect for robots.txt considered important for aggregators? I think they straddle the line between user agent and bot, and erring on the side of caution is probably a good policy…
Do ANY aggregators support it?
Ross: There has been some confusion about my robots.txt position, despite my absolute consistency on the matter. It has always been my position that agents that simply fetch a single resource over and over do not need to respect robots.txt. It is only when they follow the links found in that resource and recursively download resources that robots.txt comes into play.
For example, I have blocked wget in my robots.txt file. But you can still issue the command:
$ wget http://diveintomark.org/
and it will download my home page. However, if you say
$ wget -r http://diveintomark.org/
it will download my home page, *then* download robots.txt and see that it is blocked from *following* links, and quit.
This translates to aggregators like this. Suppose the aggregator has an option to follow the [link] elements of an RSS feed and download the (HTML) permanent archive pages of each [item]. It should *always* download your RSS feed, but before following those [link] URLs, it should check with robots.txt.
Anyway, all that should definitely go in an RFC, but not this one. Probably the one about auto-discovery, which involves a similar situation: an aggregator downloading the home page (at user’s explicit request to “subscribe” to the site), and then following a link in that resource (specifically the [link] element of the [head]) to find the feed. This second step should also respect robots.txt, and my rssfinder.py module does this.
— Mark
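The distinction drawn above (always fetch the feed itself, but consult robots.txt before following links found inside it) can be sketched in Python with the standard library's `urllib.robotparser`. This is not the code from rssfinder.py; the function name, the user-agent strings, and the example URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def links_allowed_by_robots(robots_txt, user_agent, links):
    """Return only the links that robots.txt lets this user agent follow.

    robots_txt is the text of the site's robots.txt, already downloaded.
    The feed URL itself is never passed through this check; only the
    links discovered inside the feed are.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in links if parser.can_fetch(user_agent, url)]
```

For example, with a robots.txt that disallows everything for one agent, that agent gets an empty list back while other agents keep all their links:

```python
robots = "User-agent: wget\nDisallow: /\n"
links = ["http://example.com/archives/1.html"]
links_allowed_by_robots(robots, "wget", links)               # no links
links_allowed_by_robots(robots, "ExampleAggregator", links)  # same list back
```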
That was helpful, thanks.
— Ross
§
© 2001–9 Mark Pilgrim