Good news, everyone. NewsMonster rediscovers the A-list for you. All joking aside, it’s good to see some more independent research in this area. I fiddled with this a bit last year (1, 2, 3, 4, 5, 6, 7). Radio has a feature called the weblog neighborhood. There is also free blogging ecosystem data that could minimize the need for spidering.
In related but not so good news, it appears that NewsMonster will not be respecting robots.txt, an issue I brought up yesterday. In fact, it appears that the next version of NewsMonster will encourage its users to lie in the User-Agent field to circumvent attempts to block it, in much the same way it now lies in its Referer field to encourage website developers to click through from their referrer logs to the NewsMonster site (a form of logfile spam that more respectable aggregators have already stopped).
Hmm.
Lee Killough: How to Defeat Bad Web Robots With Apache.
Further investigation shows that NewsMonster will download any URL listed in the link element of your RSS items; it doesn’t sanity-check that it’s in the same domain as the RSS feed. This gives me an idea.
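The missing sanity check can be sketched in a few lines of Python. This is only an illustration, not NewsMonster's code: the function names and the example.org URLs are hypothetical, and it assumes "same domain" means an exact hostname match, which is stricter than some aggregators would want.

```python
from urllib.parse import urlsplit


def same_host(feed_url, link_url):
    """Return True when link_url points at the same host as feed_url."""
    feed_host = urlsplit(feed_url).hostname
    link_host = urlsplit(link_url).hostname
    # hostname is None for relative or malformed URLs; treat those as a mismatch.
    return feed_host is not None and feed_host == link_host


def links_worth_fetching(feed_url, item_links):
    """Keep only the item links that live on the feed's own host."""
    return [link for link in item_links if same_host(feed_url, link)]
```

With a check like this, a feed at http://example.org/index.rss could not steer the aggregator into bulk-downloading pages from some unrelated host.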
Here’s an up-to-date list of common User-Agents in case you’re puzzled by your access logs.
The Register: Watch out for those malicious referrer links. Last year I stumbled upon a similar bug in Manila (since closed), but since then there’s been a proliferation of little third-party scripts to show referrers inline. I had to deal with the same issues when I displayed automatic linkbacks. It seems that every new script, every new browser, every new aggregator — they all make the same stupid mistakes as the last one (or in some cases, worse mistakes). I’m all for reinventing the wheel if you just want to learn about wheels, but we don’t seem to be gaining any cumulative knowledge in the process.
§
Lying about the User-Agent is really not in the same league as the other crimes. If someone was using a version of NewsMonster that didn’t have spidering turned on and was blocked anyway, it really makes sense to lie about the user agent. Konqueror makes it really easy to lie about your browser because many sites restrict pages based on User-Agent in cases where it’s not warranted. It’s only because of the primary bug that it’s annoying.
Yes, I know lying in the User-Agent field is a long-established technique for alternate browsers. Opera, Safari, iCab, OmniWeb. (Actually IE started it, years ago, by masquerading as a Netscape-“compatible” browser.) But they still contain a marker of their true identity. Kevin is proposing an option to completely hide NewsMonster’s identity, masquerading as the standard Mozilla, which it obviously isn’t. And alternate browsers exhibit the same behavior as the browsers they spoof, but NewsMonster (in bulk-download mode) acts significantly differently than standalone Mozilla.
— Mark
So you would agree that it would be reasonable for NewsMonster to misrepresent itself as Mozilla if it was blocked by a site AND bulk-downloading was turned off?
Even in the default (non-bulk-downloading) mode, it should identify itself with a NewsMonster-specific marker. This can be part of a larger UA string; I recommended yesterday that NewsMonster use the standard Mozilla UA + “NewsMonster/version” + an identifying URL.
Last year, Radio UserLand (another news aggregator) had a very serious bug that could mis-parse the HTML content of a single RSS feed in a way that rendered the entire program unusable. (See http://www.google.com/search?q=radio+double+decoding ) I was inadvertently triggering this bug almost every day during my web accessibility series last summer (due to the nature of my posts, which discussed HTML markup). I was forced to work around the bug by presenting Radio with an abbreviated feed that did not include full HTML posts. This would have been impossible if Radio had spoofed the User-Agent field.
Program vendors are arrogant and assume they’ll never make mistakes, but mistakes can happen to anyone. Even if NewsMonster cleans up its act, it should *never* lie about who it is. A combination UA is probably the best solution. Kevin said yesterday he was considering this, but then today’s post went all “user-configurable” and “spoof Mozilla” and “do anything to maximize the user’s experience”, so I am justifiably worried.
— Mark
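A combination UA of the kind recommended above (standard Mozilla UA, then a product/version marker, then an identifying URL) might be built like this. Treat it as a sketch under assumptions: the base Mozilla string is a placeholder, the version number and the example.org URLs are made up, and the "(+URL)" layout is just one common way to attach the URL, not something NewsMonster is known to use.

```python
import urllib.request

# Placeholder for whatever UA string the embedding browser would normally send.
BASE_UA = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021126"


def combined_user_agent(base_ua, product, version, info_url):
    """Append a product marker and an identifying URL to a base UA string."""
    return f"{base_ua} {product}/{version} (+{info_url})"


def identified_request(url, user_agent):
    """Build a request that announces itself honestly (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

A server-side filter can then match on the product marker alone, which is what makes workarounds like the abbreviated feed for Radio possible.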
<trivia>
Phoenix & Mozilla can also spoof the UA through the use of the preferences bar, available here: http://www.xulplanet.com/downloads/prefbar/
</trivia>
— Ken
Mark: Radio has a feature called the weblog neighborhood. There is also free “blogging ecosystem” data that could minimize the need for spidering.
I’m doing a little experiment on my blog. It uses an RSS feed to a Technorati Watchlist. I call this “Instant Comment”. It’s so obvious that I guess there are other people doing this.
http://blog.scriptdigital.com/index.php?entry=/Internet/Blogging/linvitation.html
— Emmanuel
Just to let you know, the latest build of Syndirella no longer puts the URL of its own site as the default referrer. The referrer is still configurable, but the default value for it is blank.
If it gets bad, you can always use mod_throttle.
Mark,
I’m glad you guys are talking about this stuff, but I don’t think I’ve seen you offer a compromise solution. NewsMonster is two things: an aggregator and an off-line reader for handhelds. I understand and respect your concerns about serving pages that are never even read, but how would you solve the “problem”?
If I were to suck in your site through plucker or some other tool, I assume you’d have the same concern, yes?
Jonathan, respecting robots.txt *is* the compromise solution.
Thanks for bringing Plucker to my attention. It has options for infinite-level spidering (much, much worse than NewsMonster), and it doesn’t respect robots.txt either. So it’s banned along with the rest of the rude bots I’ve been learning about in the past 48 hours. More on this soon.
— Mark
Arrgh. Messed-up UA strings are in the news all the time. I think it’s time we all agreed on a format for UA strings and got all browser/aggregator/spider/etc. makers to use this format. Browser makers could even include in the UA string what technologies they support, like CSS-2 or XHTML 1.1.
It would be something like:
ProductName/version (OS; supported technologies; revision) [AdditionalBrands/version] (site)
For example:
DiveIntoBrowser/1.0 (Windows XP; CSS2; XHTML1.1; 1.04) MegaCorpBrowser/8.0 (http://diveintomark.org/)
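To make the proposed layout concrete, here is a small sketch that assembles a string in that shape. The function and parameter names are invented for illustration, and the format itself is only the proposal from this comment, not an existing standard.

```python
def format_proposed_ua(product, version, os_name, technologies, revision,
                       site, extra_brands=()):
    """Assemble a UA string following the layout proposed above.

    technologies is a list such as ["CSS2", "XHTML1.1"]; extra_brands is a
    sequence of (name, version) pairs for any additional brands.
    """
    details = "; ".join([os_name, *technologies, revision])
    parts = [f"{product}/{version}", f"({details})"]
    parts.extend(f"{name}/{brand_version}" for name, brand_version in extra_brands)
    parts.append(f"({site})")
    return " ".join(parts)
```

Calling it with the values from the example (DiveIntoBrowser 1.0 on Windows XP, CSS2 and XHTML1.1, revision 1.04, one extra brand) reproduces the example string.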
Unfortunately using the User-Agent for identifying standards compliance won’t work because browser makers don’t really know how standards-compliant their browsers are until after they ship. Microsoft thought IE5 got the box model right…
I agree with the general sentiment though. At the very least, all robots should include a URL in their User-Agent to a page that explains who they are, what they do, and (if they have any spider-like behavior) how to block them in robots.txt. Ditto news aggregators (although ones that simply grab a single RSS file do not need to honor robots.txt, but a URL to their home page would still be nice).
For example, what the hell is nntp/rss? It’s been showing up in my logs recently.
— Mark
Never mind, I found it.
http://www.methodize.org/nntprss
— Mark
Hey! I like Plucker…isn’t there a better way to limit spiders than simply prohibiting their access? I mean, I’ll go submit the RFE to respect robots.txt right now…the only reason I use Plucker is because there’s no RSS reader for the Palm. But I definitely don’t abuse it and don’t feel like I should be punished because the app developers haven’t implemented robots.txt.
— Ken
Mark, you’re disallowing /palm/ in your robots.txt? Doesn’t that mean that even if you allowed Plucker, it couldn’t pluck the information it really wanted?
I’m also asking the Plucker developers if they’re interested in supporting robots.txt. I’ll get back to you with the answer, but I suspect it’s going to be “fix it yourself”, and I’m afraid I just don’t have the time. If you wanted to submit a patch, however, I’m sure they would appreciate it. (Yeah, like you have more free time than I do. ;)
Denying access to anything that caches pages for off-line reading is at least consistent, but it’s not very friendly. While I don’t do it any more, I used to be a big user of AvantGo; it gave me a chance to read stuff when I was stuck between meetings.
I’m really trying to understand what your rationale is for denying caching:
Is it that robots.txt should be an absolute and respected by all user agents? (A logical line in the sand, but a HUGE barrier to entry for new agents, yes?)
Is it that you don’t want to pay for bandwidth people use to load pages from your site that they may not read? Again, logical, but that disallows any type of store-and-read access: Plucker, AvantGo, etc.
Would it be acceptable for your stuff to be sucked once and cached elsewhere? Or is it that you don’t want to lose control of your content to any offsite caching? Logical, I WANT to cache stuff. Your Dive into Usability was great stuff, the sort of thing I’d hate to lose. Web content is all too ephemeral and so much good content I’ve bookmarked has eventually gone away. If you go down that path, you should disable cut and paste, which would pretty much make the usability stuff useless, yes?
Not trying to be a jerk, trying to understand what your rationale is. AvantGo, for instance, stores requests for pages that it doesn’t have locally for later download. If I were to suck in your RSS one level deep, I’d get none of the pages you reference, but clicking the link would grab the page for me the next time I synched. Sounds like you’d find this OK as long as AvantGo doesn’t lie about the user agent (though you’d have no way to know that the request wasn’t automated).
Thanks for your comments. I’m not trying to inconvenience readers, just robots and one-off leechers. Where that line is blurry, I will try to err on the side of allowing questionable access. But I don’t think you have a grasp of how bad the spambot/spybot/leecher situation really is. I will be publishing numbers shortly.
I will also be announcing a solution tailored for the needs of mobile users. In the meantime users of Plucker and AvantGo will continue to have full access.
— Mark
Jonathan, “Dive Into Accessibility” is fully downloadable in multiple formats:
http://diveintoaccessibility.org/
— Mark
Jonathan, robots.txt should be respected by any agent that follows links and recursively downloads pages. This includes NewsMonster (downloads RSS, follows all LINK elements, then downloads HTML and associated images), Plucker (downloads HTML recursively), AvantGo (apparently ditto, never used it), as well as offline bulk-downloaders like GetRight. It is *not* *that* *hard* to parse robots.txt; many languages have freely available libraries to do it. I found, learned about, used, and integrated Python’s robotparser.py in about half an hour.
*Respecting robots.txt would not stop you from reading my weblog offline*. Look at what I’m disallowing in my robots.txt. My stats pages, so that they don’t show up in searches. (Heh, that was fun.) My private playground, which is just a temp directory for testing new features and has no bearing on anything. My cgi-bin directory, for obvious reasons. /newdoor/, which generates a near-infinite number of pages, each of which hits a remote site and does several database queries. My /images/ and /pictures/ directories, because I choose not to participate in Google’s image search. /premium/, which was a joke. /magnetic/, my online magnetic poetry script which is nothing robots should be indexing.
None of this would prevent well-behaved robots from indexing thousands of pages of actual content — weblog content, projects, Safari info, even my latest “100 stories of unfamous people” project. None of this would stop you from reading my weblog offline.
There are lots of legitimate usage patterns that involve spidering, and robots.txt is a compromise to accommodate them. I’ve actually thought about what’s on my site, and what it’s reasonable for robots to have. But when I find programs that ignore all of that and simply take whatever they want, I am not at all inclined to go out of my way to meet them halfway.
— Mark
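For anyone wondering how little work the robots.txt check is, here is a minimal sketch using urllib.robotparser, the Python 3 home of the robotparser module mentioned above. The rules below are an assumed reconstruction that uses only the directory names listed in the comment (the real file is not quoted here), and the example.org host and the ExampleSpider name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, built from the directories named in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /newdoor/
Disallow: /images/
Disallow: /pictures/
Disallow: /premium/
Disallow: /magnetic/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Ask the parsed rules whether this agent may request this URL."""
    return parser.can_fetch(user_agent, url)
```

A spider would call may_fetch() before every request and skip anything that comes back False. In a real client, RobotFileParser.set_url() followed by read() fetches the live file instead of a string.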
I have offered to add robots.txt support to plucker-build (the main downloader/parser) later this week, but I am very angry with you for taking the vigilante approach and not filing a bug report to warn us before blocking. Don’t you like Python applications being popular?
Mark, it just dawned on me that all your concerns are really just about bandwidth, and you’re spending all this time and effort (and emotional juice!) assuming that your bandwidth is a finite resource.
The point I’m trying to make is that maybe your time would be better spent trying to figure out a way to increase your effective bandwidth rather than fighting an uphill battle with stupid robots.
I’m sure that somewhere among your numerous daily readers there are some that might help you find free/dirt-cheap hosting. And although I’m not too familiar with DNS settings and site-mirroring software, I’m sure there is a way for someone like you to set up your site on more than one web-server and load-balance the incoming traffic transparently across all the different servers. I’m sure that some of your fans would be more than eager to help on something like this.
I mean, it seems that your biggest problem is not “evil page-sucking robots” but much rather your new-found Internet Celebrity status and the fact that loads and loads of people (including some employing Evil Robots of Doom(tm)) want to download your pages.
I’m sure that if you suddenly stopped being popular all your bandwidth problems would go away… :-)
Just a thought. Take care.
P.S. Mark, for what it’s worth, I totally agree with you that all respectable robots and web spiders should respect robots.txt, although as some people have already pointed out there is a very broad grey line there.
MJR, as I said in a later comment, Plucker is *not* banned. I am working on a specific solution for Plucker/AvantGo users which will be better for all involved. In the meantime, they will continue to have full access.
Furthermore, I resent the implication that I’m the one being a vigilante here. *You’re* the one who should be aware of this stuff; *you’re* the one writing a spider. A quite rude one, as it turns out. These are 10-year-old guidelines we’re talking about here; it’s not like this came out of thin air.
— Mark
Mar, I don’t think success has much to do with it. Spambots don’t care how many links you have in the blogger’s ecosystem; all it takes is one for them to find your site, and they’ll rip it to shreds and steal everything they can find. Ditto spybots (Turnitin, NameProtect, Cyveillance — search Google).
Most of the rest is simply ignorance, either people not understanding how rude they’re being by downloading every single Safari test case and associated full-screen images (I mean really, how many people need that other than Dave Hyatt?), or not understanding the options of their downloading program and accidentally spidering the whole site.
— Mark
Oh, and I did lobby for a way to increase my effective bandwidth: I got mod_gzip installed on my server. But then I changed to a new server and it’s not installed here yet. :( But even so, most of these bots are not smart enough to request compressed content.
— Mark
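On the client side, asking for compressed content takes one extra header and one extra step when reading the response. The sketch below uses only the Python standard library; the function names are illustrative, and it assumes the server answers with either gzip or an uncompressed body.

```python
import gzip
import urllib.request


def gzip_request(url, user_agent):
    """Build a request that tells the server gzip-compressed bodies are welcome."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept-Encoding": "gzip"},
    )


def decode_body(body, content_encoding):
    """Decompress the body only if the server says it used gzip."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url, user_agent):
    """Fetch a URL, transparently handling a gzip-compressed response."""
    with urllib.request.urlopen(gzip_request(url, user_agent)) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

A bot that sends no Accept-Encoding header always gets the full uncompressed page, so it gains nothing from something like mod_gzip on the server.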
Success or no success, I *wholeheartedly* agree with your principles on robots.txt and rogue bots, but still think that if bandwidth usage is your problem, you might benefit from (at least partially) shifting your focus from fighting stupid program(mer)s towards finding ways to increase your effective bandwidth.
Either way, good luck. :-)
Why isn’t there some sort of standard for compressed spider data? Mark is right, robots.txt has been around forever and should be observed (i.e. be nice and listen to what I ask you to do, or I’ll blacklist you). BUT there are a lot of different powerful things that you can get by spidering the web.
It seems pretty simple to create a localhost spider that runs when YOU want it to and creates a gzipped dump of name, title, meta-tags, and then whatever in the way of links and images you want to allow.
Then you can handover a relatively small single file transfer to a nice spider, especially useful if your site is a heavy user of dynamic content.
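As a rough sketch of that idea: the site owner collects per-page records and packs them into one gzipped file that a spider fetches in a single request. No such standard exists, so everything here is an assumption, including the JSON layout, the field names, and the function names.

```python
import gzip
import json


def make_site_dump(pages):
    """Pack page records into one gzipped JSON document.

    pages is a list of dicts, for example
    {"url": "/", "title": "Home", "meta": {...}, "links": [...]}.
    """
    payload = json.dumps({"pages": pages}, sort_keys=True).encode("utf-8")
    return gzip.compress(payload)


def load_site_dump(data):
    """Unpack a dump produced by make_site_dump (the spider's side)."""
    return json.loads(gzip.decompress(data).decode("utf-8"))["pages"]
```

The site owner would regenerate the dump on their own schedule and serve it as a static file, so a dynamic site answers one cheap request instead of thousands of page hits.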
Mark, an fyi - I just saw your comment about nntp//rss. The next release will contain a URL within the user-agent field.
§
© 2001–8 Mark Pilgrim