NewsMonster (currently beta 1) is a preview of where personal news aggregation may be headed. It is, for lack of imagination at this late hour, an uber-aggregator. Ignoring for the moment all the things it doesn’t do yet (which all sound quite cool), it has one particularly disturbing feature: extracting full HTML content from linked RSS items. The feature is off by default, but once turned on (one checkbox during installation), every time it finds a new RSS item in your feed, it will automatically download the linked HTML page (as specified in the RSS item’s link element), along with all relevant stylesheets, Javascript files, and images.
Clearly, we have crossed a line here. Other aggregators have various options to reformat or display content from RSS feeds, even HTML content stored in the description or content:encoded elements. That’s fine, that’s what RSS feeds are for. You can include as much or as little content as you wish. But NewsMonster goes a step further by essentially bundling a bulk auto-downloader with its aggregator. (Hence uber-aggregator, a term which I’m intensely disliking already.)
NewsMonster doesn’t care if you don’t provide full content in your feed. In fact, it doesn’t care if you do; it will still download the original HTML pages and images anyway. And, more disturbingly, it doesn’t respect robots.txt like other well-behaved bulk-downloaders. (For example, Wget is a Free Software program which can recursively retrieve web pages; it supports robots.txt). In fact, NewsMonster doesn’t currently provide a unique User-Agent string (it just comes across as a standard Mozilla browser), so it couldn’t possibly support robots.txt.
There is currently a discussion going on over on Ben Hammersley’s site about these issues, including the fundamental issue of whether uber-aggregators ought to respect robots.txt. Obviously, I think they should, starting with NewsMonster. Let’s review:
- A program retrieves a resource (RSS)
- It parses the resource for links (link elements)
- It follows those links and retrieves the linked resources (HTML pages)
- It parses each of those resources for links (to CSS, JS, images)
- It follows each of those links and retrieves those resources
At this point, this certainly sounds like the sort of program that ought to be respecting robots.txt. If I showed you a program that downloaded your home page (or any random page) and then followed all the links on that page, and downloaded all of those pages and all of the images on all of those pages, and then I told you that there was a simple standard way to control such programs but that this particular program didn’t support that standard, you’d scream bloody murder. (There are such programs, and they are considered the scourge of the industry, in the same league as spambots and image leechers.) The fact that the initial resource is an XML file instead of an HTML file, and the links it follows are spelled link instead of a, is all completely beside the point.
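To make the comparison concrete, here is a rough Python sketch of what steps 2 and 3 could look like in a program that does consult robots.txt before following links. It is not NewsMonster's code. The user-agent name `ExampleAggregator` and the helper names are made up for illustration, and the sketch assumes all the links live on the one host whose robots.txt text is passed in.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical name; a real aggregator would send its own unique User-Agent.
USER_AGENT = "ExampleAggregator"


def item_links(rss_text):
    """Return the text of the <link> child of every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.find("link")
        if link is not None and link.text:
            links.append(link.text.strip())
    return links


def allowed_links(links, robots_txt, user_agent=USER_AGENT):
    """Keep only the links that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in links if parser.can_fetch(user_agent, url)]
```

With something like this in place, a site owner who disallows a path for that user agent would simply never see requests for it, and the feed itself would still display.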
Fortunately, the current beta of NewsMonster sends a forged referrer, which we can use to deny access in a .htaccess file. Note: for this to work, you’ll need mod_rewrite installed on your web server, and you’ll need privileges to define your own rules (probably AllowOverride All). If you don’t know what this means, or if you don’t know whether you have them, don’t attempt this until you move to a better host (like Cornerhost) that lets you do this sort of thing. Defining mod_rewrite rules incorrectly, or when you don’t have privileges to do so, will instantly render your entire website inaccessible.
Now then. In your .htaccess file:
RewriteEngine on
RewriteCond %{HTTP_REFERER} newsmonster\.org
RewriteRule .* - [F,L]
Which, in English, says to turn on mod_rewrite and, if any requests at all (.*) come in with the referer newsmonster.org, deny them (F flag) and don’t process any further rewrite rules (L flag).
Note that this will also deny NewsMonster the chance to read your actual RSS feed. You can deal with this in a number of ways, depending on your server setup. I keep all my RSS feeds in a subdirectory, so in that directory I added a second .htaccess file with a single directive:
RewriteEngine off
Which is crude but effective. Another solution, if your RSS feed is on your root level, is to add a special rule allowing anyone to download it, before the rule that says that NewsMonster can’t. Like this:
RewriteEngine on
RewriteRule ^index\.xml$ - [L]
RewriteCond %{HTTP_REFERER} newsmonster\.org
RewriteRule .* - [F,L]
Which, in English, says that if a request comes in for index.xml on the root level, process it without modification (the -), and don’t process any further rewrite rules (the L flag)… including the second rule, which would have denied access to NewsMonster, if we had gotten that far, which we won’t.
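If the rule ordering is hard to picture, the following Python sketch mirrors the same first-match-wins logic as a plain function. It is only a model of the two rules above, not a replacement for mod_rewrite. The function name `decide` and the return values are invented here, and the path is assumed to arrive without a leading slash, the way a root-level .htaccess sees it.

```python
import re

# Patterns mirror the two .htaccess rules; everything else is illustrative.
FEED_PATTERN = re.compile(r"^index\.xml$")
BLOCKED_REFERER = re.compile(r"newsmonster\.org")


def decide(path, referer):
    """Return 'pass' or 'forbidden' for a request; the first matching rule wins."""
    # Rule 1: the feed exception. Serve it unchanged and stop processing (L).
    if FEED_PATTERN.search(path):
        return "pass"
    # Rule 2: any other path with a matching Referer is denied (F) and
    # processing stops (L).
    if referer and BLOCKED_REFERER.search(referer):
        return "forbidden"
    # No rule matched, so the request is handled normally.
    return "pass"
```

Swapping the two checks would deny the feed as well, which is why the exception has to come first.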
This hack works for me, and to its credit, NewsMonster handles it quite well. It downloads my RSS feed, attempts to follow the links, fails, and simply displays the content from the RSS feed without the extra cache link that would normally bring up the cached HTML page. No incomprehensible error messages or anything.
Note that this is all quite tentative, since the next version of NewsMonster may behave differently. It may send a different referrer string or none at all, thus rendering this hack useless. It may respect robots.txt, thus rendering this hack unnecessary. (If it does, I’ll show you how to set that up, when the time comes.) However, one thing is for sure: other uber-aggregators are coming, and maintaining control over your own content won’t get any easier.
Update: my rssfinder.py script now honors robots.txt. It downloads the file once per domain and caches it for the rest of the session. (Difficult searches may involve downloading pages from multiple domains, so it will only download robots.txt once per domain.) Site owners who do not wish to allow rssfinder to search their site for RSS feeds may put the following in their robots.txt file:
User-agent: rssfinder
Disallow: /
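For anyone writing a similar tool, the once-per-domain caching described above might look roughly like the sketch below. This is not the actual rssfinder.py code. The class name `RobotsCache` and the `fetch_text` callback are assumptions made for illustration, and the caller is expected to supply the function that actually downloads a URL and returns its text.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Fetch robots.txt at most once per domain and reuse it for the session."""

    def __init__(self, fetch_text, user_agent="rssfinder"):
        # fetch_text is a caller-supplied function: URL -> robots.txt body (str).
        self._fetch_text = fetch_text
        self._user_agent = user_agent
        self._parsers = {}  # maps "scheme://host" to a parsed robots.txt

    def allowed(self, url):
        """Return True if robots.txt for url's domain permits fetching url."""
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # First request for this domain: download and parse robots.txt.
            parser = RobotFileParser()
            parser.parse(self._fetch_text(origin + "/robots.txt").splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(self._user_agent, url)
```

A search that wanders across several domains would call `allowed()` before each download, and each domain's robots.txt would be requested only the first time it is needed.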


The thought that aggregators would begin to request non-RSS content in the same repetitive, bandwidth-hogging manner they request RSS feeds has been bothering me ever since I saw the “web feed” feature of Syndirella and felt it was just a matter of time before aggregators took that one step further.
I’m in the process of writing my column on RSS aggregators where I now keep getting torn between warning people about using or implementing such features in news aggregators and letting sleeping dogs lie. *sigh*
This Newsmonster app looks very cool and I sincerely hope the author listens to reason by modifying his app to become better behaved instead of needing Apache config file hacking.
Comment by Dare Obasanjo — Thursday, February 20, 2003 @ 2:45 am
Yeah, I wrote a small rant on my site the other day about referrer abuse, where various search-engine-type things are sending what is basically fake referrer information. That is deliberate abuse, whereas the problem you are talking about is merely a side effect of what probably seemed like a good idea to the author of the program.
Both problems are fairly easily solved by technological means at the moment but I fear that as weblogs become more popular that people will find ways to abuse them that aren’t so easily dealt with.
Also - I’m not sure that downloading the site in the way you describe is necessarily a problem. For the sites I visit regularly I read all the pages anyway so the page needs to be downloaded one way or another. Does it really make any difference if it’s done when I look at the article or by a program doing so in advance? Or have I missed the point here?
Comment by John Burton — Thursday, February 20, 2003 @ 4:55 am
On a more positive note, is that aggregator-within-Mozilla thing just awesome or what? That’s one form of browser integration I can live with.
Comment by Elvis X — Thursday, February 20, 2003 @ 9:07 am
aggro-gator, I think.
John Burton, the problem arises in situations where you have the full post on the front page, the full post in the feed, and the full post on its own in a permalink.
In current aggregators (if they have proper http header support) they download the feed once, and if the user finds the post interesting enough, he’ll go to the permalinked page once.
The ‘monster, when its aggressive downloading thing is turned on, will download the feed and the post, no ifs or buts.
The problem arises when you, for example, publish four or five posts in a day. The ‘monster will download the feed, all the posts as well as probably the front-page with all the posts in it (depending on your setup).
Every post is effectively downloaded three times.
It really turns into a problem when you have a dynamically generated site. Then you risk having the majority of your recent content, all downloaded three times every hour including all images and style-sheets.
And if it downloads embedded movie-files automatically as well, you can see how having only a few ‘monster-enabled users with that optional feature turned on can cost sites real money.
Comment by Baldur Bjarnason — Thursday, February 20, 2003 @ 9:09 am
Would someone downloading your home page daily and reading everything you write bother you? The impact on your server would be the same.
I suppose it’s a bandwidth problem, with NewsMonster redownloading everything every half hour or something.
If that’s the case then what you want is probably robust change indicators (etags, whatever) and a change rate estimator in the aggregator.
Comment by Duncan Wilcox — Thursday, February 20, 2003 @ 9:21 am
__The fact that the initial resource is an XML file instead of an HTML file, and the links it follows are spelled “link” instead of “a”, is all completely beside the point.__
FWIW, Mark, I think this is precisely the right way to look at this matter, in line with the TAG’s principles of the Web work, as I understand it. A resource is a resource is a resource, no matter the (often transient & certainly variable) representation of it.
I have NewsMonster in one of my browser tabs just now, waiting to find some time to play with it. But it not supporting robots.txt is a major faux pas and let’s hope something that the developer simply hasn’t implemented yet, rather than a conscious choice.
Comment by Kendall Clark — Thursday, February 20, 2003 @ 9:49 am
Re: #5. There is an extreme case, where someone came to my site and read absolutely everything I wrote, including all comments. Of course, in that case, they would not have downloaded my RSS feed as well (which BTW contains full HTML of each post, *and* a plain-text description, both of which NewsMonster displays). But fine, I will grant that in the extreme case, the difference is negligible.
And NewsMonster makes sure that *every* *single* *reader* who subscribes to my RSS feed is the extreme case.
You can say all you want that “I put it out there, so I should expect it to be downloaded”. Yeah, I put it out there, but not necessarily for you. That’s why we have robots.txt.
Comment by Mark — Thursday, February 20, 2003 @ 9:50 am
Re: #6. The developer has stated publicly that he has no plans to support robots.txt. See comments of http://www.benhammersley.com/archives/004118.html (no permalinks, it’s the 3rd or 4th comment, a response to mine): ” NewsMonster doesn’t support robots.txt. Of course, it isn’t a robot (IMO). I also have no plans to support robots.txt in the future. NewsMonster is a user agent not a robot.”
Hence this post.
Comment by Mark — Thursday, February 20, 2003 @ 9:56 am
Does NewsMonster download the linked pages for every feed, even if it includes the full content in the feed? If so, that’s a bug, IMO.
But worrying about your bandwidth seems so petty. Haven’t we gotten past that era? Maybe NewsMonster can use its P2P mojojojo to share the cached pages with other monsters.
“maintaining control over your own content won’t get any easier”
And that’s a good thing, no?
Comment by Aaron Swartz — Thursday, February 20, 2003 @ 11:17 am
“But worrying about your bandwidth seems so petty. Haven’t we gotten past that era?”
Not unless you’re willing to cover the hosting fees. Bandwidth is still treated like a scarce resource, with the corresponding expense. For example: http://www.thismodernworld.com/weblog/mtarchives/week_2003_02_16.html#000234
Comment by Andrew Raff — Thursday, February 20, 2003 @ 11:44 am
Re: #9. Once this feature is enabled application-wide, it applies to all feeds, regardless of how much or how little information they include in description or content:encoded.
Comment by Mark — Thursday, February 20, 2003 @ 11:51 am
There are a lot of bloggers out there who write pretty well and don’t have/want expensive hosting options. For them, bandwidth is a concern.
A robot is (IMO) anything that globs your content in an automated fashion. Which is why robots.txt is there as a standard to explain your preferences.
Also, some people don’t put whole content in RSS - they just put a relevant summary. I believe this is because RSS is *intended* for enhancing readability for machines - not humans.
If it is a news aggregator, it should aggregate what the news provider has set aside for aggregation, not leech the whole site.
If there are 25 Newsmonster clients getting info from my site, a cheaper option for me would be to generate a tar ball of the whole site and ask people to download it.
Comment by Babu — Thursday, February 20, 2003 @ 12:13 pm
Methinks: ‘overaggregator’.
Comment by SteveB — Thursday, February 20, 2003 @ 12:23 pm
Not to change the subject, but this seems to be a good place to show this. I found this in my access logs recently:
12.148.209.196 - - [16/Feb/2003:16:03:22 -0500] "GET /weblog/ HTTP/1.1" 200 8615 "-" "NPBot-1/2.0 (http://www.nameprotect.com/botinfo.html)"
Quoted from the page linked in that User-Agent string:
As a Digital Brand Asset Management company, NameProtect engages in crawling activity in search of a wide range of brand and other intellectual property violations that may be of interest to our clients.
Scary. They’ve already spidered my whole site, but NPBot has the proud title of being the first bot in my robots.txt file.
Comment by Tony — Thursday, February 20, 2003 @ 12:36 pm
#14 - yes. I too got NPBot hits. And that actually prompted me to put up a robots.txt file :-) Let us see if they honor robots.txt as they claim they do.
Comment by Babu — Thursday, February 20, 2003 @ 12:50 pm
I have been analyzing my recent log files and researching User-Agents. NPBot has visited repeatedly, along with several people using bulk-downloaders (or IE’s “offline content” feature) to download entire sections of my site (like my Safari test cases). Preparations are underway to mitigate this sort of rude behavior.
Comment by Mark — Thursday, February 20, 2003 @ 1:38 pm
Er… I guess I’m missing the point here but what is wrong with somebody using a tool to download a group of files they want? If they didn’t use a bulk-downloader then presumably they would just download each file separately?
Comment by Matthew Pusey — Thursday, February 20, 2003 @ 1:54 pm
I’ve posted a few comments on this topic over on Ben Hammersley’s weblog, but I wanted to make a quick response here to something that Mark said above. I agree that spambots and such are the scourge of the Internet, and should be blocked. But it seems unfair to lump NewsMonster and IE’s Offline Web Pages feature in with that lot. The latter are true “user agents” (too bad that term has already been taken) in that they are acting on an individual user’s behalf in order to improve their browsing experience. So even though they may function in a similar fashion, they are fundamentally different from the kinds of spiders that robots.txt was designed to control.
BTW, I haven’t yet tried NewsMonster, so if I’ve mischaracterized its functionality in any way, I apologize in advance.
Comment by Scott Trotter — Thursday, February 20, 2003 @ 2:22 pm
Baldur Bjarnason:
> The problem arises when you, for example, publish four or five posts in a
> day. The ‘monster will download the feed, all the posts as well as probably the
> front-page with all the posts in it (depending on your setup).
> Every post is effectively downloaded three times.
> It really turns into a problem when you have a dynamically generated site. Then
> you risk having the majority of your recent content, all downloaded three times
> every hour including all images and style-sheets.
What? Did you do your research? NewsMonster will NOT download content three
or four times!
NewsMonster is very efficient and will ONLY download content once.
Content fetches are even load balanced within the aggregator so that I don’t
swamp one site with multiple downloads at once.
> And if it downloads embedded movie-files automatically as well, you can see how
> having only a few ‘monster-enabled users with that optional feature turned on
> can cost sites real money.
No… you didn’t do your research!
NewsMonster will not download large binaries. Right now the max contentLength
that NM will download is 100,000.
Comment by Kevin Burton — Thursday, February 20, 2003 @ 2:52 pm
From Mark:
> And NewsMonster makes sure that *every* *single*
> *reader* who subscribes to my RSS feed is the
> extreme case.
No! This isn’t true either!
All of these features are DISABLED BY DEFAULT.
If someone wants this functionality then there is a REASON they want it.
… they probably want to read your website on their PDA or on the train (on their laptop).
… god forbid….
Comment by Kevin Burton — Thursday, February 20, 2003 @ 2:57 pm
Aaron says:
> But worrying about your bandwidth seems so
> petty. Haven’t we gotten past that era? Maybe
> NewsMonster can use its P2P mojojojo to share
> the cached pages with other monsters.
I have plans to use ETags and ZeroConf to swarm NewsMonster downloads when there are other NM users on the local subnet.
I am also thinking about implementing similar functionality by building a DHT but this is in the future.
The ZeroConf support will allow NewsMonster users to avoid killing a link when all of them are at a conference together.
Comment by Kevin Burton — Thursday, February 20, 2003 @ 3:01 pm
Yes, Kevin, as I made quite clear in my original post, this feature is disabled by default in NewsMonster. A single checkbox during installation enables it, and it enables it application-wide (i.e. there are no per-feed settings).
If someone wants this functionality, that’s fine, they are free to enable it. And if I, as a site owner, wish to ask them politely not to do it, I am free to do so. robots.txt is a compromise between the wishes of end users and the wishes of site owners. Please play nicely.
The other upcoming bleeding edge features you mention sound super-cool. Please implement 10-year-old Internet standards first.
Comment by Mark — Thursday, February 20, 2003 @ 3:06 pm
In the spirit of leading by example, my rssfinder.py script now honors robots.txt.
http://diveintomark.org/projects/rss_finder/version_11.html
Comment by Mark — Thursday, February 20, 2003 @ 3:09 pm
It’s worth noting that Mozilla now has a ‘link prefetching’ feature:
http://www.mozilla.org/projects/netlib/Link_Prefetching_FAQ.html
This feature must be both explicitly enabled by the user in the browser’s preferences, *and* only follows [link rel="prefetch" href="/someurl"] or [link rel="next" href="2.html"] elements. What it does *not* do is automatically download all the linked content from a page.
Using this as an appropriate model for a user agent, NewsMonster should only prefetch content linked from feeds that have some sort of indicator that it is allowed (LazyWeb: a ‘prefetch’ RSS module).
Furthermore, both NewsMonster and Mozilla should obey the relevant robots.txt file associated with the linked resource, to prevent DDOS attacks. (Mozilla’s prefetching is not limited to the originating server).
Comment by Michael Bernstein — Thursday, February 20, 2003 @ 4:14 pm
Scott Trotter:
> The way I used to use this feature was as follows: I have a half-hour train ride
> to and from work every day. I had my laptop set to download a list of sites
> every weekday morning at 5 a.m. and again in the afternoon at 4 p.m. The sites
> included CNET, NYT-Tech, Wired, GMSV and a few others. I could then read the
> news on the train using my laptop with IE in offline mode. This was a tremendous
> time-saver for me. I’ve since switched to using a Pocket PC for the train ride,
> but I still use Offline Web Pages for a few sites that I look at in the evenings
> at home.
Yes. This is exactly what the NewsMonster offline cache is designed to handle.
It isn’t enabled by default and it is meant to be *used*… The goal isn’t to
waste network resources.
…
> KEY POINT: If Offline Web Pages obeyed the Robot Exclusion Protocol, it would
> render this valuable feature completely useless.
…
Yes. It would. My primary concern is for my users. If my users WANT
robots.txt I will give it to them. I highly doubt they would as it would render
large portions of the Internet unavailable to their PDAs and offline laptops.
> First, the offline user agents need to be very smart and efficient. They
> shouldn’t try and download content that they already have in their cache. (Sites
> like CNET which have multiple CMS-generated URLs that point to the same article
> complicate this.)
I don’t try to download content more than once.
Mark:
> Going back into my logs (just for February), I have found several instances of
> abuse of this “feature”; for instance, downloading my entire Safari pages,
> complete with ~70 test cases (2 full-screen images apiece).
I don’t see why you would have a problem with someone reading your website?
If you disable NewsMonster would you at least have the courage to disable
Mozilla and Internet Explorer too?
Scott Trotter:
> As an aside, when Mozilla 1.0 was released, I switched to it as my primary,
> default browser away from Internet Explorer. The one feature that I miss most
> from IE, and which I still use IE for, is Offline Web Pages.
Use NewsMonster ;)
Comment by Kevin Burton — Thursday, February 20, 2003 @ 4:45 pm
Re: “I don’t see why you would have a problem with someone reading your website?”
I publish my home phone number online, but it’s unlisted in the local yellow pages. When I do business with companies and they ask for it, I tell them it’s unlisted. When telemarketers call, I tell them to put me on their do-not-call list.
Just because I have a phone doesn’t mean I want anyone calling me for any reason. Conventions of privacy (and, in some cases, laws) have evolved for those who wish to take advantage of them. If I tell a company my number is unlisted and they sell it anyway, I can choose to stop doing business with them (and I have). If a telemarketer refuses to abide by federal law when I tell them to put me on their do-not-call list, I can sue (luckily I have not had to do this).
The robots.txt standard is similar. It strikes a compromise between what end users want (everything, right away) and the realities that site owners must deal with (bandwidth is expensive, some visitors and usage patterns are more worthwhile than others).
If you refuse to play by these rules — which, incidentally, have worked for many people for many years — then that’s fine, but don’t expect any courtesy in return. By your refusal, you are putting your product in the same class as scumware and spambots, and we’ll act accordingly.
Comment by Mark — Thursday, February 20, 2003 @ 5:49 pm
Bottom line: Mark sees it as theft of his writing, as would any writer; Kevin sees it as providing a service.
Writing is a deeply personal activity. Writers spend enormous amounts of time tailoring their content and their blogs to their users while providing the presentation they want to provide. Something which swipes large chunks of it at a time just feels somehow wrong to the writer.
The problem here is, both of them are right. But neither one is necessarily more right than the other.
Comment by Dave Cantrell — Thursday, February 20, 2003 @ 6:08 pm
I just tried out newsmonster, and enabled the page fetching thing. I tested it out on a few RSS feeds, including MetaFilter’s. I was surprised to see that it not only grabbed the RSS file, it displayed a munged up view of the home page under every single entry’s “toggle visual content”. Also disturbing from a server admin angle was that the “content” link led to a scraped version of the comment page.
This would mean that not only did it grab the single RSS feed for the last 24 hours of metafilter, it also grabbed the 25-30 comment pages beneath, and the home page.
Since the metafilter site is a community, and run on very little money and cheap hardware, the average page load times in the daylight hours are a bit slow. The index page typically takes a few seconds to load, and comment pages can take 5-10 seconds to load and render. Based on what I’m seeing, it appears that a single newsmonster refresh is pummeling my db server and site.
Is there any compromise between simply giving people RSS feeds rendered as HTML, and downloading every possible link going off them, whether they want to read it or not, and then also grabbing the home page to attempt to render that as well? Can the “content” links lead to fresh grabs of data instead? That way people would only download what they want to read.
Comment by mathowie — Thursday, February 20, 2003 @ 6:18 pm
Matt, it seems clear from today’s discussion that NewsMonster will continue its rude and potentially crippling behavior. So I would investigate options like mod_rewrite (see my original post). Also, search Google for “mod_rewrite tutorial” or “spambot htaccess”. If all your pages are dynamically generated, you could roll your own code to check User-Agent/Referer fields. Wouldn’t stop them from hitting the page and running the script, but you could cut down on the database access, and not give them any links to follow.
You should do some analysis on your logfiles, if you haven’t recently. Chances are, you have similar problems from email harvesters like EmailSiphon and bulk downloaders like GetRight or FlashGet. mod_rewrite can help handle those too (although most offer “helpful” options to spoof the User-Agent, but most users don’t know how or don’t bother). When I say NewsMonster is putting itself in bad company, this is the company I’m talking about.
I’ll be publishing the results of my own logfile explorations soon.
Comment by Mark — Thursday, February 20, 2003 @ 6:34 pm
There’s another newsmonster feature I noticed. For one reason or another, some people have excerpted feeds, where only the first 200 characters, first sentence, or partial first paragraph are syndicated. Newsmonster seems to do a good job of grabbing the full posts for display in both the content and toggle visible areas, which would seem to be against the wishes of the site authors.
If they only wanted you to see the first sentence, then click through to their site to see the rest of the entries you are interested in, it seems like pulling down the full page anyway is sort of crossing a line against their wishes.
Comment by mathowie — Thursday, February 20, 2003 @ 6:35 pm
Mark: Nice to see a civil list of complaints this time, instead of just calling the guy an idiot….
Kevin: ZeroConf? Wouldn’t that require telling other clients on your subnet what sites you’re subscribed to? ZeroConf is cool, but the privacy implications of asking your subnet peers “hey, do you have a copy of site X yet?” seem severe enough that such a feature should definitely be disabled by default, and have a warning when enabled.
Mark: You don’t like people using IE’s browse offline mode on your site? Why not? I think you’re taking it a little far if you actively try to block that. I’m real tempted to start ripping your site regularly from different IP addresses, using wget, ignoring robots.txt, and forging my UA tag, just to keep you on your toes…. Btw, how would you stop something like that?
I’ve watched this whole newsmonster thing very closely and will continue to. I’ve got mixed feelings about the subject, like many, but I think the end result will be an excellent piece of software. I hope Mark gets thanked in the NM acknowledgements one of these days! :)
Comment by xml dreamer — Thursday, February 20, 2003 @ 6:36 pm
I’m glad this still works:
curl -A " " http://diveintomark.org/|less
…even though this doesn’t:
curl http://diveintomark.org/|less
Out of curiosity, Mark, how many spam bots do you seriously think send Curl as their UA string? And how many do you think send the UAs of legitimate browsers?
Comment by xml dreamer — Thursday, February 20, 2003 @ 6:41 pm
As I said (more fully on Ben Hammersley’s site), examination of my logfiles shows virtually no one using IE’s “offline content” feature the way it was intended (to subscribe to individual pages). But I see lots of people abusing it to simply download entire sections of my site, such as my Safari information pages (which include many, many full-screen images of bug test cases, including test cases from previous builds that aren’t even relevant anymore).
Obviously if someone were sufficiently determined and/or bored, they could spoof the User-Agent string and go about downloading things for no reason. (This is not even terribly difficult in most programs.) I am aware that, by bringing up the issue, I am opening myself up to this sort of attack (and I don’t use that word lightly). If it becomes a problem, there are more drastic solutions I could explore to fight back against such attackers. See, for example, http://www.neilgunton.com/spambot_trap/
Comment by Mark — Thursday, February 20, 2003 @ 6:45 pm
Many users want to have content prefetched, because they don’t want to wait after every click, particularly if they have a slow link.
Many site owners want bandwidth to be minimized, because they pay for bandwidth.
There is clearly a middle ground which the robots.txt exclusion doesn’t solve: I know in advance that I read everything that Mark posts to diveintomark (Thanks, Mark!), so the bandwidth costs to Mark are exactly the same if it’s prefetched or not. Yet if I would honor his wishes and not prefetch, he gains nothing and I lose time.
Even if NewsMonster doesn’t achieve the goal of only prefetching that which I will download anyway, some future user agent will. We need something better than robots.txt, because as it stands it doesn’t make sense for that new user agent to honor it.
Comment by Aryeh Sanders — Thursday, February 20, 2003 @ 7:34 pm
A request. I have recently made the decision not to have net access at home (gasp!). The reason for this is that I love dinking around on the net too much. There are non-net things I would like to spend my time at home doing, but I seem unable to stay away from my computer if it has net access. ;)
As a compromise I’m working to get a nearby coffee shop hooked up with wireless access, so that when I need to get online I can semi-conveniently walk down there.
One of the main reasons for this is that I send (and receive) a lot of email. Unfortunately email is a vicious cycle and the more email you send the more you receive. I am no longer willing to spend 3-5 hours a day reading and replying to email. Instead what I do is use Mozilla’s offline mode to sync my email to my laptop when I leave work, and then read/reply to it offline at home. My hope (and so far it seems to be working) is that by limiting myself to responding to email in this way I break the cycle.
Anyway, there are some web sites I’d like to read offline in this manner as well. I have yet to pursue a tool to do this with but news monster had seemed a reasonable choice for a Linux laptop. It would be a shame if I couldn’t read your site offline.
Adam.
Comment by adam — Thursday, February 20, 2003 @ 8:09 pm
Even God gave us the middlefinger despite the danger of its misuse. An equivalent of a warning label around the middlefinger would separate type As from others.
Comment by Don Park — Thursday, February 20, 2003 @ 9:02 pm
NPBot update: I told NPBot to bugger off in my robots.txt file. It came back later today, but it seems to be honoring the directive.
Comment by Tony — Thursday, February 20, 2003 @ 9:49 pm
Dave Cantrell: “Bottom line: Mark sees it as theft of his writing, as would any writer; Kevin sees it as providing a service.”
This is a strawman argument. Mark has *not* objected to anyone ‘stealing’ his content.
He is objecting to rude robots *unnecessarily* hammering his (and other people’s) server(s). If a user agent (robots and spiders are a subcategory of user agents) downloads the RSS and then proceeds to download the permalinked HTML version for every post, all stylesheets, images, and further linked pages, what exactly is the point of providing an RSS feed in the first place? You might as well go back to scraping the website directly.
There may in fact be other webloggers who make this objection, and these are likely to be the same people who do not provide full content in their feed. Their wishes should be respected as well.
The robots.txt file is a very well established mechanism for making the statement “don’t suck this stuff down automatically”, with whatever qualifications and exceptions you wish to add. Not paying attention to the robots.txt file is the equivalent of ignoring a prominently posted sign, whether it’s a ‘yield’ sign governing traffic, or a ‘no-trespassing’ sign refusing access.
As for the application category name, how about ‘Hoggregator’?
Comment by Michael Bernstein — Thursday, February 20, 2003 @ 10:02 pm
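To make the robots.txt mechanism described in the comment above concrete, here is a minimal sketch of the check a well-behaved aggregator could run before fetching a linked page. It uses Python's standard `urllib.robotparser`. The user-agent name `ExampleAggregator`, the `example.org` URLs, and the robots.txt rules are assumptions made up for illustration; this is not how NewsMonster or any other tool mentioned here works. Note that the check only makes sense if the tool sends its own User-Agent string in the first place.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real client would download this from
# the site's /robots.txt; it is inlined here so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: ExampleAggregator
Disallow: /archives/

User-agent: *
Disallow:
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    feed = "http://example.org/index.rss"
    post = "http://example.org/archives/2003/02/20/post.html"
    for url in (feed, post):
        print(url, may_fetch(ROBOTS_TXT, "ExampleAggregator", url))
```

With these made-up rules the feed itself stays fetchable while the permalinked archive pages are off limits to the named agent, which is the kind of qualification the comment describes.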
Holy shit! I have no idea WHAT you’re talking about, but I love robots! Yay for robots!
Comment by Taylor — Thursday, February 20, 2003 @ 10:26 pm
As a reader, my rule No. 1: respect the one who writes the thing I am reading. I don’t see a problem if people can’t grab Mark’s site as a whole. If you do read DiveIntoMark every day, you could easily save the main page (or comments) once per day for offline reading. Is it necessary to get the full HTML content of all your subscribed blogs? Isn’t RSS designed for scanning (searching) for what you are interested in reading?
I doubt one would have enough time to read it all if NM’s prefetching feature is enabled. That’s the real waste of time.
Comment by yowkee — Thursday, February 20, 2003 @ 10:35 pm
Another way to describe this conflict is one between the provider of information wanting to be able to decide how that information will be accessed and a software author ignoring an established standard for the sake of a new feature. If my new webpage-watcher downloads your entire site every 30 seconds to compare with a previous version so that I can notify users of changes to your site, am I being a considerate player in this game?
Comment by Jim McCoy — Thursday, February 20, 2003 @ 11:59 pm
Now you know what big media feels like. Hey! Stop ripping those MP3s!
I happen to think this is more a social (law) problem than a technical one, but the chances of a reasonable law covering something this geeky are pretty darn low.
What Mark’s doing here is a [club solution].
What’s needed is [identity] and trust.
Given that, you can FOAF (or otherwise meta-data) undesirables, and the server owner can make policies as she likes.
No reliance on gentlemen’s agreements (like robots.txt) that anyone watching closely knows can’t last.
Now, lest I come off sounding pro-DRM, I am definitely not. I am quite sure that TBL didn’t expect the web to explode like it did, and I’m also quite sure he expected social issues to be dealt with by the correct authorities in a more appropriate manner.
So here we are. We can lobby for the social solution (make it illegal to forge referrer, for example), or we can invent a (less-than-perfect, but maybe effective) technical one.
[club solution]
http://diveintomark.org/archives/2002/10/29/club_vs_lojack_solutions.html
[identity]
http://www.oreillynet.com/pub/a/webservices/2002/07/09/udell.html
Comment by Jeremy Dunck — Friday, February 21, 2003 @ 12:35 am
On MSIECrawler:
I see several hits in my logs of this ‘feature’ that go to:
default.asp?pageid=2robots.txt
Draw your own conclusions…
*shakes his head sadly*
Comment by Sander — Friday, February 21, 2003 @ 3:40 am
In regard to #38 (Michael Bernstein)
>He is objecting to rude robots *unnecessarily*
>hammering his (and other people’s) server(s).
>If a user agent downloads the rss
>and then proceeds to download the permalinked
>HTML version for every post, all stylesheets,
>images, and further linked pages, what exactly
>is the point of providing an RSS feed in the
>first place?
Is RSS a “Really Simple Syndication” or an “RDF Site Summary”? If both, how do you take one random RSS feed and automatically determine which one it is? Is there some marker that says “This is just a syndication of my weblog entry” (hint: no need to actually use the HTML link), and is there one that says “This is a summary/glimpse of my entry” (hint: you probably want to read the HTML link)?
If the content is identical between the RSS feed and the HTML feed, is that not confusing, since in effect you are linking from one “page” of content to itself? I understand the concept of permalinks (giving a permanent link in a dynamically generated “ticker” for future reference), but how is a script or aggregator supposed to differentiate it given only the RSS feed?
Apologies if these seem to be rather basic or naive questions, but I’m currently experimenting with a proxy-based knowledgebase / aggregator ( http://www.isolani.co.uk/blog/agents.html ), and I’d hate to have an abusive agent. I don’t quite see how robots.txt solves the particular problem with RSS - it certainly solves the symptom Mark is seeing, but the underlying problem remains.
I want my little tool to be able to cache new pages that meet my interest profile, so I can snag an interesting article into my cache (on an IBM microdrive) while working on a laptop at home, unplug the microdrive and plug it into my Zaurus and read the article on my way to work whilst on the train. So “prefetching” a page is mandatory to be able to read it offline.
Thanks for bringing up the point that following robots.txt should be a part of tools like these, I certainly will add it to my todo list. Although some clarification on the RSS issues above would certainly help.
Comment by Isofarro — Friday, February 21, 2003 @ 6:18 am
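One rough way to approach the question raised in the comment above is to look at whether each feed item carries a `content:encoded` element (mentioned in the original post as a place where full HTML content can live) and only consider fetching the linked page when it does not. The sketch below is a heuristic under stated assumptions, not a reliable marker: a feed may put full content in `description` instead, the sample feed and `example.org` links are invented, and the code only handles RSS 2.0-style feeds whose `item` elements are not namespaced.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for the content:encoded element.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Invented sample feed: one item with full content, one with a summary only.
SAMPLE_FEED = """\
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Full post</title>
      <link>http://example.org/full</link>
      <description>Short teaser.</description>
      <content:encoded><![CDATA[<p>The whole post.</p>]]></content:encoded>
    </item>
    <item>
      <title>Summary only</title>
      <link>http://example.org/summary</link>
      <description>Short teaser.</description>
    </item>
  </channel>
</rss>
"""


def links_needing_html(feed_xml):
    """Return links of items that carry no non-empty content:encoded."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        encoded = item.find("{%s}encoded" % CONTENT_NS)
        has_full = encoded is not None and (encoded.text or "").strip() != ""
        link = item.findtext("link")
        if link and not has_full:
            links.append(link)
    return links


if __name__ == "__main__":
    print(links_needing_html(SAMPLE_FEED))
```

Even for the links this returns, a polite tool would still consult robots.txt before fetching anything.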
HTTrack ( http://www.httrack.com/ ) is another program that’s rude like this. I had an attack from some idiot using HTTrack a few weeks ago, from our national library of all places. A cursory scan of the logs seemed to indicate that it was grabbing many pages more than once. This went on for several hours.
The other fun part was that HTTrack also followed outside links, and bizarrely, left proper referral data when chasing up those links. Which had the effect of spamming people’s referrer stats with URLs from my site.
I don’t mind people pulling stuff off my pages for archiving, as seemed to be the case here as I have an ISSN, but once every month or so ought to be enough. What I do mind is pulling the same files every ten seconds because the idiotic program can’t keep track of what it’s already sucked up, throttling the server in the process because of its amnesia.
Comment by Graham — Friday, February 21, 2003 @ 8:35 am
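The bookkeeping that the comment above finds missing can be quite small. The following is a minimal sketch, assuming a hypothetical `FetchLog` class that remembers when each URL was last fetched and refuses to fetch it again before a minimum interval has passed. It says nothing about how HTTrack is actually implemented; the class name, the interval, and the URL are all made up for illustration.

```python
class FetchLog:
    """Remembers when each URL was last fetched (hypothetical helper)."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._last_fetched = {}  # url -> timestamp of the last fetch

    def should_fetch(self, url, now):
        """True if url was never fetched or the interval has elapsed."""
        last = self._last_fetched.get(url)
        return last is None or now - last >= self.min_interval

    def record(self, url, now):
        """Note that url was fetched at time now."""
        self._last_fetched[url] = now


if __name__ == "__main__":
    import time

    log = FetchLog(min_interval_seconds=30 * 24 * 3600)  # roughly monthly
    url = "http://example.org/page.html"
    if log.should_fetch(url, time.time()):
        # A real crawler would download the page here.
        log.record(url, time.time())
    print(log.should_fetch(url, time.time()))
```

A real crawler would also persist this log between runs and could use HTTP conditional requests to avoid transferring unchanged pages.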
Isofarro:
“If the content is identical between the RSS feed and the HTML feed[...] how is a script or aggregator supposed to differentiate it given only the RSS feed?”
This specific problem (identifying feeds that contain full content) could be solved once the site has been downloaded once.
Unfortunately, because NewsMonster only has a single global preference for this behaviour rather than per-channel preferences, the information wouldn’t do much good.
On the other hand, if NM had per-channel preferences for pre-fetching the HTML pages, then the user could turn it on only for those sites that do *not* provide full content. This would go a long way towards mitigating the bad behaviour NewsMonster exhibits towards most sites that do publish full content in their feed, and (presumably) would in fact be following the wishes of those site authors who do not (who want people to read their content in the context of their pages).
Ignoring robots.txt would *still* be considered rude behaviour, though, as someone who does not publish their full content in their feed may still want to disallow robot access to certain sections of their site.
If someone shoots themselves in the foot, by (a) not providing full content, and (b) disallowing all robot access, let them. Their readers will let them know soon enough.
Comment by Michael Bernstein — Friday, February 21, 2003 @ 6:10 pm
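The per-channel preference suggested in the comment above could look something like the sketch below: a global default that stays off, plus an override the user sets for individual feeds. The `PrefetchPrefs` class and the feed URLs are hypothetical, and this is not a description of NewsMonster's settings. As the comment notes, a robots.txt check would still be needed on top of it.

```python
from dataclasses import dataclass, field


@dataclass
class PrefetchPrefs:
    """Hypothetical prefetch settings: a global default plus per-feed overrides."""

    default: bool = False
    per_channel: dict = field(default_factory=dict)  # feed URL -> bool

    def set_channel(self, feed_url, enabled):
        """Turn prefetching on or off for a single feed."""
        self.per_channel[feed_url] = enabled

    def should_prefetch(self, feed_url):
        """Use the feed's own setting if present, else the global default."""
        return self.per_channel.get(feed_url, self.default)


if __name__ == "__main__":
    prefs = PrefetchPrefs()
    # The user opts in only for a feed that publishes summaries.
    prefs.set_channel("http://example.org/summaries.rss", True)
    print(prefs.should_prefetch("http://example.org/summaries.rss"))
    print(prefs.should_prefetch("http://example.org/full-content.rss"))
```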
This kind of action by NewsMonster would seem to be self-defeating as a business model. It gives you a leg up in the short run but ruins it for everyone in the long run.
Comment by bryan — Thursday, February 27, 2003 @ 10:25 am