First of all, I apologize to those of you who subscribe to my RSS feed and use web-based or browser-based news aggregators. If you checked your news page in the last 12 hours, you no doubt saw my little prank: an entire screen full of platypuses. (Please, let’s not turn this into a discussion of proper pluralization. Try to stay with me.) They’re gone from my feed now, although depending on your software you may need to delete the post in question from your local news page as well.
Now that the contrition is out of the way, let’s face facts: if this prank affected you, your software is dangerously broken. It accepts arbitrary HTML from potentially 100s of sources and blindly republishes it all on a single page on your own web server (or desktop web server). This is fundamentally dangerous.
Now, the current situation is not entirely your software’s fault. RSS, by design, is difficult to consume safely. The RSS specification allows for description elements to contain arbitrary entity-encoded HTML. While this is great for RSS publishers (who can just throw stuff together and make an RSS feed), it makes writing a safe and effective RSS consumer application exceedingly difficult. And now that RSS is moving into the mainstream, the design decisions that got it there are becoming more and more of a problem.
HTML is nasty. Arbitrary HTML can carry nasty payloads: scripts, ActiveX objects, remote image web bugs, and arbitrary CSS styles that (as you saw with my platypus prank) can take over the entire screen. Browsers protect against the worst of these payloads by having different rules for different zones. For example, pages in the general Internet are marked untrusted and may not have privileges to run ActiveX objects, but pages on your own machine or within your own intranet can. Unfortunately, the practice of republishing remote HTML locally eliminates even this minimal safeguard.
Still, dealing with arbitrary HTML is not impossible. Web-based mail systems like Hotmail and Yahoo allow users to send and receive HTML mail, and they take great pains to display it safely. It’s a lot of work, and there have been several high-profile failures over the years, but they’re coping.
Let me be clear: by design, RSS forces every single consumer application to cope with this problem.
So, to anyone who wants to write a safe RSS aggregator (or who has already written an unsafe one), I offer this advice:
- Strip script tags. This almost goes without saying. Want to see the prank I didn’t pull? More seriously, script tags can be used by unscrupulous publishers to insert pop-up ads onto your news page. Think it won’t happen? Some larger commercial publishers are already inserting text ads and banner ads into their feeds.
- Strip embed tags.
- Strip object tags.
- Strip frameset tags.
- Strip frame tags.
- Strip iframe tags.
- Strip meta tags, which can be used to hijack a page and redirect it to a remote URL.
- Strip link tags, which can be used to import additional style definitions.
- Strip style tags, for the same reason.
- Strip style attributes from every single remaining tag. My platypus prank was based entirely on a single rogue style attribute.
Alternatively, you can simply strip all but a known subset of tags. Many comment systems work this way. You’ll still need to strip style attributes though, even from the known good tags.
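The "known subset" approach can be sketched with Python's standard html.parser module. This is a minimal illustration, not a complete sanitizer: the ALLOWED table and the names AllowlistSanitizer and sanitize are made up for this example, and the sketch does not check URL schemes in href values or balance unmatched tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tag name -> attributes permitted on that tag.
# Anything not listed here (including every style attribute) is dropped.
ALLOWED = {
    "a": {"href"},
    "b": set(), "i": set(), "em": set(), "strong": set(),
    "p": set(), "code": set(), "blockquote": set(),
}
# Elements whose text content is dropped along with the tags themselves.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return
        kept = [(k, v) for k, v in attrs if k in ALLOWED[tag] and v is not None]
        rendered = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it can never be read back as markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_fragment):
    parser = AllowlistSanitizer()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.out)
```

Because attributes are also filtered through an allowlist, style attributes and event handlers such as onclick disappear without being named explicitly.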


I do allow img tags because in the RSS feeds I pull, the images are often part of the content. Guess I have to rethink it, but it doesn’t bother me at the moment.
Comment by John Beimler — Thursday, June 12, 2003 @ 11:20 am
The as-of-last-week-fresh Syndirella weekly build did not take offense.
Comment by Jesper — Thursday, June 12, 2003 @ 11:33 am
I should be refreshing the RSS Bandit release within the hour. God, I hate you. :)
Comment by Dare Obasanjo — Thursday, June 12, 2003 @ 11:35 am
Alternately:
1. Give up on Dive Into Mark, since the amount of valuable content has decreased considerably and you don’t want to be a victim of another “prank” at the expense of the readers.
And, I wasn’t even affected by the platypuses.
–Kynn
Comment by Kynn — Thursday, June 12, 2003 @ 11:35 am
As for reading rss feeds: My combination of nntp//rss ( http://www.methodize.org/nntprss/ ) and Opera’s news/mail-client, M2 ( http://www.opera.com/ ) handled this beautifully. By default the Opera mail/news client won’t fetch any external resources, like images (not even from style=”…”) or <link />, neither will it display frames/objects, nor will it “obey” meta redirects. Of course it won’t run scripts.
One thing I’m curious about, however: how should an aggregator handle embedded content with the <enclosure /> element of RSS2.0?
Comment by Arve — Thursday, June 12, 2003 @ 11:59 am
You wouldn’t by any chance be talking about this
http://diveintomark.org/archives/2002/03/04/radio_escaping_bugs.html
would you Mark?
Comment by Patrick Berry — Thursday, June 12, 2003 @ 12:00 pm
Patrick: no. That was a bug (which has long since been fixed): the publisher meant for markup to be displayed, but the consuming application interpreted it instead.
In this case, the publisher meant for markup to be interpreted, but that was not in the consumer’s best interests.
The point here is that the spec explicitly allows for this, so consumers need to look out for their own interests.
Comment by Mark — Thursday, June 12, 2003 @ 12:06 pm
Arve: I don’t know. I don’t use the enclosure element, and my homegrown news aggregator simply ignores it. Off the top of my head, I would say to look at the file extension and warn the user about opening potentially unsafe files like .exe or .doc. Like mail clients do with attachments. If the user has a virus scanner installed, maybe try to auto-scan enclosures as they come in.
Comment by Mark — Thursday, June 12, 2003 @ 12:14 pm
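The extension check suggested above could look roughly like the following. It is only a sketch: the RISKY_EXTENSIONS set holds just the two extensions named in the comment, and the function name enclosure_needs_warning is invented for illustration. A file extension is a weak signal, since a server can send any content under any name, so this would complement a virus scan rather than replace it.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: only the two extensions mentioned in the comment; a real list
# would be longer and is left to the aggregator author.
RISKY_EXTENSIONS = {".exe", ".doc"}


def enclosure_needs_warning(url):
    """Return True if the enclosure URL's path ends in a risky extension."""
    path = urlsplit(url).path  # ignores any query string or fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in RISKY_EXTENSIONS
```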
Mark, although I’m happy that you documented this exploit and that you are offering some sort of solution against it, I wonder, did you first privately alert UserLand and the other aggregator vendors affected by this exploit?
Comment by Emmanuel — Thursday, June 12, 2003 @ 12:17 pm
I don’t have time to write my own news aggregator, and I have my own reasons for using Radio UserLand, which was affected by the prank. My solution to badly behaving feeds is unsubscribing.
Comment by Gregory Graham — Thursday, June 12, 2003 @ 12:25 pm
Sensible advice for description. Does the same advice hold for content:encoded? Some news readers automatically display this instead of description.
Comment by Paul — Thursday, June 12, 2003 @ 12:38 pm
You forgot two important ones:
1) If you strip style attributes, you want to strip event handlers too.
Otherwise:
… onLoad="location.href='http://www.playboy.com'" …
2) Plus, there are the layout-breaking tags, like a closing DIV or closing TABLE.
Joe
http://www.joegrossberg.com/archives/000596.html
Comment by Joe Grossberg — Thursday, June 12, 2003 @ 12:44 pm
Great article, as usual, but I don’t really understand why you consider the inclusion of popup ads in an RSS feed as unscrupulous (as much as I hate such ads personally, and would drop such a feed from my subscription list in a hot second). One of the objections I’ve seen to including the full text of posts in a feed is that then there’s less of a reason to visit the site proper, reducing ad-impressions. Putting the ads (however obnoxious) in the feed seems like a fine workaround to me (if you’re running an ad-supported site and think (misguidedly) that popups are a good idea). If I, as a subscriber, don’t think the feed is worth the ads, then I can unsubscribe. Just a quibble though, and it doesn’t detract from your main point about potential RSS exploits.
Comment by Jim Biancolo — Thursday, June 12, 2003 @ 12:44 pm
Emmanuel: the issue has been raised many times in many forums. See, for instance:
http://www.intertwingly.net/blog/940.html
http://feeds.archive.org/validator/docs/warning/ContainsScript.html
http://webservices.xml.com/pub/a/ws/2002/11/19/rssfeedquality.html?page=2
http://www.securiteam.com/unixfocus/6L00H205PY.html
http://project.antville.org/stories/200348/
http://www.peerfear.org/rss/permalink/1028943207.shtml
http://diveintomark.org/archives/2002/10/10/more_on_evolvable_formats.html
http://philringnalda.com/blog/2002/04/thinking_about_rss.php
http://groups.yahoo.com/group/radio-userland/message/9965
http://radio.weblogs.com/0100887/categories/rss/2002/05/23.html#a265
http://vyom.org/cat_internet/rss_security_vulnerabilities.php
A Google search for “rss strip html tags” will turn up dozens more.
Comment by Mark — Thursday, June 12, 2003 @ 12:54 pm
Speaking of exploits, Torsten copied a feature from SharpReader where it automatically jumps to the Web Page if there is a link and no description which I had misgivings about but now realize is a security issue.
*sigh*
PS: When is your blog going to support the CommentAPI? I hate having to switch modes to post to your blog. :(
Comment by Dare Obasanjo — Thursday, June 12, 2003 @ 12:55 pm
Oh, and let’s not forget browser specific-bugs, such as doing:
[input type foo]
somewhere in your HTML.
A malicious person could screw you over and, if you didn’t know about that bug previously, it’d take hours to figure out why it was happening.
Joe
http://www.joegrossberg.com/archives/000623.html
Comment by Joe Grossberg — Thursday, June 12, 2003 @ 12:59 pm
And you should use regular expressions to remove them.
/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/i
/<\/?!?(param|link|meta|doctype|div|font)[^>]*>/i
/(class|style|id)="[^"]*"/i
Use these to remove them.
Comment by Mattias — Thursday, June 12, 2003 @ 1:06 pm
Thanks Mark for the reply.
Then, wouldn’t it have been nicer to set up a distinct RSS feed just to show this exploit?
Comment by Emmanuel — Thursday, June 12, 2003 @ 1:09 pm
One more tag to be wary of: <body>. When IE encounters a <body onload> inside the main <body>-section, it will execute that script as if it was on the outer-<body>.
I’m not sure how other browsers handle this…
Comment by Luke Hutteman — Thursday, June 12, 2003 @ 1:21 pm
Mark, how many comments and emails do you get on a regular basis exclaiming how you have, in some way or another, wronged a valuable member and they’re never coming back?
I don’t do RSS, so wasn’t affected this time, but part of what makes this site enjoyable is that you like to play with technology. This is done by pushing and tinkering, and the results make for a bumpier, far more fascinating ride.
So my vote in this unabashed dictatorship is to keep it up.
Comment by ardenstone — Thursday, June 12, 2003 @ 1:26 pm
Luke: holy crap! I never knew that. No wonder most bulletin board systems just allow a known subset of tags and attributes.
Comment by Mark — Thursday, June 12, 2003 @ 1:29 pm
It’s Mark’s site, he can do as he wishes. Everyone is free to start a diveinto$name.org site, if they wish 8)
BTW, nice hack.
Comment by Bryan L. Fordham — Thursday, June 12, 2003 @ 1:33 pm
ardenstone: I received a few emails asking what was going on, and whether I was aware of it, but I apologized to each of those people individually and I’ve received no replies suggesting that I lost subscribers over it. There have been a smattering of negative comments here, and one reader posting about it on his own weblog, but I think most people saw the humor in it.
Of course, I could be wrong. Time will tell whether my readership numbers suffer for it. Luckily, I’m not in this for the numbers.
Comment by Mark — Thursday, June 12, 2003 @ 1:35 pm
Mark:
Even if you didn’t do your image hack/prank, you’d still have people complaining that you’ve provided “hackers” with a cookbook for messing up others’ RSS aggregators.
An actual image of a Plat. is much more likely to get noticed, of course. And it’s not like you did something truly malicious to drive the point home.
So keep up the good work, and please keep informing people of these security loopholes.
Thanks,
Joe
Comment by Joe Grossberg — Thursday, June 12, 2003 @ 1:52 pm
Um, by that I meant that it didn’t work in it. I suck at forming phrases that can only be interpreted one way (well, one non-sexual way).
Comment by Jesper — Thursday, June 12, 2003 @ 2:41 pm
Jesper: “suck” is probably not the verb you want to use to apologize for double entendres. :D
Comment by Mark — Thursday, June 12, 2003 @ 3:12 pm
I’m working in PHP and have been using strip_tags() to limit tags for both outgoing and incoming for ages. But I got caught out by My Platypus. Guess I’ll have to add some regex to catch that one.
What I do have problems with is unmatched starting and ending tags. If anyone’s got some php code to auto-close unmatched tags I’d really appreciate it.
julian_bond at voidstar.com
ps. I do think it was a bit mean of Mark to dump this on us. But I’ll forgive him. This time.
Comment by Julian Bond — Thursday, June 12, 2003 @ 3:28 pm
Thanks Mark, now I get to spend the day up to my neck in regex-hell :)
Here is what I have so far for an algorithm:
1. Escape every < into &lt;
2. Only convert &lt; back into < if it is followed by an acceptable tag type: (a|b|img|i|em|strong|code|p|div).
3. Remove all style attributes.
After having done all that should I still strip onload, onclick, etc attributes?
Comment by Joe — Thursday, June 12, 2003 @ 3:37 pm
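The three steps in the comment above can be sketched in Python as follows. The tag list is the one from the comment; the function name escape_then_restore and both regular expressions are illustrative assumptions. To answer the closing question: yes, event-handler attributes still need stripping after these steps, so this sketch removes on* attributes in the same pass as style.

```python
import re

ALLOWED_TAGS = "a|b|img|i|em|strong|code|p|div"  # the list from the comment

# An escaped "<", an optional "/", an allowed tag name, then a word boundary
# (so "&lt;body" is not mistaken for "&lt;b").
RESTORE = re.compile(r"&lt;(/?(?:%s)\b)" % ALLOWED_TAGS, re.IGNORECASE)

# Quoted style="..." and on*="..." attributes. Unquoted values are NOT
# handled here, which is one reason a real parser is the safer choice.
BAD_ATTR = re.compile(
    r"""\s+(?:style|on\w+)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)


def escape_then_restore(text):
    text = text.replace("<", "&lt;")  # step 1: escape every "<"
    text = RESTORE.sub(r"<\1", text)  # step 2: restore allowed tags only
    text = BAD_ATTR.sub("", text)     # step 3: drop style and on* attributes
    return text
```

Two limitations are worth noting. Text that arrived already containing a literal &lt;b (meant to be displayed, not interpreted) gets turned into a real tag by step 2, and step 3 also removes matching strings from ordinary text, not just from inside tags.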
BottomFeeder - http://www.cincomsmalltalk.com/BottomFeeder - is immune to this kind of prank. Not to mention cross platform….
Comment by James Robertson — Thursday, June 12, 2003 @ 4:34 pm
I’d be surprised if this problem can be solved properly using regular expressions - for example, the example regexps pasted in above would miss out on tags that don’t have a closing tag and unquoted attributes. I know from experience ( http://simon.incutio.com/archive/2003/02/23/safeHtmlChecker ) that there are a huge number of HTML “tricks” for causing problems, especially if your browser is IE (which is renowned for accepting pretty much any garbage markup).
To be truly safe, you need to use a proper HTML parser to pre-process the markup. Even worse, the parser can’t just be a standard HTML parser - it will need to closely match the parser of the eventual consuming browser (generally IE) as otherwise it could miss stuff that IE will still process.
It’s a very nasty problem.
Comment by Simon Willison — Thursday, June 12, 2003 @ 4:53 pm
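The point about unquoted attributes is easy to demonstrate. The snippet below applies the attribute pattern posted earlier in this thread (written with straight quotes) to three variants of the same markup; the sample strings are invented for illustration.

```python
import re

# The attribute-stripping pattern from the earlier comment, straight quotes.
ATTR = re.compile(r'(class|style|id)="[^"]*"', re.IGNORECASE)

quoted = '<p style="position:fixed">x</p>'
unquoted = "<p style=position:fixed>x</p>"
single = "<p style='position:fixed'>x</p>"

stripped_quoted = ATTR.sub("", quoted)      # attribute removed
stripped_unquoted = ATTR.sub("", unquoted)  # unchanged: no quotes to match
stripped_single = ATTR.sub("", single)      # unchanged: single quotes

print(stripped_quoted)
print(stripped_unquoted)
print(stripped_single)
```

Only the double-quoted form is caught. A browser treats all three as a style attribute, so the other two pass through with the style intact.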
Re: Emmanuel’s The Devil’s Feed from #19: if you don’t do it, I will.
Comment by Phil Ringnalda — Thursday, June 12, 2003 @ 5:40 pm
This was one of the reasons I’d asked for a subset of acceptable HTML tags for use within RSS items. I’m less interested in an ever-growing list of what to avoid. I’d rather see a short list of what’s allowed. At least within the limited context of an RSS item’s title and description elements.
Comment by Bill Kearney — Thursday, June 12, 2003 @ 5:44 pm
I’m thinking that a test suite for RSS aggregator authors would be very, very nice…
Comment by jacob — Thursday, June 12, 2003 @ 7:10 pm
Mark: Thanks for bringing this up at a moment when I could actually do something about it.
There are limits to what I can strip from posts by default; people are posting blog entries as well as comments, after all. But I’ve got a “strip tags” toggle in there for just these kinds of situations, and your prank got me to update it to kill inline styles and event handlers.
Comment by Roger Benningfield — Thursday, June 12, 2003 @ 7:14 pm
Mark: Uh-oh. Well, to rephrase that, I blow at forming phra… Oh crap.
Comment by Jesper — Thursday, June 12, 2003 @ 8:00 pm
Hi Mark…
thanks for the education.
woa, this really messed up my Radioland aggregator. I quickly figured out where it came from:
http://diveintomark.org/archives/2003/06/11/in_brief_independent_reality_edition.html
and I was planning to hunt the miscreant down who hacked my page…. phew… just a prank. You had me ready to wipe my hard drive and get new firewalls.
I knew xml was full of security holes. I had just read an article about it, but I just didn’t expect it to be demonstrated on my computer… thanks for the education and please,
I hope no one else follows suit… okay, Mark, I hope no hackers are reading your very informative weblog.
this could be the end of open aggregators…
Comment by M. Ford — Thursday, June 12, 2003 @ 8:01 pm
M. Ford,
How does the classic flaw of trusting arbitrary user input translate to “xml was full of security holes”?
I’m also extremely curious about what kind of article could be written about security flaws in XML that wouldn’t just be more noise about the billion laughs exploit via recursive entities.
Do you have a link to this article?
Comment by Dare Obasanjo — Thursday, June 12, 2003 @ 8:12 pm
XML isn’t a program or even a language… it’s a set of language rules to build your own markup language with… I guess you meant RSS.
Comment by Jesper — Thursday, June 12, 2003 @ 8:21 pm
Interesting… I have Radio Userland, and I never got the platypusen. My install doesn’t display the content:encoded, only the description.
Comment by Matthew Ernest — Thursday, June 12, 2003 @ 9:54 pm
Hmm.. well, I can think of one way an XML mechanism can open up these security holes, Dare. Surprisingly, it won’t affect IE. [div xmlns:h="http://www.w3.org/1999/xhtml"] [h:object …
Any browser that understands the xhtml namespace (say Mozilla) will use the object, and most parsers which are stripping out specific items, as per Mark’s list above won’t match h:object as opposed to object.
Comment by Lach — Thursday, June 12, 2003 @ 10:30 pm
Mark - have you given any thought to publishing an RSS feed of “test cases” of exploits? It may be useful to the aggregator/consumer developers. Or is there one already available elsewhere.
As a reply to 26 - about “people complaining that you’ve provided ‘hackers’ with a cookbook for messing up others’ RSS aggregators”, the simple explanation is that security through obscurity is no security at all. As evidenced by a rather large software company.
Comment by Isofarro — Friday, June 13, 2003 @ 4:37 am
The only major problem here is the tag and *maybe* the embed and object tag.
NewsMonster will strip … even if it didn’t it really isn’t a big deal.
Do something stupid, get unsubscribed :)
Script is the big issue because I can’t think of a single good reason (worth keeping around) to have a script in RSS but there are tons of DoS attacks and potential security issues.
Kevin
Comment by Kevin Burton — Friday, June 13, 2003 @ 6:00 am
Another note… I requested that this be added to the RSS 1.0 spec a long time ago as an implementation note.
http://www.peerfear.org/rss/permalink/1025904432.shtml
The group decided to veto the proposal. Obviously I wasn’t happy :)
Kevin
Comment by Kevin Burton — Friday, June 13, 2003 @ 6:06 am
Do you strip the word ‘javascript’ ???
Looks like you did after I read my previous comments…
Lets see… this is a test post
javascript
Comment by Kevin Burton — Friday, June 13, 2003 @ 6:12 am
Wow… that is strange… Didn’t strip javascript from my previous post but it looks like it was stripped from the one above that….
Anyway.
Comment by Kevin Burton — Friday, June 13, 2003 @ 6:13 am
Kevin, it strips all tags. If you want to make a tag show, use &lt;tagname&gt;.
Comment by Jesper — Friday, June 13, 2003 @ 7:35 am
Meanwhile back in the real world…
There are a lot of feeds out there that blindly copy out the bad html that has been entered by someone else in a comments box. Either that or they include all the formatting used on the blog itself. To avoid the item being too long they then chop after N characters and add “…”. The end result of this is <description> containing not malicious but annoying tags like <font and <table and because this isn’t cleaned up before chopping these are often unmatched. I think all this is much more of a problem than the rare occasions where someone deliberately tries an exploit, dangerous though that might be.
So I can strip_tags() selectively, and use some simple regex to get rid of the worst of the tag attributes. But I’ve still got to build an HTML tidy to catch the unmatched tags.
It’s enough to make me want to exclude everything except <a href, <img and I’m not too sure about those either.
It’s easy to say just ignore the feed and unsub. But some of the most worthwhile blogs are some of the worst offenders. (Are you listening, Doc? Howard? David?)
Comment by Julian Bond — Friday, June 13, 2003 @ 8:16 am
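For the unmatched-tag problem described above, one possible approach is to track open elements on a stack and close whatever is left over. The following Python sketch uses the standard html.parser module; the names TagBalancer and balance and the VOID set are assumptions for this example. It is meant to run on markup that has already been filtered for unwanted tags, and it makes no promise about how a tag cut off in the middle (for example a truncated attribute) is handled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not pushed on the stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag such as a lone </div>: drop it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(fragment):
    parser = TagBalancer()
    parser.feed(fragment)
    parser.close()
    # Close whatever is still open at the end of the (possibly chopped) text.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

A description chopped after an opening table, row, cell and font tag comes back with all four closed in reverse order, and a stray closing div that would otherwise break the aggregator's own layout is dropped.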
Mark, I recommend that you change your “list of things to remove” to say “remove/defang everything except the following tags…” You should also include a list of acceptable attributes for each of these tags.
The approach you’re recommending at the moment is inherently futile, as some people have pointed out in the comments above.
Comment by Már Örlygsson — Friday, June 13, 2003 @ 10:13 am
Lach,
“Any browser that understands the xhtml namespace (say Mozilla) will use the object, and most parsers which are stripping out specific items, as per Mark’s list above won’t match h:object as opposed to object.”
a.) RSS Bandit does.
b.) This still doesn’t sound like an “XML hack” to me but a case of blindly trusting arbitrary user input which has been a known security flaw for decades.
Comment by Dare Obasanjo — Friday, June 13, 2003 @ 10:34 am
Stepping back from the details for a moment, it would seem to me that the right long-term approach is to make sure that the commonly used rendering components (the IE COM thingee, Gecko, etc) expose programming APIs that let you specify the trust level when making a rendering request. My claim: we should be able to achieve better security with a handful of widely used rendering components than with fifty tag strippers.
Comment by Andrew Grumet — Friday, June 13, 2003 @ 2:52 pm
Alternatively and in the same train of thought, users of browser-based aggregators could be directed to set a low trust level for the URL(s) where they read aggregated content.
Comment by Andrew Grumet — Friday, June 13, 2003 @ 3:01 pm
Andrew: browsers already have a trust level; IE calls it “zones”. Different zones have different default privileges, and can be tweaked independently. I mentioned this in my article. The problem with desktop-browser-based RSS readers is that they subvert this system by taking content from remote zones (by default, the least trusted) and blindly republishing it in the local zone (by default, the most trusted).
This problem is not specific to RSS as a data format; any program that took any kind of potentially harmful remote content and republished it on a local web server would have the same problem. But lots of RSS readers do exactly this, so I’m focusing on them.
The more general problem of “trusting arbitrary user input” has, as others have correctly noted, been a problem for decades. This is just the latest specific case that I care enough to write about.
Comment by Mark — Friday, June 13, 2003 @ 3:20 pm
Mark: “The problem with desktop-browser-based RSS readers is that they subvert this system by taking content from remote zones (by default, the least trusted) and blindly republishing it in the local zone (by default, the most trusted).” Exactly. Do you think it would be easy/hard/impossible to modify these desktop-browser-based readers so that they republish in a non-trusted zone?
Comment by Andrew Grumet — Friday, June 13, 2003 @ 3:35 pm
To illustrate what I’m getting at, consider a program that could register with the OS that “http://127.0.0.1:5335/path/to/aggregator“ should be considered untrustworthy when rendering.
Comment by Andrew Grumet — Friday, June 13, 2003 @ 3:45 pm
To Dare: No, I read it, forgot about it, until now, but I’ll go back and try to track back to the article on the web… it was about xml (not as a language) and exploits… I’m pretty sure it was a few steps away from dailyrotation.com’s many feeds… or was it a few steps away from a subscribed feed…
I turn javascript off when I read from Radioland’s aggregator … and always have. But I still got the blacked out link to the platypus jpg at the top of the aggregator page… and that’s scary…
To Dare and all, I know next to nothing about programming languages compared to you… … thanks for the education
Comment by M. Ford — Friday, June 13, 2003 @ 4:10 pm
Lach,
Unless I’m greatly mistaken, your proposed exploit will only work if the aggregator spits out the page with an XML content-type; in text/html Mozilla should just chew it into tag-soup (no namespace handling for that content-type). So it’s possible, but not terribly likely.
Comment by Chris Hoess — Friday, June 13, 2003 @ 7:53 pm
Interesting. I feel left out - my aggregator only displayed one platypus. But I do a lot of tag stripping and I also clip descriptions at an arbitrary length. As others have mentioned, a test suite would be enormously useful, as would suggestions for safely dealing with links and images (since I want to keep these displayed).
Comment by Stephen Downes — Friday, June 13, 2003 @ 11:23 pm
Chris: Hmm… does that mean I’ve just come up with an argument for sticking to html in aggregators? Or just for ignoring the ‘should’ mime-type of xhtml?
Dare: I’d never dream of calling that a hole in XML itself, but M. Ford’s comment and your reply made me think of using that as a way around tag strippers, since I’d imagine most would just use simple regex’s. It’s not too hard to fix your regex, of course, you just have to strip <.+:tagname as well as <tagname, but I thought it was worth pointing out for people who are stripping tags.
Comment by Lach — Saturday, June 14, 2003 @ 2:01 am
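If tag names come from a parser rather than a regex, the namespace-prefix trick can be handled by comparing only the local part of the name. A small Python sketch, where BLOCKED mirrors the tags listed in the article and the function names are made up for illustration:

```python
# Tags from the article's strip list.
BLOCKED = {
    "script", "embed", "object", "frameset", "frame", "iframe",
    "meta", "link", "style",
}


def local_name(tag):
    """Drop any namespace prefix: 'h:object' -> 'object'."""
    return tag.rsplit(":", 1)[-1].lower()


def is_blocked(tag):
    return local_name(tag) in BLOCKED
```

This treats every prefixed name as if it were in the XHTML namespace, which is deliberately conservative: a prefixed element from some other vocabulary gets stripped too.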
/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/i
This regex could be improved:
running it on the HTML of this very page, it caught the first <script type=”text/javascript” src=”/js/common.js”></script>, the following few tags and the next <script>…</script> block… :(
Moreover, you can reduce the number of alternatives in the first part of the regex, which speeds things up.
<((no)?script|object|embed|style|frameset|i?frame)[^>]*>[^<]*<\/\1>
> I’d be surprised if this problem can be solved properly using regular expressions - for example, the examples regexps pasted in above would miss out on tags that don’t have a closing tag and unquoted attributes…
I am pretty sure it CAN be fixed just with regexes (be a long list), even if it might require a lot of firepower — making it in the end more trouble than necessary. Remember Jeff Friedl’s 6KB regex to validate an email address?
Most probably a mixture of regexes and code should do the trick.
And there is no such thing as regex-hell… only a regex-paradise ;-)
Comment by dda — Saturday, June 14, 2003 @ 2:29 am
> Most probably a mixture of regexes and code should do the trick
The regex you gave is a good reason that it isn’t easy to use regexes to parse html.
You forgot to take into account that html may have optional spaces before the tag name and that you can’t rely on the closing tag to not have either white space or attributes (of course that’s invalid, but browsers would accept it and it would sneak past your regexp). Also you could have an attribute that contained a non-escaped ‘>’.
You really need to use a proper HTML parser. If you’re a perl user, you may find HTML::TagFilter handy.
Comment by Gavin Estey — Saturday, June 14, 2003 @ 6:34 pm
Also, be sure to restrict the URLs of images, links, etc. For Mozilla, you must disallow links to javascript: and data: URLs. For IE and NS4, I think there are a few synonyms for javascript: you also have to disallow.
Comment by Jesse Ruderman — Saturday, June 14, 2003 @ 11:22 pm
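Rather than listing every dangerous scheme and its synonyms, an aggregator can accept only the schemes it knows to be harmless. A minimal Python sketch using urllib.parse; the SAFE_SCHEMES set and the name is_safe_url are assumptions for this example, not a recommendation of exactly which schemes to permit.

```python
from urllib.parse import urlsplit

SAFE_SCHEMES = {"http", "https"}  # assumption: extend as needed

# Characters from NUL through space, which browsers ignore at either end.
_EDGE_JUNK = "".join(map(chr, range(0x21)))


def is_safe_url(url):
    # Browsers skip leading control characters and embedded tabs and
    # newlines when reading a URL, so remove them before looking at the
    # scheme.
    cleaned = url.strip(_EDGE_JUNK)
    cleaned = cleaned.replace("\t", "").replace("\r", "").replace("\n", "")
    scheme = urlsplit(cleaned).scheme.lower()
    # An empty scheme means a relative URL, which inherits the page's scheme.
    return scheme == "" or scheme in SAFE_SCHEMES
```

This would be applied to href and src values after the tags themselves have been filtered. Anything that fails the check is best dropped along with its attribute.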
Gavin, you’re right. It is not easy to do it properly with just regexes (and for the record, “my” regex wasn’t aiming at foolproof parsing: as I said, we’ll need code, ie ‘intelligence’, on top of the regexes to catch everything…).
The optional spaces are caught easily enough (\s*), but that has a tendency to dramatically slow down NFA engines, which is what most of us will use (Perl, PHP, Python…).
Since I am a PHP guy, I guess I’ll have to hack my own HTML Parser (Not!).
Re-reading the comments, the lists of things to disable is so long, I think it’s a better (as in “I am lazy”) idea to disable everything and re-enable only a selected few :(
Comment by dda — Sunday, June 15, 2003 @ 3:48 am
I’ve only used Javascript in a blog entry once, but it was with good reason: OCN-linking a movie file.
OCN is Open Content Network, a way of swarming downloads of popular web-served files. (Basically a web-based BitTorrent that doesn’t require all that tedious mucking about with trackers and seeders). See:
http://www.open-content.net/
The method of publishing stuff on OCN is to provide a normal link but include some Javascript on the page that does automatic detection/install of the OCN client and translates the link to an OCN download. Admittedly, using an interstitial download page would solve this problem.
However, I think that Andrew’s comment (#62) is easily the best solution. Moreover, it could also solve a stupid problem common to many RSS readers: not setting the BASE of the HTML correctly so that relative links work.
Comment by Yoz — Sunday, June 15, 2003 @ 4:52 am
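On the relative-link problem mentioned above: when an aggregator cannot set a BASE for each item, it can instead rewrite each href and src against the item's own URL before display. A short Python sketch using urllib.parse.urljoin; treating the item's link as the base is an assumption here, and the example URLs are invented.

```python
from urllib.parse import urljoin


def absolutize(item_link, href):
    """Resolve a possibly relative href against the item's permalink."""
    return urljoin(item_link, href)


base = "http://example.org/archives/2003/06/12/post.html"
print(absolutize(base, "../images/platypus.jpg"))
print(absolutize(base, "/css/site.css"))
print(absolutize(base, "http://other.example/x"))
```

Relative paths resolve against the item's directory, root-relative paths against the site root, and absolute URLs pass through unchanged.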
Lately, I have moved away from rss readers. Reading a site on an RSS aggregator doesn’t feel like the real thing somehow.
Comment by anand — Sunday, June 15, 2003 @ 7:02 am