You have been directed to this page because you have blogged about, submitted a bug report about, or otherwise mentioned the RSS validator at feeds.archive.org and claimed that “UserLand’s validator says my feed is fine!”
Please don’t use UserLand’s validator. At all. It hasn’t been updated since the days of RSS 0.91, and it was never very helpful to begin with. It even passes non-well-formed XML in many cases!
Example #1: feeds with HTML entities; valid? UserLand’s says yes, but ours says no. We’re right, they’re wrong.
Example #2: feeds with high-bit characters (like curly quotes); valid? UserLand says yes, but we say no. We’re right, they’re wrong.
Even when they’re right, UserLand’s validator has cryptic, unhelpful error messages. It even admits this as it’s giving you the message. Example #3: Missing a link? UserLand says: Can’t get the address of “link” because the table doesn’t have an object with that name. We say: missing channel element: link, with a link to more information.
As an added bonus, our validator is open source and comes with over 350 test cases. UserLand’s validator is closed source and comes with none.
This isn’t better because someone says it’s better, or because there’s a conspiracy to lock Dave Winer in a trunk. It’s better because it’s better. You don’t believe me, so I suggest you go out to bat with this and see how you get on.
§
“Example #2: feeds with high-bit characters.” I don’t wanna be pedantic (actually that’s a lie, I love being pedantic), but the problem isn’t that there’s a high-bit character, it’s that the byte 0x93 is not a valid UTF-8 encoding of anything. If you change it to, for example, « (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK), your validator is fine with it.
— Tim Bray
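Tim's point can be checked directly. The following is a minimal sketch in Python (the helper name `is_valid_utf8` and the sample strings are made up for illustration): the bytes 0x93 and 0x94 are curly quotes in Windows-1252, but as bare bytes they do not form a valid UTF-8 sequence, whereas a properly encoded « does.

```python
# 0x93 / 0x94 are the Windows-1252 bytes for left and right curly quotes.
# On their own they are continuation bytes with no lead byte, so they are
# not valid UTF-8.
raw = b"\x93quoted\x94"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw))                         # False
print(is_valid_utf8("«quoted»".encode("utf-8")))  # True

# The same bytes are perfectly meaningful once the right encoding is named.
print(raw.decode("windows-1252"))                 # “quoted”
```

So the character itself is not the problem; the mismatch between the bytes and the encoding the parser assumes is.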
Whatever you want to call it, the document is not well-formed XML, but UserLand’s validator “passes” it anyway, despite the fact that the third line of the RSS 2.0 spec states that *all* RSS feeds are required to be well-formed XML.
— Mark
My feed at http://crazedmonkey.com/blog/index.rss used to validate but now it doesn’t. The only error message I can coax out of the validator is “This feed does not validate as RSS”, which isn’t very helpful. I’m assuming that this is a bug with the validator, otherwise it would have given me a more useful error message.
— ian
The offending line appears to be “CPAC Canada’s Political Channel – Cha&iuml;ne politique du Canada">Parliamentary channel</a>”. I’m assuming that it doesn’t like “&iuml;”
— Aquarion
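Whether or not this turns out to be what broke ian's feed, a named HTML entity such as &iuml; is a well-formedness error in an XML document that declares no DTD. A small sketch using Python's standard-library parser (the `well_formed` helper and the sample titles are illustrative, not taken from the validator's source):

```python
import xml.etree.ElementTree as ET


def well_formed(xml_text: str) -> bool:
    """Return True if a conforming XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# &iuml; is defined by HTML, not by XML, so the parser rejects it.
print(well_formed("<title>Cha&iuml;ne politique</title>"))  # False

# The numeric character reference for the same character is always allowed.
print(well_formed("<title>Cha&#239;ne politique</title>"))  # True
```

XML predefines only five named entities (amp, lt, gt, quot, apos); anything else has to be declared or written as a numeric reference or a literal character.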
ian: it’s a bug; if there’s something wrong, it should always tell you what it is. In this case, the XML parser is complaining about a UnicodeError, which probably means you copied-and-pasted a high-bit character (like a curly quote or apostrophe) into your blog entry and your blog software doesn’t transform that into the proper numeric equivalent. I’m pretty sure it’s in this blog entry:
http://crazedmonkey.com/blog/politics/alternate_reality_canadian_right2.html
but I’m having trouble pinpointing exactly where. (You’d think computers would be good at this sort of precision.)
I’m updating the RSS validator to make sure it always outputs detailed information about the error.
Thanks.
— Mark
Dave Winer is funky. (meant as humour, not flame)
Is this open warfare between you and Dave now? From your recent posts, Mark, you seem to be sticking the boot in a fair bit. From the outside perspective, Dave seems to have started turning the other cheek (which, imho, is something he should have done a while ago). However, I can’t help but feel that Dave is winding you up via private email.
Or am I blowing things up out of all proportion?
okay, I’m confused. In my own feed, I put html inside the tags. The only thing I change from “normal” html is that I change the to < and >, respectively, and that I change ” to ". It doesn’t validate, but it works using my RSS reader.
Obviously, it’s not correct, so what steps do I have to take to make html RSS-ready?
— LKM
okay, I’m confused. In my own feed, I put html inside the description tags. The only thing I change from “normal” html is that I change the left brackets and right brackets to an lt-entities and a gt-entities, respectively, and that I change quotes to a quot-entity. It doesn’t validate, but it works using my RSS reader.
Obviously, it’s not correct, so what steps do I have to take to make html RSS-ready?
— LKM
LKM: putting escaped HTML in [description] is permitted by the RSS 2.0 spec. It’s possible you are not escaping it properly. Are you converting ampersands as well as left and right brackets?
An example URL would help.
— Mark
Hm, ampersands could be a problem. I guess I need to convert them before I convert the other entities, right? So I just convert all ampersands to amp-entities before I convert the other entities?
(the url is http://cube.lkmc.ch/rss.php. I’ve already sent a mail to the feeds-archive-talk-list today, which I suspect might be partially responsible for this blog entry, since I mentioned using the UserLand validator)
— LKM
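The ordering question matters, and a short sketch shows why. This is Python, with hypothetical function names and a made-up HTML snippet; `xml.sax.saxutils.escape` is the standard-library helper that does the same job in the right order.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_naive(html: str) -> str:
    # Wrong order: escaping brackets first means the ampersand pass
    # re-escapes the entities that were just created.
    return html.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")


def escape_for_description(html: str) -> str:
    # Right order: ampersands first, then the brackets.
    return html.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


snippet = '<a href="/?a=1&b=2">Tom & Jerry</a>'

print(escape_naive(snippet))            # double-escaped: starts with &amp;lt;
print(escape_for_description(snippet))  # starts with &lt;a href=

# Wrapped in an element and parsed back, the correctly escaped text
# round-trips to the original HTML.
item = "<description>" + escape_for_description(snippet) + "</description>"
print(ET.fromstring(item).text == snippet)
```

Quotes only need escaping inside attribute values, so leaving them alone in element content is fine.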
Mark, I wasn’t trying to say the Userland validator was better than yours. I admitted yours was best. You didn’t have to do this.
— Randy
My RSS2 feed gives an error, with your (and Sam Ruby’s) validator, because of the ‘ç’ character in one of my post titles. I write mostly in English and a small (teensy tiny) amount of French once in a while. After the MT rebuild I edit by hand to get around the problem. A ‘Ç’ character does not exhibit this invalidation.
I know next to nothing about RSS, and I wouldn’t expect ‘foreign’ characters to cause invalidation. The Userland RSS validator says everything’s fine; I’m not excusing ‘bad validation’ on their part, which is something I cannot judge. So what gives?
ian: the fix is deployed and the documentation has been updated with a bit more information. It’s still a difficult error to diagnose and fix, but the error is real and now it’s reported as accurately as possible.
Whether you care to fix it or not is, of course, a separate question. ;)
— Mark
Gummi: your problem is not RSS-specific, any XML format would have the same problem. Characters like that should be converted to numeric equivalents in order to be included in an XML document like an RSS feed.
For example, the cedilla character you mention (ç) should be converted to &#231;.
I have the same problem on my Spanish blog ( http://es.diveintomark.org/ , http://es.diveintomark.org/index.rss ). Turns out there’s a relatively standard Perl library called HTML::Entities which includes a function to do this conversion for you! (I just learned this a few days ago. The function is called HTML::Entities::encode_entities_numeric. I hacked in support for this into Blosxom’s RSS generator and now my feed validates.)
Bottom line: the error is real. Your weblogging software should be converting that cedilla into the appropriate numeric entity for you when it generates your RSS feed. Movable Type is not very good about this, and neither are many other publishing systems. It’s a common problem, one of the most common problems when working with XML.
— Mark
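For readers not using Perl, here is a rough Python equivalent of that conversion, offered as a sketch rather than as what any particular weblog tool does. The function name `to_numeric_refs` is made up; the `xmlcharrefreplace` error handler is part of Python's standard codec machinery.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Note: this does not escape &, < or >; that is a separate step.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("Ça va, garçon"))  # &#199;a va, gar&#231;on
```

The result is pure ASCII, so it is well-formed regardless of which encoding the feed declares or fails to declare.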
Mark, that’s simply not true. XML is Unicode, and you can enter all Unicode characters directly into it; the numeric entities are not necessary. If you enter in the numeric entities, there’s a pretty good chance that they’ll be serialized as the UTF8 characters directly — the XML Infoset makes no distinction between the numeric entity and the actual character.
If your validator isn’t accepting legitimate Unicode characters, it’s broken.
(But, to be perfectly clear, your validator is NOT broken, and it does accept valid UTF8 characters. I don’t know why you and others are having problems, but my own feed validates fine.)
ç = ç
Thanks for the info Mark! I suppose an effort on my part would have found the answer but I was just too lazy, or it may be inertia.
Mark: What’s up with your obsession with Dave these days, eh? I guess this recent RSS FUD campaign (here and elsewhere) must really bump Dave’s Google-Ranking :-)
Mike K,
Not quite. XML *can* be encoded in utf-8 or utf-16, but those aren’t the only encodings allowed. Yes, an XML processor *must* accept utf-8 and utf-16, but others can be used, see:
http://www.w3.org/TR/REC-xml#charencoding
You can use any encoding, such as ISO-8859-1, ISO-2022-JP, or any of the other encodings listed in:
http://www.iana.org/assignments/character-sets
— Joe
I am developing a blog that currently passes validation on your RSS validator (http://www.lznet.com/rss.xml – in development). This RSS file contains unencoded cedilla chars (ç) but it starts with an XML declaration with encoding set to UTF-8.
Using this declaration, and feeding UTF-8 encoded data, as far as I can understand, I am not required to further encode these characters.
— Claude
Many more feeds would validate as XML (and therefore be of far greater utility to those feed readers that are based on XML parsers) if people would simply declare the encoding that they are using.
A very popular encoding is iso-8859-1. If you are prone to cutting and pasting from Microsoft Word, you might consider windows-1252.
For more options and details, see http://www.intertwingly.net/blog/832.html
— Sam Ruby
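The effect of declaring the encoding can be seen with any conforming parser. A minimal sketch with Python's standard library (the three sample documents are invented for illustration): the same ISO-8859-1 bytes are rejected when no encoding is declared, because the parser must then assume UTF-8, and accepted once the declaration matches the bytes.

```python
import xml.etree.ElementTree as ET

latin1_bytes = "garçon".encode("iso-8859-1")  # ç becomes the single byte 0xE7

# No encoding declared: the parser assumes UTF-8 and 0xE7 is malformed.
undeclared = b"<?xml version='1.0'?><title>" + latin1_bytes + b"</title>"

# Encoding declared and matching the actual bytes.
declared = (b"<?xml version='1.0' encoding='iso-8859-1'?><title>"
            + latin1_bytes + b"</title>")

# UTF-8 bytes with a UTF-8 declaration (Claude's situation).
utf8 = (b"<?xml version='1.0' encoding='utf-8'?><title>"
        + "garçon".encode("utf-8") + b"</title>")

try:
    ET.fromstring(undeclared)
    print("undeclared: parsed")
except ET.ParseError as err:
    print("undeclared: not well-formed:", err)

print(ET.fromstring(declared).text)  # garçon
print(ET.fromstring(utf8).text)      # garçon
```

Either approach works: literal characters with a correct declaration, or numeric references with any declaration. What fails is literal non-UTF-8 bytes with no declaration at all.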
Thanks, Mark. I see the new error message, although I’m still somewhat puzzled. The character it’s pointing to is definitely low-bit and viewing the RSS through a hex dump confirms this. I also tried grepping the hexdump output for possible high-bit values and came up with nothing:
hexdump index.rss | egrep " ([0-7][0-9a-f])?[8-9a-f]"
I’m not saying that your validator is lying, Mark, just that I don’t agree with the location of the error. ;)
— ian
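An alternative to grepping hexdump output is to scan the raw bytes and report positions directly, which avoids having to reason about how hexdump groups bytes. The sketch below is a hypothetical helper, not part of the validator; it reports 1-based line and column numbers counted in bytes, which may differ from a parser's character-based columns.

```python
def find_high_bit_bytes(data: bytes):
    """Yield (line, column, byte) for every byte >= 0x80 (1-based positions)."""
    line, col = 1, 1
    for b in data:
        if b >= 0x80:
            yield line, col, b
        if b == 0x0A:  # newline: move to the start of the next line
            line += 1
            col = 1
        else:
            col += 1


# Made-up sample; for a real feed, read the file in binary mode instead:
#   data = open("index.rss", "rb").read()
sample = b"<title>ok</title>\n<title>caf\xe9</title>\n"

for line, col, b in find_high_bit_bytes(sample):
    print(f"line {line}, column {col}: 0x{b:02x}")  # line 2, column 11: 0xe9
```

If this prints nothing for a feed, the file really is pure ASCII and the error must lie elsewhere.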
If you put an “encoding” attribute to your XML declaration, it should accept whatever character belongs to that encoding. My software produces ISO-8859-1 encoded XML documents and they should validate fine, which they do (well, that’s not true… right now, they don’t validate because of relative src attribs on images)
I typically find that setting the encoding to iso-8859-1 solved a lot of issues that I had with directly cutting and pasting non-English characters such as Ç. For instance, Mark’s http://diveintomark.org/public/2003/06/highbit.xml is well-formed XML once the encoding is set to iso-8859-1
Re: it’s well-formed XML once I do something to it.
In other words, it’s not well-formed XML now.
— Mark
I use your own RSS 2.0 feed template which does not escape HTML entities (apparently). → isn’t escaped with an &.
— Jesper
“UserLand’s validator is closed source”
That’s not true. UserLand ships the source to the validator. It’s the RSS parser in Radio and Frontier. Everyone who uses those products gets the source. Hardly “closed.”
Dave,
It’s not available except as part of a commercial software package (I intentionally avoided the term ‘proprietary’ here for the sake of argument), and it can’t be modified and redistributed.
That makes it closed. Not closed in the sense of a car with the hood welded shut (in that sense, Userland’s products are very open), but closed source just the same. The same goes for MovableType, of course, but a validator is a piece of infrastructure whose openness can’t be constrained to visibility.
ian: Sam and I are still looking into it. You may very well be right. There definitely was a bug, in that the validator was throwing an error but not reporting it. That has been fixed. There may be another bug that causes it to throw an error when it shouldn’t; that’s what we’re looking into now.
Did I mention “actively maintained” in my list of benefits? ;)
— Mark
Dave: are users who receive that source code free to modify it and redistribute their changed version? If not, then it’s closed source.
You can argue about whether “open source” is a worthwhile benefit over “closed source”, but you can’t redefine the terms.
— Mark
Mark,
I realise that you are trying to make the distinction between open-source and other licensing schemes, but calling it “closed-source” is not accurate, in my opinion.
Everyone involved with Free Software/Open Source Software is keen to differentiate between their particular flavours and software that isn’t freely modifiable+redistributable, but the term they use is “with-source, not open-source”. “With-source”, in my mind, is different to “closed-source”. For one thing, if the vendor goes under, you aren’t needlessly constrained, plus you get to see how it works.
I would completely agree if you said that “with-source” in no way gives anywhere near the benefits of “open-source”, but I can’t agree with equating “closed-source” with “with-source”. As a matter of fact, I believe a validator is one problem domain in particular that benefits from the scrutiny given by Free Software/Open-Source Software.
Open source: kernel(xml.compile)
Aside: is it time to add “FUD” to Mark’s list of magically-markedup acronyms?
I encode my RSS feeds as ISO-8859-1 because that’s what my content is. Am I wrong? It should be ok and makes everything more manageable.
— alessio
I am still waiting for Windows-1250 encoding support:
This feed does not validate as RSS.
line 1, column 30: unknown encoding (maybe a high-bit character?) [help]
— JB
Tim,
You stated that:
“With-source”, in my mind, is different
to “closed-source”. For one thing, if
the vendor goes under, you aren’t
needlessly constrained, plus you get to
see how it works.
Which isn’t true. If the vendor goes under, you are still bound by the intellectual property restrictions under which you licensed the code in the first place. These rights very rarely “go away” (witness the SCO thing), and some corporate officer or entity left to pick up the pieces of the failed vendor will invariably seek to keep these IP rights on life support until such time as they are deemed useful. You are depending as much on the altruism of the vendor with respect to their license and what you can do with their code as you were when they were in business. It’s an easy mistake to make, and a lot of people without experience fighting IP battles or choosing licenses blithely assume that if you get source, you can do what you like with it if no one is looking.
There is no fundamental difference in your rights to a piece of code in compiled or source format if the license is restrictive. Equating “with-source” and “closed-source” is indeed the correct way to assess the situation for end users. Sorry.
I think that your validator believes that highbit.xml is in ISO-8859-1 because that is the default charset used by the HTTP protocol. Use UTF-8 if you want to use high bit characters (UTF-8 and/or UTF-16 should always be supported by an XML parser).
— Dries
‘Equating “with-source” and “closed-source” is indeed the correct way to assess the situation for end users. Sorry.’
I disagree.
It depends entirely on the licence accompanying the source. There is clearly a continuum between “Open-Source” licences and “With-Source” licences so restrictive that they might as well be “Closed-Source”.
I would ask Mark whether he would apply the label “Closed-Source” to MovableType. It’s certainly not “Open-Source”, but most of us are quite happy with the existing licencing terms.
Alex, I agree with what you said *almost* completely. The critical difference would be the addition of the qualifier ‘over the long term’ to the beginning of the last paragraph.
In the short term, visible source (particularly in an interpreted code system, which Frontier is) still goes a long way to let people scratch their ‘I need to fix this right now’ itch, even if they can’t distribute the changes.
I’d appreciate it if those programs weren’t called “validators”. RSS purports to be an application of XML. In the XML context, “valid” (without qualifiers such as “schema valid”) has a specific technical meaning defined in the XML spec. The RSS “validators” do not check for validity as specified in the XML spec. They check for something else (which is useful but is not the same thing as checking for validity in the XML sense). I suggest they be called “linters” or something like that.
(Whether RSS is actually an application of XML is debatable. I think it should be considered alarming that people are even discussing whether RSS should be well-formed while at the same time taking it for granted that RSS is supposed to be an application of XML. Being well-formed is not optional in XML. If RSS is an application of XML, RSS documents have to be well-formed—no exceptions. If a document isn’t well-formed as per the XML spec, then the document isn’t an XML document—no excuses.)
I prefer the validator built into TopStyle (http://www.bradsoft.com/topstyle/). Sure it cost me $70 … it also saved me days if not weeks of developing and deploying tons of tableless templates.
Re: “Being well-formed is not optional in XML.”
Henri: I think everyone here agrees with you, including the RSS 2.0 specification, which clearly states in the opening section that “RSS is a dialect of XML. All RSS files must conform to the XML 1.0 specification.”
My point is that it’s dangerous to rely on a service that purports to “validate” an XML-based format when it obviously doesn’t even use a conforming XML parser to do so. Any conforming XML parser would catch these mistakes (like improper character encoding). Ours does; theirs doesn’t. Therefore ours is better. End of story.
Now, ours still has limitations, as one or two people have pointed out. There are encodings that it doesn’t recognize, which means it is not as useful as it could be to as many people as it could be.
But the right answer is not “well, let’s just ignore the encoding and try to validate it anyway”. That is exactly 180 degrees away from the right answer. At least ours admits its limitations; when it comes across an encoding it doesn’t recognize, it barfs, which is exactly what a conforming XML parser *must* do. The best answer, long-term, is to get a better XML parser with a wider range of encoding support. We’re looking into that now.
— Mark
Jacques,
You said:
> It depends entirely on the licence
> accompanying the source. There is
> clearly a continuum between
> “Open-Source” licences and “With-Source”
> licences so restrictive that they might
> as well be “Closed-Source”.
Well, of course the license matters; how do you think Open Source licenses enforce their freedoms? By definition though, anything that can’t be called Open Source carries licensing terms that are onerous in one way or another. My point was less that any one license is highly restrictive right now, but rather that as long as you receive source that does not come with the kinds of guarantees provided with real – honest to goodness – Open Source licenses, you are at the mercy of the vendor. Whether or not you like the vendor or their license right now is orthogonal.
Making the discussion a little bit more concrete, under the userland license right now, you may not (from http://radio.userland.com/license):
“Modify, translate, reverse engineer, decompile, disassemble (except to the extent applicable laws specifically prohibit such restriction), or create derivative works based on the Software;”
Read that slowly and carefully. You may not create derivative works. You simply can’t. Source or not, if you make a change to his code, you are breaking his license (which may be governed by either IP or contract law, depending on who you ask).
If you like Open Source and what it stands for, then you MUST recognize that licensing terms like that must first and foremost be respected, and secondly, they must be defeated with better code. You can talk about continuums all you like, but that doesn’t make Dave Winer’s license any less hostile.
And you also said:
> I would ask Mark whether he would apply
> the label “Closed-Source” to
> MovableType. It’s certainly not
> “Open-Source”, but most of us are quite
> happy with the existing licencing terms.
Again, your peace of mind regarding non-Open Source licensing terms does not make them more open, it simply means that they meet your personal requirements (which is, after all, what matters when you choose software).
Re: “I would ask Mark whether he would apply the label “Closed-Source” to MovableType.”
Jacques, Movable Type is closed source. Quoting from the license:
"""
Restrictions on Use. Licensor grants you the non-exclusive, non-transferable right to use the Software to manage and update your personal, non-commercial website. You may not redistribute the Software without Licensor’s prior written consent. Although you may modify or alter the Software for your own use (including copies that extend, or enhance the Software), you may not distribute, transfer, or resell the modified or derivative copies of the Software; you may not use such copies for other than personal, non-commercial purposes;…
"""
(source: http://www.movabletype.org/license.shtml )
Now, here’s a quote from the Open Source Initiative:
"""
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources.
…
The program must include source code, and must allow distribution in source code as well as compiled form.
…
The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
"""
(source: http://opensource.org/docs/definition.php )
Bottom line: “comes with source” != “open source”. It is fairly clear from reading the licenses of both Movable Type and Radio (and the RSS validator that it contains) that they are both very far removed from being “open source”. I feel entirely justified in calling both of them “closed source”.
— Mark
My point with the terminology hair splitting above was that the archive.org RSS Validator does not check for validity in the XML sense. For example, it proclaims that the (unfunky) tag soup (“entity-encoded HTML”) over RSS 0.92 feed of Macsanomat is “valid” even though the feed cannot be valid in the XML sense, because it is a DTDless XML document.
I think what the RSS Validator is doing may even be more useful than validation in the XML sense but the thing it is doing isn’t validation in the sense specified in the XML spec. Hence, my suggestion to not call it a “validator”.
Yes, Henri, I fully concede the point that RSS is a DTD-less XML vocabulary, and hence can not be “valid” in the sense of “checking against a DTD”. This is a very arcane point which is completely beside the point, because “validator” is a generic word used in many contexts. Our validator is in good company:
http://jigsaw.w3.org/css-validator/
http://www.w3.org/P3P/validator.html
http://validator.soapware.org/
http://www.ldodds.com/rss_validator/1.0/validator.html
http://developer.mimer.com/validator/
http://www.daml.org/validator/
http://www.stg.brown.edu/service/oebvalid/
And so forth.
— Mark
Dave took his survey down, ironically spotted in his RSS feed. I put one back up. On his site.
>> Based on everything you’ve read in the last week or so, do you feel a need to change what RSS is, or is it good enough the way it is? <<
http://www.userland.com/surveys/run/zapadoo@yahoo.com/doYouLikeRss
Scripting News is taking a break. (Ok Dave, make it permanent)
§
© 2001–9 Mark Pilgrim