dive into mark

Saturday, June 21, 2003

History of RSS date formats

I want to talk about prior art. But I can’t do that yet, because first I need to give my opinion about this funky RSS business.

It’s FUD. It’s crap. Cut it out. It’s exactly the sort of thing that everyone hates Microsoft for: talking trash about competitors, making vague threats of incompatibility, and scaring end users in the process.

(If you don’t know what I’m talking about, trust me, you don’t want to know. Stop reading right now before you get sucked into this world.)

My RSS 2.0 feed uses Dublin Core (those dc: elements) to express things like dates, authors, and subjects. This is a perfectly legitimate use of RSS 2.0. The RSS 2.0 specification explicitly states that “a RSS feed may contain elements not described on this page, only if those elements are defined in a namespace.” So anything goes, and it’s all good. Furthermore, several Dublin Core elements (like dc:date) are well-supported in the current generation of aggregators. So not only is it acceptable, it’s actually supported and useful, right now, today.

However, in RSS 2.0, there are several things that you can express in two different ways. Dates, for example. Instead of dc:date, you could use pubDate. Many people do this. Given the choice, why would I (and others) choose to use Dublin Core instead? The honest answer is that either way works, but we like doing it this way. This brings me to my original point: respecting prior art.

I have previously taken the position that using Dublin Core respects prior art, but using pubDate does not. After doing some digging, I am reversing this position. Both pubDate and dc:date respect prior art.

Dublin Core has been around for years. It was originally conceived in 1995 as a way of embedding common metadata (like titles, dates, and subjects) into many different types of data. The original focus was on HTML, but there are now well-defined guidelines for expressing it in XML and RDF as well. It is widely used, and it’s even an ISO standard now.

Many people (myself included) have chastised Dave Winer in the past for not embracing Dublin Core explicitly in RSS 2.0. BUT… RSS 2.0 simply builds on the precedent set by the RSS 0.9x line. When Dave introduced the item-level pubDate element in RSS 2.0, he was respecting prior art: the prior art of the original RSS 0.9x line. Consider:

Netscape’s RSS 0.90 specification (March 1999) had no dates of any kind, but Netscape added a channel-level pubDate in RSS 0.91 (July 1999). (They also added lastBuildDate.) These date elements were in RFC-822 format; that is, they were in the same format as the dates in email headers and HTTP headers. This date format has persisted (with the addition of 4-digit years) to the present day.

BUT… in 1999, why did Netscape add those elements to RSS 0.91, and why did they use that particular date format? Well, they didn’t make them up; Netscape took them straight out of Dave’s competing syndication format, called scriptingNews format. (BTW, this is the basis of Dave’s claim that he “co-authored” RSS. Please, let’s not go there today, OK?)

Now, scriptingNews format predates RSS by a wide margin. Dave started developing it back in 1997, and it had a pubDate element. But he didn’t make up a new date format. He used the one that’s always been used in email headers, as defined in RFC 822, which dates back to 1982. (4-digit years were added in RFC 1123.)

Now, the Dublin Core element set was stable by 1997, when Dave started work on the scriptingNews format. (Here’s an article from July 1997 on how to use Dublin Core in HTML.) BUT (a) Dublin Core was still very new, and (b) all the work being done with it focused on HTML. I find no record of it being used in XML until 2000.

BUT… the ISO 8601 date format (which is what Dublin Core uses, and which is the format that Sam refers to as “easier to get right”) did exist in 1997; it dates all the way back to 1988. In fact, Microsoft’s early versions of CDF (submitted to the W3C in March 1997, and later enhanced and used in Internet Explorer 4.0) defined a LastMod element that was a date in ISO 8601 format.

BUT… the ISO 8601 date format that we know and love today (and that is used in Dublin Core, RSS 1.0, and some RSS 2.0 feeds), didn’t actually exist in 1997. The specification we now call ISO 8601 is really ISO 8601:2000, meaning that there was a revision of the standard in the year 2000. The revision was made because the 1988 specification (ISO 8601:1988) defined an overly wide variety of date formats that, you guessed it, made it very difficult to parse.

Update: But wait! There’s more! Kellan notes that Dublin Core doesn’t actually use the full ISO 8601 specification, it uses a profile (a restricted subset) defined by the W3C. And when was this profile published? Well, it was submitted in September 1997, but it wasn’t officially published until August 1998. So it probably wasn’t widely known in 1997, when Dave was developing his scriptingNews format.

To recap: in 1982, RFC 822 defined a date format. In 1997, Dave Winer respected prior art by using that date format for the date elements in his syndication format. He could have chosen a different date format, but he didn’t, and his choice made good sense at the time. In 1999, Netscape respected prior art by taking elements from Dave’s scriptingNews format and not changing the date format. In 2000, Dave Winer continued the RSS 0.9x line and respected his own and Netscape’s prior art by not changing the date format. In 2002, Dave Winer respected this entire line of prior art by adding item-level pubDate, with the same date format.

Now, none of this is to suggest that namespaces are bad. That’s just ridiculous. Namespaces were the biggest new feature in RSS 2.0; they are the very reason RSS 2.0 is called 2.0 and not 0.94. Yes, using pubDate also respects prior art. But using Dublin Core also respects prior art, just a different lineage of prior art. Using either in RSS 2.0 is absolutely legitimate, and every news aggregator I know of, that cares about dates, supports both.

Furthermore, Dublin Core and ISO 8601 have won in the larger worldwide marketplace. Outside of the Internet, virtually no one uses the RFC 822 date format. If I were creating a brand new format today, any kind of format, for any reason, I would absolutely use the profile of the ISO 8601 date format published by the W3C. If I were creating an RDF-based format, or an XML-based format for namespace-aware consumers, I would absolutely use Dublin Core, straight up. It’s here, it works, it’s its own ISO standard. But RSS’s pubDate wasn’t invented today; it was invented in 1997. It still works, and you can still use it in your RSS if you want. I use Dublin Core.
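To make the two formats concrete, here is a minimal Python sketch (not code from any feed or aggregator mentioned above) that reads an RFC 822 style pubDate and a full W3CDTF dc:date into comparable values. The function names and the sample timestamps are made up for illustration, and the dc:date parser assumes a complete date-time with a timezone; year-only or year-month values would need extra handling.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(text):
    """Parse an RFC 822 style date, as used by pubDate."""
    return parsedate_to_datetime(text)


def parse_dc_date(text):
    """Parse a full W3CDTF date-time, as used by dc:date."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so spell the UTC offset out to stay compatible with older versions.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


# The same instant, written both ways (illustrative values).
pub = parse_pub_date("Sat, 21 Jun 2003 04:28:00 -0500")
dc = parse_dc_date("2003-06-21T04:28:00-05:00")

print(pub == dc)
print(dc.astimezone(timezone.utc).isoformat())
```

Both parsers return timezone-aware values, so a consumer that accepts either element can compare and sort entries without caring which format the feed chose.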

40 comments

  1. So, let me get this straight. RFC2822 is the format used by the pubDate element in RSS 2.0, and looks like this: Fri, 21 Nov 1997 09:55:06 -0600

    ISO8601 is used in the dc:date element in RSS 1.0 (and sometimes RSS 2.0) and looks like this: 1998-05-12T14:15:00

    Both formats (at least according to their specifications) can contain more or less information, i.e. the time can be left out and the time zone is optional.

    And to confuse matters even more, some feeds that have a pubDate element use ISO8601 (see my rant about RSS dates: http://simon.incutio.com/archive/2003/04/04/lettingOffSomeSteam )

    I wish there was a vendor-independent FAQ about all of this stuff, because when you combine all of the nuances it becomes pretty hard working out what to extract from an RSS feed.

    Comment by Simon Willison — Saturday, June 21, 2003 @ 4:28 am

  2. I don’t see anything in the Netscape 0.91 spec that says that pubDate and lastBuildDate are RFC-822. The one example given uses RFC-822 + 4-digit years, yes, but it’s never safe to extrapolate from examples; the example is in GMT, maybe all RSS dates need to be?

    UserLand’s 0.91 through 0.93 specs explicitly say RFC-822 dates; there’s nothing at all in the language of those specs that allows 4 digit years.

    Comment by Todd Larason — Saturday, June 21, 2003 @ 4:49 am

  3. Actually RFC-1123 updated RFC-822 to use four digit years in October 1989. Strangely enough, four digit years were already in RFC-733 in November 1977, but were apparently dropped in RFC-822, which replaced RFC-733.

    If I understand it correctly, RSS 2.0 introduced both the pubDate field and the possibility of namespace extensions. In that case I find it strange to use this extension possibility to introduce a field that already exists.

    Comment by Curioso — Saturday, June 21, 2003 @ 7:04 am

  4. Okay, so both have a legitimate claim to be there.

    It’s still madness to keep pubDate. Why? It’s done better by Dublin Core, authors aren’t dependent upon pubDate, and readers find it easier to implement Dublin Core.

    So you have two choices: get rid of one, have a tiny amount of backwards compatibility to handle for a short amount of time, and end up ahead in the long run; or put it off, and force everyone to deal with the issues indefinitely. It looks like Dave chose the latter for RSS 2.0, though I can’t understand why.

    Comment by Jim — Saturday, June 21, 2003 @ 7:36 am

  5. Great essay. Nicely balanced and very informative. Thanks.

    I think I’ll stick to using pubDate for the time being, but if I ever decide to use the wider range of dc: elements I will then change my feed to use dc:date. It somehow doesn’t seem worth the extra syntax bloat to use the dc: namespace for just this one element.

    Comment by Már Örlygsson — Saturday, June 21, 2003 @ 7:56 am

  6. Trackback by DIENSTRAUM MediaMondo
  7. Great essay. I’d add that at one point there was one date format in RSS and then a second was introduced. I believe dc:date was added to the repertoire in RSS 1.0.

    Comment by Randy — Saturday, June 21, 2003 @ 10:00 am

  8. more…
    boing boing does have the 822 dates incorrect, but then both you and mark have the iso dates incorrect :) timezone is wrong

    Comment by Randy — Saturday, June 21, 2003 @ 10:21 am

  9. Dublin Core (and therefore RSS) actually specifies a profile of ISO 8601, a simplified subset of the format sometimes referred to as W3CDTF. The spec is at:

    http://www.w3.org/TR/NOTE-datetime

    Yes some of the pieces are optional, in fact nearly all of it is, but the spec gives clear examples of what is and isn’t optional, and generally it is a very straight forward spec, as datetime specs go. (be thankful you’re only parsing RSS and not iCal)

    Sorting:
    The one piece the spec leaves unclear is comparing dates of different granularities (e.g. I post so infrequently that I use only the year for my dc:date, while you have to use fractional seconds for yours). This is supposed to be defined by the spec using the standard, a task that both Dublin Core and RSS punt on.

    The general consensus for sorting dates with missing info is: assume January 1st on a missing date, assume midnight on a missing time, assume GMT on a missing timezone (though having a time without a timezone violates the spec).

    Date geeks will be horrified at the idea of conflating 2003 with 2003-01-01, but down that road lies madness.

    Parsing:
    Also I’ve made some code for parsing W3CDTF available. Both were quick jobs, so I don’t guarantee they’ve been fully debugged (maybe with more users :)

    In Perl, as part of the new Perl DateTime project:
    http://search.cpan.org/dist/DateTime-Format-W3CDTF/

    And in PHP as part of my RSS parser Magpie, see parse_w3cdtf in rss_utils.inc:
    http://magpierss.sf.net

    ps. Randy, Mark’s timezones look good to me.

    Comment by kellan — Saturday, June 21, 2003 @ 10:49 am

  10. Trackback by bradchoate.com
  11. But while we’re complaining about RSS date formats, the one that is really really broken is the one used by the syndication module.

    http://purl.org/rss/1.0/modules/syndication/

    Utterly ambiguous and non-descriptive.

    http://laughingmeme.org/archives/000392.html

    Comment by kellan — Saturday, June 21, 2003 @ 11:07 am

  12. Trackback by Simon Fell > Its just code
  13. kellan, re: mark’s dates, sorry, must just be me

    Comment by Randy — Saturday, June 21, 2003 @ 11:38 am

  14. Mark, there’s absolutely nothing wrong with using namespaces in RSS 2.0.

    Second, I chose the more reader-friendly date format for scriptingNews format. I was aware of the new format, I had seen it in stuff created by Microsoft, but I found it impossible to read. One of my goals for formats I design is that they be transparent and absolutely as easy as possible for non-technical people to understand. To me, this was the wonder of HTML, that people with very skimpy technical backgrounds could understand it very quickly.

    More on the subject of simplicity in formats.

    http://davenet.userland.com/2000/09/02/whatToDoAboutRss#rssIsAboutSimplicity

    Comment by Dave Winer — Saturday, June 21, 2003 @ 12:43 pm

  15. Mark, there’s absolutely nothing wrong with using namespaces in RSS 2.0. I disagree with some of the other things you say, I believe I am entitled to an opinion on what other weblog tools vendors do with content using the name RSS, much as they would be interested if one of their competitors changed what Trackback means or what the Blogger API means. I think you have obscured the issue, which is a shame, because it is an important one.

    Second, I did choose the more reader-friendly date format for scriptingNews format. I was aware of the new format, I had seen it in stuff created by Microsoft, but I found it hard to read. One of the goals for formats I design is that they be transparent and absolutely as easy as possible for non-technical people to understand. To me, this was the wonder of HTML, that people with very skimpy technical backgrounds could understand it very quickly.

    Further, since the more human-readable format had been around for a long time, and was used in email headers (as you note), I reasoned that most scripting environments would likely already have support for this format implemented. Even though it’s harder to parse than the newer format, it wasn’t that much harder to parse. I think this guess was a good one; I’ve rarely heard a complaint about this choice.

    More on the subject of simplicity in formats.

    http://davenet.userland.com/2000/09/02/whatToDoAboutRss#rssIsAboutSimplicity

    Comment by Dave Winer — Saturday, June 21, 2003 @ 12:52 pm

  16. Dave, Re: reader friendliness:

    I find ISO dates are more readable than RFC822, because RFC822 assumes English and puts the month before the day. In Spanish (in Britain and in most Europe also, I think) you can understand both YYYY-MM-DD or DD-MM-YYYY, but MM-DD-YYYY is quite unnatural. RFC822 is somehow better than a MM-DD-YYYY format, because at least the month is in letters and this makes it less error prone, but for non-technical Spanish people it is still gibberish (I imagine this holds even more for, say, the 1000 millions of Chinese people).

    But I don’t really care, so long as the format is easily generatable and parseable in most tools or computer languages.

    Re: Timezone, I tend to prefer using “Z” for timestamps or other distributable dates in server side software, but current tool implementation status and bug^H^H^Hfeature sets :-) should be taken into account to avoid further mudding of an already wet field.

    The reason for using UTC is obvious if we consider the alternative is called “local” time.
    What does “local” mean when I’m posting from a hotel in San Francisco to a host in Madrid, and pinging machines all around the world, and someone in next room is going to read my pings from a machine in say, Raleigh, VA?

    Comment by Santiago Gala — Saturday, June 21, 2003 @ 1:31 pm

  17. Trackback by Observations
  18. Santiago — good point about readability. What I said applies to the US only. We have a nasty habit here of thinking the world revolves around us, and I say that with no sarcasm.

    If I were new to RSS, I would do what I did in the past with HTML and HTTP, find a source that is universally supported and mimic what they do. I did that over and over with Yahoo, figuring if they did something one way, that it had been vetted through lots of iteration and testing, and that if people were going to complain about it, they’d go to them first, because they’re so much larger. Same with Apache, Netscape, MSIE. So if I were new to RSS, I’d mimic the most popular feeds, the ones that the aggregators likely test against, the ones who if they broke them they’d have to fix their aggregator.

    We used that technique with MORE in the 80s with good results, we always tried to surf in Microsoft’s wake (they were notoriously difficult for Apple to keep from breaking). We figured we could probably get away with anything they were doing. We also looked at Aldus the same way.

    End of ramble.

    Comment by Dave Winer — Saturday, June 21, 2003 @ 2:06 pm

  19. Trackback by Teal Sunglasses
  20. Hasn’t HTML used 8601 in the datetime attributes of the <ins> and <del> elements since 1997?

    2003-06-21T20:45:00Z

    Comment by Phil Wilson — Saturday, June 21, 2003 @ 3:56 pm

  21. Including a local time offset provides more useful information to associate with a post or comment. It can be relevant to supply the time of day as it gives the audience an idea about whether something was written at 4.00am or 1.00pm, for example, where UTC might not. The time on this comment, when posted, will not be representative of the time it is here (actually, it’s out by 10 hours and 11 minutes).

    Comment by Anonymous — Saturday, June 21, 2003 @ 8:22 pm

  22. Santiago: dd/mm/yyyy is also the date format used within Australia as well. As far as I know mm/dd/yyyy is used exclusively within America. I wonder how on earth that came to happen, anyway?

    Comment by Lach — Saturday, June 21, 2003 @ 9:36 pm

  23. Lach: When Americans say dates aloud, we say things like “June twenty-first, 2003”. This translates naturally to a format like 6/21/03. (The expanded version, 06/21/2003, is rarely if ever seen.)

    Comment by Dave Menendez — Saturday, June 21, 2003 @ 10:02 pm

  24. Trackback by Virtuelvis
  25. re. “funky” - yeah, agreed.
    Nice to see the history, thanks Mark.

    Personally I’d come down strongly in favour of ISO 8601/W3CDTF for purely practical reasons, irrespective of any prior art.

    Away from RSS, I’ve experienced serious mixups more than once thanks to the US/UK difference on (numeric) mm/dd/yy and dd/mm/yy. As Santiago points out, using the names is hopeless outside of anglo-speaking countries.

    Another significant point that was touched on for other reasons is sorting. W3CDTF’s sort order is (more or less) the same chronologically and alphabetically. I’ve started naming folders of photos according to the date as yyyy-mm-dd because it just stacks up right on the file system automagically.

    I don’t think Dave’s argument about RFC-822 being easier for users holds very well - it may be easier (for Americans) to read in the ‘view source’ sense, but is harder to parse, so it’s making extra work for the end user. Unless of course the aim is to merely pass on the date unchanged, in which case why mark it up at all? Let’s have one-element syndication - surely content is all you need?

    Finally my usual crit of RSS 2.0 also applies - pubDate is used just in this stovepipe application domain (blog/newsreading), dc:date can be used *anywhere* (and is being - aside from RDF, note the use of DC.Date in HTML headers).

    Comment by Danny Ayers — Sunday, June 22, 2003 @ 5:45 am

    I do not understand the practical reasons for choosing the ISO dates. The RFC dates are just as unambiguous, compare “Sun, 22 Jun 2003 15:55 +0200” versus “2003-06-22T15:55+02:00”. I find the ISO dates somewhat more ambiguous, because something like “2003-06-22T01:00-05:00” could be interpreted as between one and five in the night… Also GMT to me is more clear than Z… I do not think either of the formats is clearly better or worse than the other, they both originate from a christian-centred world, and thus discriminate against other religions. Internet protocols and formats are english oriented anyway; maybe we should switch to numbered fields or latin, so that everybody has the same disadvantage ;-)))

    This seems to be a recurring cycle in discussions about data formats, to prefer machine readable over human readable, or vice versa. The Internet has a long history in preferring human readable formats, and this is an aspect I like. But since the Internet boom a lot of people have joined from other backgrounds, and are trying to push their side, leading to obfuscation and introduction of variant formats where none are really necessary.

    Seeing that a date is an essential attribute of a weblog entry, I think a date field should be part of the RSS specification, and not be delegated to an extension. Note that RFC-822 specifies a full timestamp, while dc:date leaves a lot open; in practical use of Dublin Core often just a date or year is used, and the time is omitted.

    Comment by Curioso — Sunday, June 22, 2003 @ 10:08 am

  27. Curioso: One of the advantages of ISO dates is sorting and comparing. Since the order of magnitude decreases the further you read to the right, they can be sorted and compared with simple text comparisons [*]. For RFC dates, you first need to put the separate elements of the date into a fitting data structure.

    [*] You will, however, as with RFC dates, need to convert to a common timezone.

    Comment by Arve — Sunday, June 22, 2003 @ 10:51 am

  28. The sorting argument is nice as long as you do not have timezones. I also use YYYYMMDD strings in filenames and folders that I want to keep sorted. But converting 1999-12-31T23:00-05:00 to 2000-01-01T04:00Z is not really trivial, and it is easily overlooked that this is necessary for sorting to work.

    Comment by Curioso — Sunday, June 22, 2003 @ 11:24 am

  29. kellan, Sam adjusted his timezone. Must not have been just me.

    Comment by Randy — Sunday, June 22, 2003 @ 7:26 pm

  30. Trackback by Compendium
  31. Trackback by Rodent Regatta
  32. Trackback by There Is No Cat
  33. Trackback by Random Stuff
  34. I’ve actually started dating my checks in 8601 format. It’s so simple! Year-month-date. Can’t beat it.
    To the poster who said negative timezones could be confused with durations, I believe the spec dictates using forward slashes to indicate duration. For example, I wrote this message 2003-06-24T12:39:33/42:09

    Comment by d chalmers — Tuesday, June 24, 2003 @ 4:53 pm

  35. I’ve actually started dating my checks in 8601 format. It’s so simple! Year-month-day. Can’t beat it.

    Comment by d chalmers — Tuesday, June 24, 2003 @ 8:30 pm

  36. Trackback by Maximum Aardvark
  37. Trackback by Maximum Aardvark
  38. Given the feed formats are data formats, what does it matter if the date format is readable or not by humans?

    Shouldn’t the aggregator format the datetimes how the user wants? Perhaps even by converting to the user’s local time.

    It’s certainly something I’ve been meaning to add to the mail reader I use.

    (Oh, and the guy who said RFC822 had months before days — um, no, it doesn’t.)

    [for those interested, like Kellan I'm helping with the Perl DateTime project. I did DateTime::Format::Mail which handles RFC822, RFC2822 and assorted bastardisations that are common in the wild. Kellan's W3CDTF module is far simpler, lending credence to W3CDTF being a nicer format for these things.]

    http://search.cpan.org/~spoon/DateTime-Format-Mail/

    Comment by Iain Truskett — Thursday, June 26, 2003 @ 9:25 pm

  39. Why human readable dates? Because developers aren’t perfect. If every single program written to produce RSS was absolutely perfect at divining what time it is (in UTC, in the server’s timezone, and in the user’s timezone, even if the server is in P(S|D)T and the user is in Israel, where daylight savings time is determined by fiat, based on how the holy days fall out in a particular year), then humans wouldn’t ever need to read it. But as it is, most of the feeds I subscribe to are accurately timed for half the year, and a great many more aren’t ever right. My favorite counter-argument to “developers will save us, users don’t need to ever look at the nasty bits” is to point at the first month or two of RSS-DEV messages, and ask how many of the developer promises have been broken.

    Comment by Phil Ringnalda — Friday, June 27, 2003 @ 2:47 am

  40. That’s the joy of W3CDTF. Timezones are offsets. Only exception is that of Z, which is +00:00. Most languages have a gmtime function, or similar. The only problem is when the generating machine’s clock is out. None of this ‘CDT’, ‘PST’, whatever, nonsense.

    Yes, most developers tend to mess up their time handling. Most languages have crappy support for timezones. W3CDTF does tend to be easier to get right than *822 style. If things are, on average, tending more toward correct, then things are good.

    As Mark mentioned in his original entry, people tend to use whatever date format they feel like. So even if we encourage particular formats, what does it matter? We may as well encourage a decent, tidy, format. People will get it wrong no matter what. Heck, RFC822 has been around for *ages*, yet people writing mail programs still get it wrong. My module that I mentioned earlier has a test suite. One file, t/invalid.t, tests that the module ‘correctly’ doesn’t handle particular dates that I’ve found in the wild. I go to a number of lengths if the parser is in loose mode, but I won’t go as far as parsing these failures.

    http://search.cpan.org/src/SPOON/DateTime-Format-Mail-0.25/t/invalid.t

    Anyway, are you suggesting that you read raw RSS as part of your daily aggregating? I think having an easier to format and easier to parse datetime string would tend to more accuracy. If even one more feed has an accurate set of times because of that, then that’s good.

    Hmm. My apologies if I’m sounding argumentative. I’m just a proponent of user ease rather than developer ease, and in this case I think helping the developer helps the user. Win-win.

    Comment by Iain Truskett — Friday, June 27, 2003 @ 3:28 am
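The sorting points raised in comments 9, 27, and 28 can be sketched in Python. This is an illustration only, not the Perl or PHP code linked in the comments: the regular expression and the sort_key name are hypothetical, and the defaults (January 1st, midnight, UTC) are simply the "general consensus" described in comment 9 rather than anything the W3CDTF note itself requires.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical pattern covering the W3CDTF granularities: year, year-month,
# full date, and date plus time with a mandatory zone designator.
W3CDTF = re.compile(
    r"^(\d{4})(?:-(\d{2})(?:-(\d{2})"
    r"(?:T(\d{2}):(\d{2})(?::(\d{2})(?:\.(\d+))?)?"
    r"(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def sort_key(text):
    """Return a UTC datetime usable as a sort key for a W3CDTF string."""
    match = W3CDTF.match(text)
    if match is None:
        raise ValueError(f"not a W3CDTF date: {text!r}")
    year, month, day, hour, minute, second, fraction, zone = match.groups()
    # Fractional seconds: keep microsecond precision at most.
    micro = int((fraction or "0")[:6].ljust(6, "0"))
    offset = timedelta(0)  # "Z" or no time at all: treat as UTC
    if zone and zone != "Z":
        sign = -1 if zone[0] == "-" else 1
        offset = sign * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    # Missing month or day defaults to 1; missing time defaults to midnight.
    local = datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0), micro,
        tzinfo=timezone(offset),
    )
    return local.astimezone(timezone.utc)


dates = ["2000-01-01T03:00Z", "1999-12-31T23:00-05:00", "2000"]
print(sorted(dates))                # plain text order
print(sorted(dates, key=sort_key))  # chronological order
# ['2000', '2000-01-01T03:00Z', '1999-12-31T23:00-05:00']
```

The plain text sort puts the 1999 entry first even though, once its offset is applied, it is the latest of the three instants. That is the caveat from comment 28: text comparison only matches chronological order after everything has been converted to a common timezone.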

© 2001-8 Mark Pilgrim