I am amazed, bordering on appalled, at the attention garnered by my use of the cite tag. It is admittedly a clever little hack, but #2 on Daypop? This is the epitome of a slow news day. Aren’t we all supposed to while away our Fridays on pointless quizzes and Flash games? I know I do.

However, the ensuing maelstrom has generated some interesting talking points.

  1. Dare Obasanjo thinks I’m contradicting myself, and told me so in private email as well. I had no idea what he was trying to say, and I told him so, rudely, which is unusual, since I’m seldom rude to anyone who doesn’t deserve it, and Dare doesn’t. Dare, I apologize; you caught me in the 8th hour of a Windows XP install. However, in the clear light of day (pleasantly surrounded by Macs running OS X), I still have no idea what point you’re trying to make, or why you feel I’m contradicting myself.

    Later: Dare clarifies: “Given that the W3C thinks XML is the basis for RDF and the Semantic Web it seems the general direction going forward is to move towards replacing a WWW full of HTML documents to one full of XML documents. … If you are for the Semantic Web, you are for an XML Web not for an HTML one.” Wow, that’s just exactly the kind of wrongheaded thinking I was addressing in my original post when I said, “Let’s try pushing the envelope of what HTML is actually designed to do, before we get all hot and bothered trying to replace it, mmmkay?” Thanks for clarifying.

  2. Tantek Çelik has a few more links on better blogging through semantic markup.
  3. Hans Nowak wants to use the code tag to mark up blocks of computer code. The problem is that code is an inline element, not a block element, so all the lines get smooshed together. There are a couple of ways to solve this problem. Joe Gregorio explores using code within pre, and variations. I use code within a p tag and put in explicit line breaks with the br tag and explicit spaces with &nbsp; (which matters for languages like Python where whitespace is significant). Dougal Campbell suggests using the white-space: pre CSS declaration, which is arguably purer than the other approaches but doesn’t work on inline elements in IE 5.x for Windows. (IE 5: the Netscape 4 of a new generation.)
  4. Nico Brünjes has a nice summary of a few lost tags and attributes. There’s so much more to the a tag than the href.
  5. Sam Ruby is producing archives by citation, without using cite tags. Regular expressions go a long way.
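
To make item 3 concrete, here is a minimal Python sketch of the code-within-p approach: explicit br tags for line breaks and &nbsp; entities for indentation. It is an illustration, not the script used on this site. The function name code_to_inline_html is made up, and the sketch assumes that only leading spaces need protecting, since single spaces inside a line survive normal HTML whitespace handling.

```python
from html import escape


def code_to_inline_html(source: str) -> str:
    """Wrap source code in <p><code>, keeping line breaks and indentation.

    Hypothetical helper: leading spaces become &nbsp; entities and each
    newline becomes an explicit <br /> tag.
    """
    rendered_lines = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        # Escape &, < and > in the code itself, then prepend the indentation
        # as non-breaking spaces so the browser cannot collapse it.
        rendered_lines.append("&nbsp;" * indent + escape(stripped, quote=False))
    return "<p><code>" + "<br />\n".join(rendered_lines) + "</code></p>"


if __name__ == "__main__":
    print(code_to_inline_html("if x < 1:\n    print(x)"))
```

The markup stays inline-friendly, which is the whole reason for avoiding pre here; the price is that tabs and runs of spaces inside a line are not handled by this sketch.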

Stay with me, I’m working my way up to a point.

A few months ago, Jon Udell wrote about Google’s co-founder, Sergey Brin. When asked about RDF and the Semantic Web, Sergey said, “Look, putting angle brackets around things is not a technology, by itself. I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.”

Google has invested millions of dollars into their code, and they can do amazing things teasing meaning out of piss-poor markup. They’ve written sophisticated algorithms to figure out which bits in a morass of nested tables and presentational markup are important, and they make the best possible use of the few tags that are universal (title tag for page titles, a tag for links, and so forth).

Then again, Google doesn’t really have a choice. It’s their code, but it’s not their markup. So of course they’re going to invest money in code. It’s far more cost-effective to throw money at your own code than to try to get millions of independent developers to change their ways just to make your life a little easier.

But what if it’s both your code and your markup? Then you have choices.

For example, Sam “it’s just data” Ruby is replicating most of the functionality of my little cite tag hack without using cite tags: he wrote code that knows enough about his own writing style to guess what un-marked-up bits of a post are citations, and proceeds accordingly. It’s not bulletproof, but with a few hours of effort, it ended up being good enough. On the other hand, since my script could assume consistent semantic markup, my code was simpler, had fewer special cases, and worked on the first try.
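
To show why consistent markup keeps the code simple, here is a rough Python sketch of the data-centric side: when every citation is wrapped in a cite tag, collecting them takes one small parser and no guesswork. This is not the actual script behind the cite tag hack; the class and function names are invented for illustration, and the sample HTML in the usage lines is made up.

```python
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collect the text content of every <cite> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.citations = []
        self._depth = 0      # how many <cite> elements we are currently inside
        self._buffer = []    # text fragments seen inside the current <cite>

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "cite" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.citations.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (such as a link within the cite) counts too.
        if self._depth:
            self._buffer.append(data)


def extract_citations(html_text: str) -> list:
    collector = CiteCollector()
    collector.feed(html_text)
    collector.close()
    return collector.citations


if __name__ == "__main__":
    sample = '<p><cite>Dare Obasanjo</cite> thinks so; <cite><a href="#">Sam Ruby</a></cite> disagrees.</p>'
    print(extract_citations(sample))
```

The code-centric equivalent would have to recognize names in unmarked prose, which is where the special cases come from.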

Another example: my further reading script auto-generates quotes from referring pages; along the way, it also attempts to determine the permalink of the referring blog entry which is linking to mine. Finding permalinks is harder than it sounds. My script employs a variety of convoluted methods, including matching ID attributes of div tags to partial URLs in links immediately after the post, looking for “permalink” or “permanent link” in the title of links, parsing out the Trackback data that Movable Type includes in comments in the home page template, and a few other tricks. This works about 60% of the time; 30% of the time I get a false negative (there’s a permalink but the script can’t find it), and 10% of the time I get a false positive (the script picks the permalink to the wrong post, or picks a random link that isn’t a permalink at all).

Now, there is a way to specify permalinks in HTML, but virtually nobody uses it. On each actual permalink link, you can specify rel="bookmark". This would solve the problem of choosing a random link that isn’t a permalink, since the page has told me that certain links are permalinks and others aren’t. It wouldn’t solve all my problems, but it would increase the overall accuracy.
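
As a sketch of the difference, the Python below prefers links that declare rel="bookmark" and only falls back to one of the heuristics mentioned earlier (the words “permalink” or “permanent link” in a link’s title) when no link declares anything. It is an assumption-laden illustration, not the further reading script itself: the names LinkCollector and find_permalinks are hypothetical, the URLs in the usage lines are made up, and the other heuristics (div IDs, Trackback comments) are left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record (href, rel tokens, lowercased title) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel is a space-separated list of link types, compared case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        title = (attributes.get("title") or "").lower()
        self.links.append((href, rel_tokens, title))


def find_permalinks(html_text: str):
    """Return (hrefs, how), where how is "declared" or "guessed"."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()

    # Easy case: the page says which links are permalinks.
    declared = [href for href, rel, _ in collector.links if "bookmark" in rel]
    if declared:
        return declared, "declared"

    # Fallback: guess from the link title, which can be wrong either way.
    guessed = [href for href, _, title in collector.links
               if "permalink" in title or "permanent link" in title]
    return guessed, "guessed"


if __name__ == "__main__":
    page = '<a href="/archives/0001.html" rel="bookmark">#</a> <a href="/about">about</a>'
    print(find_permalinks(page))
```

The declared branch cannot pick a random non-permalink, which removes one class of false positives; it still cannot say which of several permalinks belongs to the post that linked to me.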

I write similar scripts that act on my own content, and they’re 100% accurate, and much easier to write. Why? Better metadata. And as other people begin to write such scripts, mine will be the easy case, the one that’s done and debugged first, before they take a deep breath and tackle a myriad of guesswork and special cases.

Yesterday, in response to this latest discussion, Jon Udell took the middle ground: “This isn’t an either/or proposition. Like Mark, I strongly recommend exploiting to the hilt every scrap of latent semantic potential that exists within HTML. Like Jeff, I strongly recommend sharpening your text-mining skills because semantic markup, in whatever form, will never capture the totality of what can be usefully repurposed.”

This is the point: if you have million-dollar markup, you don’t need million-dollar code, and vice versa. But they’re not mutually exclusive, either; it’s a spectrum, and where you fall depends on what you need. Neither Sam’s code-centric approach nor my data-centric approach is inherently better. They both accomplish the same short-term result. Which approach is better in the long run depends on whether you are more likely to re-use the content or the code that parses the content. Google applies their algorithms to millions of web pages for a single purpose: keyword search. I want to be able to reuse my own content in millions of ways, to do things nobody has thought of yet. They need million-dollar code; I need million-dollar markup.

§


© 2001–9 Mark Pilgrim