Several people have noticed that my URL format has recently changed. Previously I was using a munged form of the entry title, but now I’m using a simpler form. For instance, my wildly popular How to install Windows XP in five hours or less has the clean, cruft-free URL http://diveintomark.org/archives/2003/08/04/xp.

Here’s how I did it:

  1. First, I got rid of my .html extension. There are a number of ways to do this, but this is what worked for me. I created a new .htaccess file within my archives/ subdirectory and added these 2 lines:

    DefaultType text/html
    DirectoryIndex index index.html

    The first line causes files without file extensions to be served as HTML instead of plaintext. The second lets a file named index (with no extension) act as a directory’s index page.

  2. Within the Movable Type web interface, I went to Weblog Config, then Preferences, and blanked out the file extension for archive files. I’m not actually sure if this step is necessary, since I have a custom Archive File Template, but it made me feel better.

  3. Still within Weblog Config, I went over to Archiving and removed the file extension from the Archive File Template for the Individual Entry Archive. It used to look like this:

    <$MTArchiveDate format="%Y/%m/%d"$>/<$MTEntryTitle dirify="1"$>.html

    And I changed it to this:

    <$MTArchiveDate format="%Y/%m/%d"$>/<$MTEntryTitle dirify="1"$>

  4. I rebuilt all individual entries.

  5. Of course, the rest of the world is still linking to the old archive pages with the .html extension, so I needed a redirect (this also goes in archives/.htaccess):

    RedirectMatch permanent /archives/(.*)\.html$ http://diveintomark.org/archives/$1

  6. I also changed the output filename of my main archive page from archives/index.html to archives/index, and similarly removed the file extension from my category archive templates, and rebuilt all those too.

  7. I could have stopped here, and I would have cruft-free URLs with no file extensions. But I wanted to go one step further and be able to customize individual URLs with short slugs instead of the full munged entry title (which can get pretty long). Thus…

  8. This next step requires Brad Choate’s lovely MT-IfEmpty plugin.

  9. Within the Movable Type web interface, I went to Weblog Config, then Archiving, and changed my Individual Entry Archive’s Archive File Template once again, to this:

    <$MTArchiveDate format="%Y/%m/%d"$>/<MTIfEmpty var="EntryKeywords"><$MTEntryTitle dirify="1"$></MTIfEmpty><MTIfNotEmpty var="EntryKeywords"><$MTEntryKeywords$></MTIfNotEmpty>

  10. I went to edit an entry, and clicked Customize the display of this page, selected custom view, and checked Entry keywords.

  11. The deal is, old entries need to keep their old munged-title URLs (minus the file extension). That’s why we’re using the MT-IfEmpty plugin in the file templates. Old entries don’t have any keywords, so it falls back to the entry title. But when I create new entries, I can enter a short slug in the entry keywords field, and this is used as the last part of the entry’s URL, instead of the title. On my How to install Windows XP in five hours or less article, my entry keywords field was simply xp. For this article, the keyword is slugs.
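
To make the logic of steps 5, 9, and 11 concrete, here is a small Python sketch. It is only an illustration, not part of the setup described above: Movable Type and Apache do this work themselves. The function names (dirify, archive_path, redirect_target) are hypothetical, and this dirify is a rough approximation of Movable Type's filter of the same name, not its actual code.

```python
import re
from datetime import date
from typing import Optional


def dirify(title: str) -> str:
    """Approximate MT's dirify filter (an assumption, not MT's real code):
    lowercase, drop punctuation, turn runs of whitespace into underscores."""
    s = title.lower()
    s = re.sub(r"[^\w\s]", "", s)
    return re.sub(r"\s+", "_", s.strip())


def archive_path(entry_date: date, title: str, keywords: str = "") -> str:
    """Mirror the MTIfEmpty / MTIfNotEmpty template from step 9:
    use the keywords field as the slug, else fall back to the munged title."""
    slug = keywords.strip() or dirify(title)
    return f"{entry_date:%Y/%m/%d}/{slug}"


# Same pattern as the RedirectMatch rule in step 5 (matched anywhere in the path).
OLD_URL = re.compile(r"/archives/(.*)\.html$")


def redirect_target(request_path: str) -> Optional[str]:
    """Return the extensionless URL an old .html link should redirect to,
    or None when the path does not match the old format."""
    match = OLD_URL.search(request_path)
    if match is None:
        return None
    return f"http://diveintomark.org/archives/{match.group(1)}"
```

With the keywords field set to xp, archive_path(date(2003, 8, 4), "How to install Windows XP in five hours or less", "xp") gives 2003/08/04/xp; leave the keywords empty and the same call falls back to the munged title, which is how old entries keep their old URLs.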

§

Sixty-eight comments here

  1. MovableBLOG (trackback)
  2. the anti-mega outboard brain (trackback)
  3. I could understand the elegance/reason for all the steps except the first:

    >First, I got rid of my .html extension.

    As an end user, I sometimes derive assurance in knowing what kind of file I am clicking on and going to view. The removal of the “.html” extension removes that. Furthermore, if I am not mistaken, this removal of the extension and specifying of the extension in the .htaccess would be creating more work for the web server that now needs to read the htaccess file every time someone asks for a page.

    Is the shorter URL worth this?

    — Srijith #

  4. Mark, your installation of MT will continue to send out cruft-enabled URLs as part of Trackback ping requests. You’ll have to do some minor hacking in the MT source to clean that out as well.

    <shameless:plug>
    I wrote a HowTo file documenting the steps I took to remove cruft from my blog URLs earlier this summer. It includes directions on how to get rid of all the cruft built into MT’s Trackback and comment scripts.

    Howto: Future-proof URLs in Movable Type
    http://mar.anomy.net/entry/2003/06/22/17.15.00/
    </shameless:plug>

    — Már Örlygsson #

  5. Another shameless plug: If you want to get rid of the .html extension in your url’s, but you want to keep them on your server’s filesystem (for example to make manual file-management easier) you can use a rewrite rule as described on: http://www.vandervossen.net/2003/07/clean_url

    — Thijs van der Vossen #

  6. Re comment 3: TimBL argues http://www.w3.org/Provider/Style/URI.html that HTML may not be around in 20 years but that the URI might well be; dropping the .html extension allows you to change the underlying representation while presenting the same public URI.

    As a purist, I can see that. But as a user I still feel the same reluctance as Srijith: when I see a .html extension on a link, I’m reasonably confident that it’s going to open in my web browser. When I don’t, I’m less sure what I’m getting; it may be a web page, but it may equally well try to throw a huge PDF my way; or a Powerpoint or Word file; or an MP3; or an executable.

    — James Kew #

  7. ;;;

    — Anonymous #

  8. Turning on MultiViews in Apache does much the same, letting you drop the extension, while also letting you specify versions for different languages or encodings.

    Then if you set up a mime-type for application/xhtml+xml, you could serve it to browsers that have it higher than text/html in their Content-Accepts header.

    — Aaron Brady #

  9. #8 isn’t all that clear:

    Assuming you use templates to generate your HTML, you can generate HTML401 as .html, and XHTML as .xhtml, then you can select between them using Content-Accepts.

    What’s the certify button for?

    — Aaron Brady #

  10. Is your Archives directory in your URLs really necessary? By the nature of weblogs, everything IS an archived page except for the posts for the current day. The date should be enough to see that they are already archives without labeling them as archives.

    — CowPi #

  11. Mar’s article was extremely helpful for me when I did this:

    http://www.kennsarah.net/archives/2003/07/22/php_juice/

    I don’t have the ability to use .htaccess files, though, so I used MT to build a redirect page to keep inbound links from breaking.

    — Ken Walker #

  12. Re: multiviews. I’ve never gotten that to work properly in combination with server-side includes.

    Re: /archives/. It is necessary; go to http://diveintomark.org/archives/ and see why. Also, weblog archives is just one part of my site.

    Re: .htaccess. My .htaccess file is 3 pages long; I don’t think adding 2 lines is going to make all that much difference.

    Re: hacking MT source code to get rid of cruft in Trackback and comment pages. Um, OK. I don’t use separate Trackback and comment pages. I use Adam Kalsey’s SimpleComments plugin to list them all inline on the individual entry pages. Surely that’s a more elegant solution than forking your MT install?

    Re: visible .html files giving you the warm fuzzies. Sorry, just don’t buy it. So little on the web these days is served up with visible .html extensions, and yet the web survives, and people click anyway.

    — Mark #

  13. zlog - posts (trackback)
  14. Live in the Delirious Cool (trackback)
  15. Serving a negotiated resource w/o extension makes good sense. In this case, people could bookmark /xp and some day later, Mark might offer the same resource in a different representation.

    Of course, it’s also good web arch to supply a URI for the negotiated representation, too. That way, a person can link to either the negotiated resource or the format-specific resource as needed.

    — Jeremy Dunck #

  16. On the subject of archives:

    http://diveintomark.org/archives/2003/08/index

    does that trailing “index” count as cruft? I notice the URI works nicely if I hack it off, with or without the trailing slash.

    It isn’t hackable to a yearly archive, but falls back nicely to the main archive index.

    — James Kew #

  17. No more file extensions!

    http://scribbling.net/A_case_against_file_extensions

    I went through the reworking of URLs for my site recently, too, but I don’t use MT.

    http://scribbling.net/How_Scribblingnet_freed_itself_from_file_extensions_and_internal_IDs

    Still got a ways to go, though. But don’t we all.

    — Gina #

  18. Er, what if you happen to use the same keywords for another entry? Will the original entry be overwritten?

    — Tremendo #

  19. Regarding the trackbacks, Mark, I think the previous commenter means that if you ping another website via Trackback, Movable Type will send the wrong URL, so the trackback listing on other blogs will point to an invalid page at Dive Into Mark.

    — Luke Reeves #

  20. re (18), the URL includes a date-driven path, so if Mark writes an entry with keyword ’slugs’ tomorrow, it will be at /archives/2003/08/16/slugs, instead of at this URL of this page, perfectly unique.

    Mark, what will happen if people put a slash at the end of your URL, mistaking the /slugs/ for a folder?

    The other approach of putting pages in /slugs/index.html would permit URLs ending in /slugs and /slugs/ to work.

    BTW, my site(s) are hosted on my server in NY and all down, so I put my typepad URL in the home page box above.

    (Just noticed your new [?] checkbox for posting!)

    — xian #

  21. *slowly reaches for the can opener, and carefully places it on a can labelled “worms”*

    Okay, Mark, you like cruft-free URLs without the .html extension. The question is why do you consider this cruft? You don’t agree with the arguments that the .html extension is needed, but then I ask why is it not needed? Why is it important to remove it?

    I could have a 745246-character URL; I don’t care what the URL is as long as my web browser can read it (can it read URLs that long? :) )

    The only thing I see this benefiting is the web developer, but only within his own development sphere. You can simplify your development of your own site, but that then creates two standards on the web instead of one, one using HTML, and one requiring the alteration of .htaccess. One standard will do fine thank you.

    — Adrian #

  22. I’m a big fan of nice, clean URLs. My blog archive lies in /trashcan/mtentryid/, giving an URL like /trashcan/260/. However, trackbacks will refer to /trashcan/260/index.php. I tried the htaccess piece above, but it gets stuck in an infinite loop. I’ve stared myself blind at this now — does anyone see what I am doing wrong in this statement?

    RedirectMatch permanent /trashcan/([0-9]+)/index\.php$ http://echo.ashpool.org/trashcan/$1/

    — Johan Svensson #

  23. - Sorry guv, we’re right out of parrots.
    - I see. I see. I get the picture.
    - I’ve got a slug.
    - Does it talk?
    - Yeah.
    - Great, I’ll have that one then!
    – Monty Python ‘Parrot’ sketch performed live.

    Great. Terrific. I’m looking to do this myself so thanks for the great links.

    — Jesper #

  24. Now, wait a minute.

    Cruft is defined as all that junk the average web user doesn’t care about that makes URL’s long and annoying.

    For example, ‘/000041.php’, the 0000 are cruft, and the software should be smart enough to leave it out. So is the .php, which you’ve figured how to remove here.

    Now, my next question is, is it cruft when the URL is long BUT important? Using keywords instead of titles perhaps makes it not clear what the entry is: I have no idea what ’slugs’ refers to. Similarly, in dwws, Zeldman notes that he uses /c/ or /i/ or /s/ instead of /css/, /images/, /styles/…I question whether this is an improvement as well. /c/ means very little to the average user, while more people can figure out /css/.

    Just some thoughts I suppose.

    — Steven Canfield #

  25. re: Trackback. Check the source of this page (while it’s still open for Trackback pings). The URL of the entry is correct; any auto-discovering TB client should pick it up properly. I’m still failing to see the problem.

    Now, the Trackback URL itself, /mt/mt-tb.cgi/151, could be considered crufty. But I really don’t care about that, because end users will never see it.

    — Mark #

  26. I think the issue was that if you yourself post something that sends a Trackback ping somewhere else, the ping you send will include a crufty URL.

    Only, um, I’m not sure that’s actually true. With the method you used to create extensionless filenames, MT still knows the right URL for each entry.

    — Moss Collum #

  27. DS: What is this “Entry Keywords” field he speaks of?

    (click click click)

    Hot damn!!!

    — Dave S. #

  28. Mark: revision history is having major problems since the change. For this story all quotes are double escaped, while looking at the google calculator topic no text is showing for the revisions at all.

    — Sander #

  29. Mark is right about the trackback thing being a non-issue because of the way he sets this up. The key difference between this method and the method I and many others are using is that Mark isn’t hiding his file extensions: his permalinks don’t contain *any* file extension in the first place.

    This method is by far the simplest and cleanest one I’ve seen so far, granted you have the access privileges needed to do webserver tweaks.

    Excellent!

    — Már Örlygsson #

  30. …except, that I really don’t like the “index” at the end of the month and day archive URLs:

    http://diveintomark.org/archives/2003/08/index
    http://diveintomark.org/archives/2003/08/15/index

    …but that can be easily fixed with a simple regex rule.

    — Már Örlygsson #

  31. Mar: you’re right, that “index” is still crufty. Damn.

    — Mark #

  32. Jaykul :: Huddled Masses (trackback)
  33. I can’t seem to get DefaultType text/html or ForceType text/html to have any effect in my .htaccess files. Does anybody know of any gotchas on why this isn’t working?

    — tamaracks #

  34. My worry about this system, as I have been of late with ANY system that involves user-changeable data, is that you’re overloading the purpose of the box you’re basing your filenames on.

    a) Title is bad ‘cos it can change (or uberlong)
    b) Author is bad ‘cos a document can change hands.
    c) Keywords are bad because they’re not meant for filenames; they can be used multiple times (on the same day); multiple keywords can be used; lack of clarity (is XP ‘extreme programming’ or ‘windows xp’? what if you want to make this distinction a year from now?)

    Now, I can’t say I have an answer – I agree that dates should be in the URI path, and I agree that internal IDs create ugly URLs that mean nothing in an email (besides, possible, a date or category in the path), but I’ve yet to find a bit of data that:

    a) will never change, regardless of semantic meaning or planned updates in the future,

    b) isn’t overloading the intended field

    As such, I’m still, sadly, intending a non zero-padded unique ID for my next layout, but I still read all these various layout redesign journeys, hoping for the panacea that’ll make me happy.

    Note: Yes, technically, in MT, you can change the date that an item was posted. I consider this poisoning the data, however, and would never be something I’d ever do personally (probably as much as Mark never clarifying what ‘xp’ means. And so it goes).

    Note 2: I *could* use an MT-Meta pluginish thing to create a filename= field within my posts that I could certify would never ever change, but I’d like something that doesn’t require that much extra effort. I’d rather wait for MT-Pro’s purported custom fields feature.

    — Morbus Iff #

  35. Incidentally, Mark, I’ve never had any troubles using MultiViews with SSI – I use them all over the place, in fact (http://disobey.com/dnn/, http://gamegrene.com/, etc., etc.,) What problems were you having?

    — Morbus Iff #

  36. This is getting silly. We’re not discussing the future of the web here … this is a scheme for his blog!

    The author’s name as the file name on a post is meaningless, this is diveintomark.org … the author’s basically the same for everything.

    A keyword entry has the SAME PROBLEMS as the title entry, but simply gives flexibility to name the file manually.

    It’s as meaning(less|full) as the author makes it, although if it was really the KEYWORD … it could conceivably change more easily than the title: imagine a page: installing our software. Which starts out with a single keyword: installation, but evolves into a list of solutions for problems people have while installing. You would then want to change the ‘keywords’ to “Installation, FAQ, Problems, Getting Started” (that is, if you were actually using the keywords to drive your search at all) …

    Now that I’ve said that, it may sound as though I’m in favor of the system … you can read my post (above) to see that I’m not.

    — Jaykul #

  37. Quarter Life Crisis (trackback)
  38. GeraBlog (trackback)
  39. GeraBlog (trackback)
  40. GeraBlog (trackback)
  41. n o g w a (trackback)
  42. One advantage of going the extra step of activating the slash on the end (http://diveintomark.org/archives/2003/08/15/slugs/) is that you now have a sensible place for storing assets (such as images) relevant to that post.

    Of course, it makes a mess if relative urls are used by that page to refer to those images, which means either hard code the parent structure (which might change in the future) into the absolute URL, or server redirects to the canonical URL (much nicer).

    — Eric Scheid #

  43. Good Signor,

    I’ve read and re-read your article.  I’m afraid I’ve failed to understand the finer points.

    1. Pray you, explain the rationale for eliminating the »html« file extension.

    2. On the same note, in step 5 of your procedure you write:  »Of course, the rest of the world is still linking to the old archive pages with the .html extension.«

    In my IE6/Win, mousing/clicking »Cruft-free URLs in Movable Type« produces

    http://diveintomark.org/archives/2003/08/15/slugs

    Am I missing your point or a file extension ?  Pray, advise.

    — Anonymous #

  44. 43: His goal was to eliminate ‘cruft’, that is, url that is not essential to the url. A url of
    /archives/2003/08/15/slugs.html
    doesn’t add much when compared to
    /archives/2003/08/15/slugs
    so he removed the ‘.html’. This also happens to open up a lot of doors; if he happens to want to serve each page with PHP or any other technique he could do that without the url changing.

    »Of course, the rest of the world is still linking to the old archive pages with the .html extension.« That’s because he recently converted to not using .html, and his weblog has been running for over two years now. People linking to his earlier posts will have linked to the ‘.html’ version of his permanent links. Thus he needs a way to make sure these links still work, which is what the code in step 5 accomplishes.

    — Jesper #

  45. “url that is not essential to the url” should be “text in the url that is not essential to the url”.

    — Jesper #

  46. #43. In addition to #44, removing the “.html” means easier updating of the backend without changing the frontend. One good example is making a previous HTML file:

      http://www.example.com/about.html

    into an index file under a similarly named directory:

      http://www.example.com/about/index.html

    or changing the HTML to use SSI:

      http://www.example.com/about.shtml
      http://www.example.com/about/index.shtml

    If you use an extensionless URL, you wouldn’t need to worry about breaking URLs whenever you change the underlying technology. The about page can always be accessed as

      http://www.example.com/about

    — Eugene #

  47. #46: “This also happens to open up a lot of doors; if he happens to want to serve each page with PHP or any other technique he could do that without the url changing.”

    But yeah.

    — Jesper #

  48. #42. My own blog (at http://boston.conman.org/) works like that, although resources like images are shared across all entries of a particular day, for instance, http://boston.conman.org/2003/08/13 . The entry http://boston.conman.org/2003/08/13.2 has two images stored at http://boston.conman.org/2003/08/13/clouds.1.jpg and http://boston.conman.org/2003/08/13/clouds.2.jpg (also note that I don’t use extensions for the HTML, but I do for other resources). When writing an entry, I don’t need to remember what the final URL will be; I just stick the name of the file into the entry and my CMS system (which I wrote) will take care of adding in the rest of the location when the page is served (it’s a dynamic system).

    Also, if you chop off the day portion of the URL, you get all the entries for the month. Chop off the month, you get all the entries for the year. That was a design choice on my part. I do have an archive page that lists links to each month, but that is currently hand-maintained. I need to correct that one of these days.

    — Sean Conner #

  49. protocol7 (trackback)
  50. protocol7 (trackback)
  51. I would like to use this URL format, but I’m using server side includes. Is there a way to tell the server that all files use includes, even if they don’t have the extension .shtml? Possibly through the htaccess file?

    — Joshua Kaufman #

  52. I don’t mind the .html part of the URL. In fact, I like it (for the same reasons others have posted). Yeah, in theory we could move to something else, but I think it’s unlikely — once you have a large enough usage base, it’s incredibly difficult to switch. We’re not getting rid of TCP/IP or Windows or HTML or x86, even if we find something better. :(

    The part about switching to php or shtml brings up one of my pet peeves: the method for producing HTML (php, shtml, pl, py, cfm, whatever) is an *implementation* detail. The URL is the *interface*. The interface is HTML. So even if you’re using PHP in your implementation, your extension should be .html (or blank).

    — Amit Patel #

  53. Webspiffy (trackback)
  54. Good post! I’ll be interested to hear what method you use for removing the “index” part of category/monthly archives, should you do so (I have the same problem). I wonder what the pros/cons of using the Keyword field are over using Brad Choate’s KeyValues plugin http://www.bradchoate.com/past/keyvalues.php ? Also you might want to update http://diveintomark.org/about/templates/ sometime, which is a great resource btw.

    — oli #

  55. #51: http://httpd.apache.org/docs/howto/ssi.html or http://httpd.apache.org/docs-2.0/howto/ssi.html – the first for Apache 1.* and the second for Apache 2.* is the way to go.

    — Jesper #

  56. Don't Back Down (trackback)
  57. To get rid of “index” for your date-based archives, just use:
    /archives/<$MTArchiveDate format="%Y/%m/%d"$>/

    — Mike Steinbaugh #

  58. Mark’s article caused a lot of discussion on the Blosxom list. Here are notes about how to do permalinks well in Blosxom:
    http://www.nelson.monkey.org/~nelson/weblog/tech/blosxom/permalinks.html

    — Nelson #

  59. vedana.net (trackback)
  60. Michael Tsai's Weblog (trackback)
  61. I have to agree with Mar, getting rid of cruft is part of it, future-proofing your URLs is the other part. I spent forever trying to get mod_rewrite to strip off the index.html from the end of a URL so that trackbacks would always work. The problem was maintaining name anchor links which apache escapes so that:

    http://www.wherever.com/……/index.html#anchor
    becomes
    www.wherever.com/……./#anchor

    — tarun #

  62. Arthur is verweg! (trackback)
  63. Betalogue (trackback)
  64. OK, the spurious “/index” at the end of many internal URLs is now gone.

    — Mark #

  65. reedmaniac - iBlog (trackback)
  66. reedmaniac - iBlog (trackback)
  67. reedmaniac - iBlog (trackback)
  68. Nickel And Chromium (trackback)


© 2001–9 Mark Pilgrim