I just noticed that one of my “further reading” links for Changes in XHTML 2.0 is in Icelandic:
If you hover your cursor over the link, it will tell you “This page is in Icelandic”, since that’s the title of the <a> tag. If you dig into the HTML source, you’ll see that the <a> tag also includes an hreflang attribute, which specifies the language code of the page being linked to. (Just when you thought you knew everything there was to know about HTML…)
You see, my script that generates the “further reading” links actually downloads each referring page and verifies that it really links to the post my access logs claim it links to. (This is primarily to minimize referrer spamming.) Along the way, it gathers some other useful information, such as the page title (used to construct the link text) and the page language (used to construct the title and hreflang).
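To make that concrete, here is a rough sketch of the two steps: checking that a downloaded page really links to the post, and building the link with its title and hreflang. This is not the real script. It uses Python 3's html.parser, and the names LinkCollector, links_to, build_link, and the three-entry LANGUAGE_NAMES table are all made up for illustration (a real table would cover all of ISO 639).

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for a full ISO 639 code-to-name table.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "is": "Icelandic"}


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the referring page's URL.
                self.links.add(urljoin(self.base_url, href))


def links_to(page_html, page_url, post_url):
    """Return True if the referring page really contains a link to post_url."""
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    collector.close()
    return post_url in collector.links


def build_link(url, page_title, lang_code):
    """Build the <a> tag; add hreflang and title only for a known language."""
    attrs = 'href="%s"' % escape(url, quote=True)
    name = LANGUAGE_NAMES.get(lang_code)
    if name:
        attrs += ' hreflang="%s" title="This page is in %s"' % (
            escape(lang_code, quote=True),
            name,
        )
    return "<a %s>%s</a>" % (attrs, escape(page_title))
```

A referrer that fails links_to would simply be dropped, which is the spam filter. Comparing exact URLs is the simplest possible check; a real script would probably want to normalize trailing slashes and the like first.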
There are several correct ways to specify your language, and several incorrect ways that are commonly used. I recommend using the xml:lang and/or lang attributes in your <html> tag, but you can also specify it in a meta tag. And the web server itself can theoretically specify a default language for pages that don’t specify it, although few servers do.
Parsing the HTML to get this metadata is tricky. Python has the sgmllib library that does the hard, generic part (the regular expressions) and provides a framework for coding the application-specific part (dealing with all the variations and mistakes in specifying the actual metadata). I discuss sgmllib in detail in chapter 4 of Dive Into Python.
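For a feel of what the application-specific part looks like, here is a minimal sketch that pulls out the title and the declared language. It is an illustration rather than the code in htmlinfo.py: sgmllib was removed in Python 3, so this swaps in html.parser, which offers a similar handler-method framework. The class name MetadataParser, the function get_metadata, and the precedence order (xml:lang, then lang, then a Content-Language meta tag) are assumptions, and it ignores the HTTP header entirely.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Pull the page title and declared language out of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.language = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            # Assumed precedence: xml:lang first, then lang.
            lang = attrs.get("xml:lang") or attrs.get("lang")
            if lang:
                self.language = lang
        elif tag == "meta" and self.language is None:
            # Fallback: <meta http-equiv="Content-Language" content="is">
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.language = attrs.get("content") or None
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def get_metadata(page_html):
    """Return (title, language); language is None if the page declares none."""
    parser = MetadataParser()
    parser.feed(page_html)
    parser.close()
    language = parser.language.strip().lower() if parser.language else None
    # Collapse runs of whitespace so the title works as one-line link text.
    return " ".join(parser.title.split()), language
```

For example, get_metadata('<html lang="is"><head><title>Halló</title></head></html>') returns ('Halló', 'is'). The mistakes people make in the wild (language names instead of codes, codes in the wrong attribute, and so on) are exactly the variations this sketch does not try to handle.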
Anyway, I finally got it all working, and it’s been quietly working for a few months now, but today was the first time I’d seen Icelandic. It’s a rare occurrence that one of my scripts is written well enough to surprise me.
htmlinfo.py, iso639.py. GPL-licensed.
§
© 2001–9 Mark Pilgrim