Update: I’d report this to the author, but I can’t find any semantic metadata in the article that might indicate who wrote it or how to contact them. They should look into that (whoever they are).
© 2001-8 Mark Pilgrim
OK, so to state the obvious, this has unescaped Unicode characters in it. Safari - what I’m using here - applies the default character encoding (Western ISO Latin 1) and you get a horrendous mess.
I can get a normal view by going to View on the menu bar - you know that useful application-centered menu that you don’t have on that Neanderthal-Age Ubuntu distro - and selecting:
Text Encoding > UTF-8
All I can say is, at least they didn’t use Windows-1252.
Comment by Mike — Thursday, December 14, 2006 @ 11:27 am
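The mess Mike describes can be reproduced outside a browser. The following is a minimal sketch, assuming the page's bytes are UTF-8 and that the browser's "Western ISO Latin 1" default behaves like Python's `iso-8859-1` codec; the sample string is made up for illustration and is not taken from the page.

```python
# Hypothetical sample text containing a curly apostrophe (U+2019).
text = "I\u2019m"

# What the server actually sends: the UTF-8 byte sequence.
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong charset: every byte of the multi-byte
# sequence turns into a separate, unrelated character.
wrong = utf8_bytes.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
right = utf8_bytes.decode("utf-8")

print(len(text), len(wrong))
print(repr(wrong))
print(repr(right))
```

One character in the source becomes three in the misdecoded output, which is why a page with only a handful of curly quotes still looks badly broken.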
Salient excerpts:
Content-Type: text/html;charset=US-ASCII
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
More supporting evidence for Ruby’s postulate, I guess.
P.S. Amusingly, the page itself is well formed. It could therefore have been served as application/xhtml+xml and appear precisely just as borked, because Firefox respects RFC 3023, at least in this case.
Comment by Sam Ruby — Thursday, December 14, 2006 @ 11:59 am
Categorize under metadata will save us.
Comment by Mike Mariano — Thursday, December 14, 2006 @ 12:23 pm
Funny, but the second quote on http://www.fngtps.com/2006/06/discussing-unicode-for-ruby is funnier. ;)
Comment by Thijs van der Vossne — Thursday, December 14, 2006 @ 2:51 pm
Mike #1: your comment confused me, because I have already specified in my browser’s preferences that my preferred encoding is UTF-8. (Yes, I see a lot of funny characters from sites that assume ISO-8859-1, which, IIRC, many browsers treat identically to CP-1252 for exactly this reason.) So I checked the View/Encoding menu and was surprised to see that this document was being treated as US-ASCII. I viewed source and saw that the document includes a META element declaring the encoding to be UTF-8; Page Info dialog confirms this. And then it hit me:
mark@atlantis:~$ lynx -head -dump http://lists.w3.org/Archives/Public/www-archive/2006Dec/att-0010/SEMWEB.html | grep -i ^content-type
Content-Type: text/html;charset=US-ASCII
Metadata won’t save us. Authoritative metadata will save us. Or not.
(Edit: Sam beat me to it, but got snagged by my spam filter. Sigh.)
Comment by Mark — Thursday, December 14, 2006 @ 2:53 pm
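The "authoritative metadata" point can be sketched as a precedence rule: a charset in the HTTP Content-Type header overrides one declared in a META element. This is a simplified model, not the full encoding detection a real browser performs; the function names and the `windows-1252` fallback are assumptions made for illustration.

```python
import re


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type, meta_content, default="windows-1252"):
    """Simplified precedence: HTTP header first, then META, then a default.

    The default is a hypothetical stand-in for a browser's configured fallback.
    """
    return (
        charset_from_content_type(http_content_type or "")
        or charset_from_content_type(meta_content or "")
        or default
    )


# The two declarations quoted in this thread.
header = "text/html;charset=US-ASCII"
meta = "text/html; charset=utf-8"

print(effective_charset(header, meta))       # the header wins
print(effective_charset("text/html", meta))  # no header charset, so META is used
```

Under this model the META declaration on the page never gets a say, because the server already supplied a charset.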
For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.
Comment by Anne van Kesteren — Thursday, December 14, 2006 @ 3:29 pm
The page is an attachment from a post from Dan Connolly to the www-archive list.
Unfortunately, there is no way to view the raw message; because my hunch is that when Dan attached the file to his mail, the charset parameter was added by his mail client; in which case – I’m afraid I need to be an asshole here – the mailing list archive is doing the right thing, and there is no one for you to contact to fix it, because that’s how Dan sent the message.
Comment by Aristotle Pagaltzis — Thursday, December 14, 2006 @ 3:47 pm
Lego blocks: hieroglyphics for the 21st century.
Comment by Patrick Mueller — Thursday, December 14, 2006 @ 4:00 pm
The raw message should be in the mboxes directory for that list. (That might require W3C member access, though.)
Comment by David Baron — Thursday, December 14, 2006 @ 4:10 pm
True.
Related: Firefox treats US-ASCII as a synonym for Windows-1252, as evidenced by the Euro symbols in the screen capture.
Comment by Sam Ruby — Thursday, December 14, 2006 @ 6:30 pm
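The Euro symbols are what you would expect if UTF-8 curly quotes are decoded as Windows-1252. A small sketch, assuming the page contains a left double quotation mark (U+201C) and using Python's codecs as a stand-in for the browser's decoder:

```python
# A left double quotation mark, as found in typographically "smart" text.
left_quote = "\u201c"

# Its UTF-8 encoding is the three bytes E2 80 9C.
raw = left_quote.encode("utf-8")

# Windows-1252 maps byte 0x80 to the Euro sign, so the Euro shows up
# in the middle of the garbage.
as_cp1252 = raw.decode("windows-1252")
print(as_cp1252)

# A strict US-ASCII decoder has no mapping for bytes above 0x7F at all.
try:
    raw.decode("ascii")
    strict_ascii_failed = False
except UnicodeDecodeError:
    strict_ascii_failed = True
print(strict_ascii_failed)
```

True US-ASCII defines nothing for byte 0x80, so a Euro sign on screen suggests the bytes went through a Windows-1252 table instead.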
Fess up, Sam, you b0rked up Mark’s post on purpose.
Comment by Evan Goer — Thursday, December 14, 2006 @ 11:32 pm
I see this kind of thing fairly often as I navigate the Web. Sometimes Opera will display the intended characters when Firefox/Mozilla/SeaMonkey/K-Meleon/Galeon/Epiphany will not, and sometimes it is the other way around.
Comment by W^L+ — Friday, December 15, 2006 @ 12:09 am
I saw that, but…
… it does.
Comment by Aristotle Pagaltzis — Friday, December 15, 2006 @ 9:08 am
Anne, what on this lovely earth is ‘invalid’ about “? ;)
Comment by Thijs van der Vossne — Friday, December 15, 2006 @ 2:50 pm
Thijs, dunno, I viewed this in Opera 9 and got to see U+FFFD characters. I draw my conclusion from that. (Looking in Internet Explorer 7 now I see them as well, albeit slightly different.) The well-formedness checker Sam was using above apparently didn’t take the character encoding into account. Looking at the output of that it seems the characters were interpreted as UTF-8.
Comment by Anne van Kesteren — Friday, December 15, 2006 @ 6:12 pm
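One way U+FFFD characters could arise is a decoder that takes the US-ASCII label literally and substitutes the replacement character for every byte above 0x7F. This sketch only illustrates that behavior with Python's `ascii` codec; it is an assumption, not a description of how Opera or Internet Explorer actually decode the page.

```python
# UTF-8 bytes for a left double quotation mark (U+201C).
raw = "\u201c".encode("utf-8")

# Lenient ASCII decoding: each byte outside the 7-bit range is
# replaced with U+FFFD REPLACEMENT CHARACTER.
replaced = raw.decode("ascii", errors="replace")

print([hex(ord(ch)) for ch in replaced])
```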
I may be missing something here but the document seems to be properly encoded as UTF-8. The only thing that’s wrong is the Content-Type header.
Comment by Thijs van der Vossne — Saturday, December 16, 2006 @ 6:00 am
Thijs, well, that’s relevant per RFC 3023. Although if you follow that line of thought it isn’t XML either so I guess my assertion was wrong.
Comment by Anne van Kesteren — Saturday, December 16, 2006 @ 10:10 am