Update: I’d report this to the author, but I can’t find any semantic metadata in the article that might indicate who wrote it or how to contact them. They should look into that (whoever they are).
§
OK, so to state the obvious, this has unescaped Unicode characters in it. Safari – what I’m using here – applies the default character encoding (Western ISO Latin 1) and you get a horrendous mess.
I can get a normal view by going to View on the menu bar – you know, that useful application-centered menu that you don’t have on that Neanderthal-Age Ubuntu distro – and selecting:
Text Encoding > UTF-8
All I can say is, at least they didn’t use Windows-1252.
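The mess is easy to reproduce. As a minimal sketch in Python (the function names `misread` and `repair` are made up for illustration), this encodes text as UTF-8 and then decodes the bytes as ISO-8859-1, which is roughly what a browser does when it falls back to a Latin-1 default:

```python
def misread(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo misread(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    original = "café"
    garbled = misread(original)
    print(garbled)          # the single character é turns into two characters
    print(repair(garbled))  # round-trips back to the original text
```

Switching the Text Encoding menu to UTF-8 is the browser's equivalent of `repair`: the bytes never changed, only the interpretation.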
Salient excerpts:
Content-Type: text/html;charset=US-ASCII
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
More supporting evidence for Ruby’s postulate, I guess.
P.S. Amusingly, the page itself is well formed. It could therefore have been served as application/xhtml+xml and would have appeared just as borked, because Firefox respects RFC 3023, at least in this case.
— Sam Ruby
Categorize under “metadata will save us”.
Funny, but the second quote on http://www.fngtps.com/2006/06/discussing-unicode-for-ruby is funnier. ;)
Mike #1: your comment confused me, because I have already specified in my browser’s preferences that my preferred encoding is UTF-8. (Yes, I see a lot of funny characters from sites that assume ISO-8859-1, which, IIRC, many browsers treat identically to CP-1252 for exactly this reason.) So I checked the View/Encoding menu and was surprised to see that this document was being treated as US-ASCII. I viewed source and saw that the document includes a META element declaring the encoding to be UTF-8; Page Info dialog confirms this. And then it hit me:
mark@atlantis:~$ lynx -head -dump http://lists.w3.org/Archives/Public/www-archive/2006Dec/att-0010/SEMWEB.html | grep -i ^content-type
Content-Type: text/html;charset=US-ASCII
Metadata won’t save us. Authoritative metadata will save us. Or not.
(Edit: Sam beat me to it, but got snagged by my spam filter. Sigh.)
— Mark
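To make the conflict concrete, here is a rough sketch of the precedence at work, assuming the HTML 4.01 rule that a charset parameter on the HTTP Content-Type header outranks a META declaration inside the document. The function names are hypothetical, and the regular expression is a deliberately crude stand-in for real HTML parsing:

```python
import re
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None when absent


def meta_charset(html: str) -> Optional[str]:
    """Crudely pull a charset out of a META element (illustration only)."""
    match = re.search(r"<meta[^>]*charset=[\"']?([A-Za-z0-9_-]+)", html, re.I)
    return match.group(1).lower() if match else None


def effective_charset(content_type: str, html: str) -> Optional[str]:
    """The header wins; the META element is only a fallback."""
    return header_charset(content_type) or meta_charset(html)


if __name__ == "__main__":
    header = "text/html;charset=US-ASCII"
    meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    print(effective_charset(header, meta))       # the header's value is used
    print(effective_charset("text/html", meta))  # no header charset: META is used
```

With the header and META element quoted in this thread, the header's US-ASCII wins, which matches what the View/Encoding menu reported.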
For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.
The page is an attachment to a post from Dan Connolly to the www-archive list.
Unfortunately, there is no way to view the raw message. My hunch is that when Dan attached the file to his mail, the charset parameter was added by his mail client; in which case – I’m afraid I need to be an asshole here – the mailing list archive is doing the right thing, and there is no one for you to contact to fix it, because that’s how Dan sent the message.
Lego blocks: hieroglyphics for the 21st century.
The raw message should be in the mboxes directory for that list. (That might require W3C member access, though.)
For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.
True.
Related: Firefox treats US-ASCII as a synonym for Windows-1252, as evidenced by the Euro symbols in the screen capture.
— Sam Ruby
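Why the Euro symbols count as evidence can be checked with Python's codecs. This is a sketch of the reasoning, not of anything Firefox does internally. A UTF-8 curly quote contains the byte 0x80, which is the Euro sign only in Windows-1252; in ISO-8859-1 it is an invisible control character, and in strict US-ASCII it is not a character at all.

```python
# Bytes of a left curly quote (U+201C) as they appear in a UTF-8 page.
raw = "\u201c".encode("utf-8")  # b'\xe2\x80\x9c'

as_cp1252 = raw.decode("cp1252")   # Windows-1252 reading
as_latin1 = raw.decode("latin-1")  # ISO-8859-1 reading

# Only the Windows-1252 reading produces a Euro sign.
print("\u20ac" in as_cp1252)
print("\u20ac" in as_latin1)

# A strict US-ASCII decoder has no mapping for these bytes at all.
try:
    raw.decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False
print(strict_ascii_ok)
```

So a Euro sign on screen means the bytes went through a Windows-1252 table, whatever label the server put on them.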
Fess up, Sam, you b0rked up Mark’s post on purpose.
I see this kind of thing fairly often as I navigate the Web. Sometimes Opera will display the intended characters when Firefox/Mozilla/SeaMonkey/K-Meleon/Galeon/Epiphany will not, and sometimes it is the other way around.
— W^L+
The raw message should be in the mboxes directory for that list.
I saw that, but…
That might require W3C member access, though.
… it does.
Anne, what on this lovely earth is ‘invalid’ about “? ;)
Thijs, dunno, I viewed this in Opera 9 and got to see U+FFFD characters. I draw my conclusion from that. (Looking in Internet Explorer 7 now I see them as well, albeit rendered slightly differently.) The well-formedness checker Sam was using above apparently didn’t take the character encoding into account. Looking at its output, it seems the characters were interpreted as UTF-8.
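For illustration, the two behaviors can be imitated with Python's decoders. This is a sketch under the assumption that the browser takes the US-ASCII label literally and substitutes U+FFFD for every byte it cannot map, while the checker ignores the label and reads the bytes as UTF-8:

```python
# A curly apostrophe (U+2019) as served: UTF-8 bytes labelled US-ASCII.
raw = "Ruby\u2019s postulate".encode("utf-8")

# Taking the label literally: unmappable bytes become U+FFFD.
literal = raw.decode("ascii", errors="replace")

# Ignoring the label and reading the bytes as UTF-8: the text is intact.
lenient = raw.decode("utf-8")

print("\ufffd" in literal)  # replacement characters are present
print("\u2019" in literal)  # the apostrophe itself is gone
print(lenient)
```

Both readings start from the same bytes; they differ only in whether the declared encoding is honored.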
I may be missing something here but the document seems to be properly encoded as UTF-8. The only thing that’s wrong is the Content-Type header.
Thijs, well, that’s relevant per RFC 3023. Although if you follow that line of thought, it isn’t XML either, so I guess my assertion was wrong.
I am no longer accepting public comments on this post, but you can use this form to contact me privately. (Your message will not be published.)
§
© 2001–9 Mark Pilgrim