[screenshot of article showing encoding-related data loss]

Update: I’d report this to the author, but I can’t find any semantic metadata in the article that might indicate who wrote it or how to contact them. They should look into that (whoever they are).

§

Eighteen comments

  1. OK, so to state the obvious, this has unescaped Unicode characters in it. Safari – what I’m using here – applies the default character encoding (Western ISO Latin 1) and you get a horrendous mess.

    I can get a normal view by going to View on the menu bar – you know that useful application-centered menu that you don’t have on that Neanderthal-Age Ubuntu distro – and selecting:

    Text Encoding > UTF-8

    All I can say is, at least they didn’t use Windows-1252.

    — Mike #

  2. Salient excerpts:

    Content-Type: text/html;charset=US-ASCII

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    More supporting evidence for Ruby’s postulate, I guess.

    P.S. Amusingly, the page itself is well formed. It could therefore have been served as application/xhtml+xml and would appear just as borked, because Firefox respects RFC 3023, at least in this case.

    — Sam Ruby #

  3. Categorize under “metadata will save us.”

    — Mike Mariano #

  4. Funny, but the second quote on http://www.fngtps.com/2006/06/discussing-unicode-for-ruby is funnier. ;)

    — Thijs van der Vossne #

  5. Mike #1: your comment confused me, because I have already specified in my browser’s preferences that my preferred encoding is UTF-8. (Yes, I see a lot of funny characters from sites that assume ISO-8859-1, which, IIRC, many browsers treat identically to CP-1252 for exactly this reason.) So I checked the View/Encoding menu and was surprised to see that this document was being treated as US-ASCII. I viewed source and saw that the document includes a META element declaring the encoding to be UTF-8; the Page Info dialog confirms this. And then it hit me:

    mark@atlantis:~$ lynx -head -dump http://lists.w3.org/Archives/Public/www-archive/2006Dec/att-0010/SEMWEB.html | grep -i ^content-type
    Content-Type: text/html;charset=US-ASCII

    Metadata won’t save us. Authoritative metadata will save us. Or not.

    (Edit: Sam beat me to it, but got snagged by my spam filter. Sigh.)

    — Mark #

  6. For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.

    — Anne van Kesteren #

  7. The page is an attachment from a post by Dan Connolly to the www-archive list.

    Unfortunately, there is no way to view the raw message. My hunch is that when Dan attached the file to his mail, his mail client added the charset parameter; in which case – I’m afraid I need to be an asshole here – the mailing list archive is doing the right thing, and there is no one for you to contact to fix it, because that’s how Dan sent the message.

    — Aristotle Pagaltzis #

  8. Lego blocks: hieroglyphics for the 21st century.

    — Patrick Mueller #

  9. The raw message should be in the mboxes directory for that list. (That might require W3C member access, though.)

    — David Baron #

  10. “For what it’s worth, a page can’t be both well-formed and have ‘invalid’ character sequences in it.”

    True.

    Related: Firefox treats US-ASCII as a synonym for Windows-1252, as evidenced by the Euro symbols in the screen capture.

    — Sam Ruby #

  11. Fess up, Sam, you b0rked up Mark’s post on purpose.

    — Evan Goer #

  12. I see this kind of thing fairly often as I navigate the Web. Sometimes Opera will display the intended characters when Firefox/Mozilla/SeaMonkey/K-Meleon/Galeon/Epiphany will not, and sometimes it is the other way around.

    — W^L+ #

  13. Internet Alchemy » Shit Happens (pingback)

  14. “The raw message should be in the mboxes directory for that list.”

    I saw that, but…

    “That might require W3C member access, though.”

    … it does.

    — Aristotle Pagaltzis #

  15. Anne, what on this lovely earth is ‘invalid’ about “? ;)

    — Thijs van der Vossne #

  16. Thijs, dunno, I viewed this in Opera 9 and got to see U+FFFD characters. I draw my conclusion from that. (Looking in Internet Explorer 7 now I see them as well, albeit slightly different.) The well-formedness checker Sam was using above apparently didn’t take the character encoding into account. Looking at the output of that it seems the characters were interpreted as UTF-8.

    — Anne van Kesteren #

  17. I may be missing something here but the document seems to be properly encoded as UTF-8. The only thing that’s wrong is the Content-Type header.

    — Thijs van der Vossne #

  18. Thijs, well, that’s relevant per RFC 3023. Although if you follow that line of thought it isn’t XML either so I guess my assertion was wrong.

    — Anne van Kesteren #
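
For anyone who wants to reproduce the mess at home, here is a minimal sketch in Python of what comments 2, 5, 10 and 16 describe. The two Content-Type values are the ones quoted in the thread; the helper names and the sample text are made up for illustration, and the "header beats META" rule in `effective_charset` is a simplification of what browsers do for text/html, not a full sniffing algorithm.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def effective_charset(http_header, meta_content):
    """Hypothetical helper: an HTTP-level charset wins over the META declaration."""
    return charset_from_content_type(http_header) or charset_from_content_type(meta_content)


# The two declarations quoted in the comments above.
http_header = "text/html;charset=US-ASCII"
meta_content = "text/html; charset=utf-8"
declared = effective_charset(http_header, meta_content)

# Made-up sample text: one word wrapped in curly quotes (U+201C, U+201D).
sample = "\u201cmetadata\u201d"
raw = sample.encode("utf-8")

as_utf8 = raw.decode("utf-8")                       # what the author intended
as_cp1252 = raw.decode("cp1252", errors="replace")  # US-ASCII treated as Windows-1252
as_ascii = raw.decode("ascii", errors="replace")    # strict US-ASCII, bad bytes become U+FFFD

# ascii() keeps the output printable whatever the terminal's own encoding is.
print(declared)
print(ascii(as_utf8))
print(ascii(as_cp1252))
print(ascii(as_ascii))
```

The Windows-1252 reading turns each curly quote into three characters, one of which is the Euro sign (U+20AC), while the strict US-ASCII reading yields nothing but U+FFFD replacement characters around the word. The bytes themselves are the same in every case; only the label changes.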


§

© 2001–9 Mark Pilgrim