dive into mark

Thursday, December 14, 2006

Data on the web

[screenshot of article showing encoding-related data loss]

Update: I’d report this to the author, but I can’t find any semantic metadata in the article that might indicate who wrote it or how to contact them. They should look into that (whoever they are).

18 comments

  1. OK, so to state the obvious, this has unescaped Unicode characters in it. Safari - what I’m using here - applies the default character encoding (Western ISO Latin 1) and you get a horrendous mess.

    I can get a normal view by going to View on the menu bar - you know that useful application-centered menu that you don’t have on that Neanderthal-Age Ubuntu distro - and selecting:

    Text Encoding > UTF-8

    All I can say is, at least they didn’t use Windows-1252.

    Comment by Mike — Thursday, December 14, 2006 @ 11:27 am

  2. Salient excerpts:

    Content-Type: text/html;charset=US-ASCII

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    More supporting evidence for Ruby’s postulate, I guess.

    P.S. Amusingly, the page itself is well formed. It could therefore have been served as application/xhtml+xml and would appear just as borked, because Firefox respects RFC 3023, at least in this case.

    Comment by Sam Ruby — Thursday, December 14, 2006 @ 11:59 am
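
    The two excerpts above disagree, and for text/html the charset parameter on the HTTP header outranks the META declaration. A minimal Python sketch of that precedence follows; the function names are hypothetical, and the META scan is a naive regular expression rather than a real HTML parser.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html_bytes):
    """Naively look for charset=... in the first 1024 bytes of the document."""
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """The HTTP header wins; the META declaration is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)


page = b'<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
print(effective_charset("text/html;charset=US-ASCII", page))
print(effective_charset("text/html", page))
```

    With the header quoted above, the sketch reports us-ascii even though the document itself claims utf-8; only when the header carries no charset does the META value get a say.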

  3. Categorize under “metadata will save us.”

    Comment by Mike Mariano — Thursday, December 14, 2006 @ 12:23 pm

  4. Funny, but the second quote on http://www.fngtps.com/2006/06/discussing-unicode-for-ruby is funnier. ;)

    Comment by Thijs van der Vossne — Thursday, December 14, 2006 @ 2:51 pm

  5. Mike #1: your comment confused me, because I have already specified in my browser’s preferences that my preferred encoding is UTF-8. (Yes, I see a lot of funny characters from sites that assume ISO-8859-1, which, IIRC, many browsers treat identically to CP-1252 for exactly this reason.) So I checked the View/Encoding menu and was surprised to see that this document was being treated as US-ASCII. I viewed source and saw that the document includes a META element declaring the encoding to be UTF-8; Page Info dialog confirms this. And then it hit me:

    mark@atlantis:~$ lynx -head -dump http://lists.w3.org/Archives/Public/www-archive/2006Dec/att-0010/SEMWEB.html | grep -i ^content-type
    Content-Type: text/html;charset=US-ASCII

    Metadata won’t save us. Authoritative metadata will save us. Or not.

    (Edit: Sam beat me to it, but got snagged by my spam filter. Sigh.)

    Comment by Mark — Thursday, December 14, 2006 @ 2:53 pm

  6. For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.

    Comment by Anne van Kesteren — Thursday, December 14, 2006 @ 3:29 pm

  7. The page is an attachment from a post from Dan Connolly to the www-archive list.

    Unfortunately, there is no way to view the raw message. My hunch is that when Dan attached the file to his mail, his mail client added the charset parameter; in which case – I’m afraid I need to be an asshole here – the mailing list archive is doing the right thing, and there is no one for you to contact to fix it, because that’s how Dan sent the message.

    Comment by Aristotle Pagaltzis — Thursday, December 14, 2006 @ 3:47 pm

  8. Lego blocks: hieroglyphics for the 21st century.

    Comment by Patrick Mueller — Thursday, December 14, 2006 @ 4:00 pm

  9. The raw message should be in the mboxes directory for that list. (That might require W3C member access, though.)

    Comment by David Baron — Thursday, December 14, 2006 @ 4:10 pm

  10. For what it’s worth, a page can’t be both well-formed and have “invalid” character sequences in it.

    True.

    Related: Firefox treats US-ASCII as a synonym for Windows-1252, as evidenced by the Euro symbols in the screen capture.

    Comment by Sam Ruby — Thursday, December 14, 2006 @ 6:30 pm
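
    The Euro symbols are what you would expect when UTF-8 bytes are decoded as Windows-1252. A small Python sketch, assuming the source text contained curly punctuation (the sample string is made up for illustration, not taken from the article):

```python
# A left curly quote and a curly apostrophe, as found in typical UTF-8 prose.
text = "\u201cit\u2019s"

raw = text.encode("utf-8")       # each curly character becomes three bytes
garbled = raw.decode("cp1252")   # misread those bytes as Windows-1252

print(garbled)

# The damage is reversible as long as nothing has been dropped along the way.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == text)
```

    The second byte of each sequence is 0x80, which Windows-1252 maps to the Euro sign; ISO-8859-1 has only a control character at that position. Python's cp1252 codec leaves a few byte values undefined, so other characters may raise an error here instead of producing garbage the way a browser does.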

  11. Fess up, Sam, you b0rked up Mark’s post on purpose.

    Comment by Evan Goer — Thursday, December 14, 2006 @ 11:32 pm

  12. I see this kind of thing fairly often as I navigate the Web. Sometimes Opera will display the intended characters when Firefox/Mozilla/SeaMonkey/K-Meleon/Galeon/Epiphany will not, and sometimes it is the other way around.

    Comment by W^L+ — Friday, December 15, 2006 @ 12:09 am

  13. Pingback by Internet Alchemy » Shit Happens

  14. The raw message should be in the mboxes directory for that list.

    I saw that, but…

    That might require W3C member access, though.

    … it does.

    Comment by Aristotle Pagaltzis — Friday, December 15, 2006 @ 9:08 am

  15. Anne, what on this lovely earth is ‘invalid’ about “? ;)

    Comment by Thijs van der Vossne — Friday, December 15, 2006 @ 2:50 pm

  16. Thijs, dunno, I viewed this in Opera 9 and got to see U+FFFD characters. I draw my conclusion from that. (Looking in Internet Explorer 7 now I see them as well, albeit rendered slightly differently.) The well-formedness checker Sam was using above apparently didn’t take the character encoding into account. Judging by its output, the characters were interpreted as UTF-8.

    Comment by Anne van Kesteren — Friday, December 15, 2006 @ 6:12 pm
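
    The U+FFFD characters are consistent with a decoder that takes the US-ASCII label literally. A short Python sketch of that behavior; it only illustrates strict versus lenient ASCII decoding and says nothing about how Opera or Internet Explorer are actually implemented:

```python
# The three UTF-8 bytes of a left curly quote; none of them is valid US-ASCII.
raw = "\u201c".encode("utf-8")

# A strict decoder refuses the input outright.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for every byte it cannot map.
replaced = raw.decode("ascii", errors="replace")

print(strict_ok)
print([hex(ord(ch)) for ch in replaced])
```

    Substituting replacement characters discards the original bytes, so unlike the Windows-1252 misreading, this result cannot be converted back to the intended text.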

  17. I may be missing something here but the document seems to be properly encoded as UTF-8. The only thing that’s wrong is the Content-Type header.

    Comment by Thijs van der Vossne — Saturday, December 16, 2006 @ 6:00 am

  18. Thijs, well, that’s relevant per RFC 3023. Although, if you follow that line of thought, it isn’t XML either, so I guess my assertion was wrong.

    Comment by Anne van Kesteren — Saturday, December 16, 2006 @ 10:10 am

© 2001-8 Mark Pilgrim