Despite a complete lack of fanfare or self-promotion, much of the Python-loving world seems to have found my Universal Encoding Detector, which is a pure-Python port of Mozilla’s encoding detection. UED is used in a variety of end-user applications and other developer libraries, including:
- Gaupol, a subtitle editor/shifter/tweaker (wonderful for converting PAL-timed subtitles found on OpenSubtitles.com)
- Griffith, a film collection manager
- ninix-aya, which is apparently some sort of desktop mascot
- html5lib, a standards-compliant tokenizing HTML parser
- My own Universal Feed Parser, and other applications that rely on it, like Planet and Venus
- Pygments, and other applications that rely on it, like Odtwriter and Pudge
- Wah!Cade, a MAME front-end
And probably some others I don’t know about.
This is what it feels like to be an upstream author. And I use the term “author” loosely, since all I did was port somebody else’s wicked-smart algorithm, introduce new bugs, and write a few incoherent pages of documentation. But still, it is humbling to step back and observe the enormous worldwide community that is constantly packaging, updating, integrating, and distributing this stuff.
Anyway, version 1.0.1 is out, with a whopping two bugs fixed. Sorry it’s so late, but I was busy practicing witchcraft and becoming a lesbian.
Yeah, I didn’t see that coming either.


Wow, being a Venus user, I’ve been using this one without even knowing it. Nice!
Comment by Scott Johnson — Wednesday, March 5, 2008 @ 4:48 pm
Clarification: Planet and Venus will make use of chardet if it is already installed, but do not include or distribute it. The same is true for html5lib.
Comment by Sam Ruby — Wednesday, March 5, 2008 @ 7:09 pm
I’m coding a stupid Trac-like tool for Perforce, and I had problems importing some changelist (changeset) comments into it, but that’s all over now thanks to your chardet.
Thanks!
Benjamin.
http://code.google.com/p/p4watch/source/diff?r=9&format=side&path=/trunk/django_web/p4populate.py
Comment by Benjamin Sergeant — Thursday, March 6, 2008 @ 12:32 am
I didn’t even know it was yours! You rock.
Comment by Zack — Thursday, March 6, 2008 @ 1:18 am
How ironic that the Mozilla page on universal character encoding detection is being served with the incorrect character encoding.
Perhaps [lazyweb]someone[/lazyweb] should also port this code to Apache.
Comment by Dotan Dimet — Friday, March 7, 2008 @ 7:10 pm
#6: The fastest way to do that is to write a mod_perl handler (it need not be very large) and hook it up with Encode::Detect, the Perl interface to Mozilla’s nsUniversalDetector.
http://search.cpan.org/dist/Encode-Detect/
Comment by Anonymous — Saturday, March 8, 2008 @ 10:37 pm