I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. So I ran over and said, “Stop! Don’t do it!”
“I can’t help it,” he cried. “I’ve lost my will to live.”
“What do you do for a living?” I asked.
He said, “I create web services specifications.”
“Me too!” I said. “Do you use REST web services or SOAP web services?”
He said, “REST web services.”
“Me too!” I said. “Do you use text-based XML or binary XML?”
He said, “Text-based XML.”
“Me too!” I said. “Do you use XML 1.0 or XML 1.1?”
He said, “XML 1.0.”
“Me too!” I said. “Do you use UTF-8 or UTF-16?”
He said, “UTF-8.”
“Me too!” I said. “Do you use Unicode Normalization Form C or Unicode Normalization Form KC?”
He said, “Unicode Normalization Form KC.”
“Die, heretic scum!” I shouted, and I pushed him over the edge.
(with apologies to Emo Philips)


Haha good one Mark, I’m guessing he’d still be over the side if he said KD too, but would you have spared him for D?
Comment by Neil Dunn — Tuesday, July 6, 2004 @ 12:36 pm
I was following you until the Normalization form whatever bit, and then you lost me. What precisely is it?
Comment by Neil T. — Tuesday, July 6, 2004 @ 12:41 pm
Definition from the Unicode.org glossary:
A process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence. In the Unicode Standard, normalization refers specifically to processing to ensure that canonical-equivalent (and/or compatibility-equivalent) strings have unique representations.
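As a minimal sketch of what "binary-compared for equivalence" means in practice, assuming Python's standard `unicodedata` module (the `equivalent` helper is a hypothetical name, not something from the Unicode glossary):

```python
import unicodedata


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00f1"     # LATIN SMALL LETTER N WITH TILDE, one code point
decomposed = "n\u0303"  # "n" followed by COMBINING TILDE, two code points

# The two strings render alike but differ code point by code point.
raw_match = composed == decomposed

# Once both are in the same normalization form, a plain comparison works.
normalized_match = equivalent(composed, decomposed)

print(raw_match, normalized_match)
```

The comparison itself stays a dumb equality test; all of the Unicode knowledge lives in the normalization step.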
Comment by Neil Dunn — Tuesday, July 6, 2004 @ 12:49 pm
See also Unicode Normalization Forms
Comment by Mark — Tuesday, July 6, 2004 @ 1:42 pm
:)
So it is.
Comment by Jesper — Tuesday, July 6, 2004 @ 1:46 pm
Would anybody mind elaborating what’s bad about normalisation forms?
When I first saw them, I went “Huh, who’d want even more ways to encode Unicode strings?” But then I was persuaded that programmers really like this kind of thing for string comparisons because the intelligence for normalising strings is right in Unicode rather than everybody having to reinvent the wheel and figure out that people may be using (MODIFIER LETTER SMALL A) when they actually mean (FEMININE ORDINAL INDICATOR) and others may just write an a.
With this site traditionally being a place that values clean solutions over brief avoidance of programming pain, I wonder what would be a better solution to the problem.
Comment by ssp — Tuesday, July 6, 2004 @ 3:07 pm
It would be interesting to hear why you think KC is so much more evil than C. I’m not saying I will disagree. Having fought that fight in IDNs, I might agree with you, but I also found that different people have different things they love/hate about KC.
Comment by Paul Hoffman — Tuesday, July 6, 2004 @ 4:10 pm
To ssp: both normalization forms C and KC are horridly complex. Programming them is prone to errors, and both take fairly large tables. Also, the Unicode Consortium has changed the rules for them even though they said they never would (and plan to do so again in the near future).
The forms are needed for sane use of Unicode, but that is because of (in hindsight) bad decisions of what to include in Unicode. If someone was re-making Unicode today, they would simply not allow the characters that need to be normalized.
Comment by Paul Hoffman — Tuesday, July 6, 2004 @ 4:14 pm
> With this site traditionally being a place that values clean solutions over brief avoidance of programming pain
You must be new here.
Comment by Mark — Tuesday, July 6, 2004 @ 4:57 pm
> If someone was re-making Unicode today, they would simply not allow the characters that need to be normalized.
Paul, are you sure? One of Unicode’s goals was to losslessly handle legacy encodings. The need for normalization seems to naturally fall out of that. Am I wrong?
Comment by Keith — Tuesday, July 6, 2004 @ 6:42 pm
> Also, the Unicode Consortium has changed the rules for them even though they said they never would (and plan to do so again in the near future).
I would be very interested to learn when they changed, and what the differences were. Also, when they are expected to change again, and what the differences are expected to be.
Comment by Mark — Tuesday, July 6, 2004 @ 6:49 pm
>> If someone was re-making Unicode today, they would simply not allow the characters that need to be normalized.
>Paul, are you sure? One of Unicode’s goals was to losslessly handle legacy encodings. The need for normalization seems to naturally fall out of that. Am I wrong?
You are right that that was one of their goals, and that normalization fell out of that goal. It turns out (again in hindsight) that the horrible mess that evolved out of that decision was not worth the pain. The legacy encodings could have been handled by a transfer algorithm instead of by codepoints.
>> Also, the Unicode Consortium has changed the rules for them even though they said they never would (and plan to do so again in the near future).
>I would be very interested to learn when they changed, and what the differences were. Also, when they are expected to change again, and what the differences are expected to be.
Too busy to look up the old one now, but the upcoming one is documented here
Comment by Paul Hoffman — Tuesday, July 6, 2004 @ 9:51 pm
To answer your original question, I have no informed opinion whatsoever about the relative merits of NFC vs. NFKC. The only analogy I’ve read that makes any sense at all is that NFKC is kind of like converting everything to lowercase so you can do case-insensitive searching (like Google). Except it’s not that; that’s just an analogy. It actually involves subtle modifications to characters that only exist in languages I neither speak nor write, and I don’t hold out a lot of hope for understanding it in the near future.
To the broader question of “why this post,” I simply like pointing out that abstractions leak, and that choices that seem simple and obvious (like “just use XML”) are usually complex and heart-wrenchingly difficult.
In this vein, I am thrilled (in a perverse kind of way) to discover that Unicode Normalization Form C is itself an unstable concept. When the foundations on which we build waver ever so slightly, we can only sit back and wonder whether our entire surroundings will come crashing down around us. Will the upcoming changes to the NFC rules affect existing digital signatures which are built on Canonical XML?
I have no solutions or better alternatives to offer. But a week ago I didn’t even understand enough to ask annoying questions.
Comment by Mark — Wednesday, July 7, 2004 @ 12:05 am
Re: merits of NFC vs. NFKC
NFKC causes data loss in the sense that it loses some presentationalism like character-level superscripts. (Applies to English, too.)
According to the sample in the spec, it also converts the special character for ångström to a regular Å. What’s the point of having a separate character ANGSTROM SIGN in the first place?
Besides, charmod mandates NFC, so someone speccing NFKC for use on the Web would run against that spec. :-)
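A rough illustration of the data-loss point, assuming Python's standard `unicodedata` module and whatever Unicode tables ship with the interpreter (the variable names are only for this sketch):

```python
import unicodedata

superscript = "x\u00b2"   # "x" followed by SUPERSCRIPT TWO
angstrom_sign = "\u212b"  # ANGSTROM SIGN

# NFC keeps the superscript; NFKC folds it to a plain digit.
nfc_super = unicodedata.normalize("NFC", superscript)
nfkc_super = unicodedata.normalize("NFKC", superscript)

# ANGSTROM SIGN is folded to the ordinary letter by both forms.
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)
nfkc_angstrom = unicodedata.normalize("NFKC", angstrom_sign)

print(repr(nfc_super), repr(nfkc_super))
print(unicodedata.name(nfc_angstrom), "/", unicodedata.name(nfkc_angstrom))
```

Note that the ANGSTROM SIGN folding is not specific to NFKC: its equivalence to the ordinary letter is canonical, so NFC replaces it as well. The superscript is the case where the two forms actually differ.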
Comment by Henri Sivonen — Wednesday, July 7, 2004 @ 3:09 am
LOL…
That was hilarious…
Good one Mark :)
Comment by Subzero Blue — Wednesday, July 7, 2004 @ 3:45 am
Re lowercase: In fact, I was wondering how it was decided exactly how far these ‘four letter normalisations’ go. E.g.: Why is ½ equal to 1/2 but A not equal to a? Most of the searches I do are case insensitive. (But at least the information for conversion to lowercase is included in Unicode as well.)
I can see how there may be shortcomings in the existing conversion algorithms and how changes in them may be annoying. It may also be that the normalisations as they exist don’t do exactly what you want them to do. Yet, my gut feeling is that this may be better than every geek coming up with his own imperfect solution to this problem.
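To make the ½ and A/a question concrete, here is a small sketch assuming Python's standard `unicodedata` module and `str.casefold()`. The `search_key` helper is a hypothetical name, and chaining NFKC with case folding this way is only an approximation of what a real search engine would do:

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

# NFKC expands the fraction into three code points.
folded_half = unicodedata.normalize("NFKC", half)
folded_names = [unicodedata.name(ch) for ch in folded_half]

# The middle character is FRACTION SLASH, so this is False.
matches_ascii = folded_half == "1/2"

# Case is not touched by NFKC at all.
upper_after_nfkc = unicodedata.normalize("NFKC", "A")


def search_key(text: str) -> str:
    """Hypothetical search key: compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()


print(folded_names, matches_ascii, upper_after_nfkc)
print(search_key("A") == search_key("a"))
```

So the folded ½ looks like 1/2 but still does not compare equal to the ASCII spelling, and case-insensitivity has to be layered on as a separate step.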
Comment by ssp — Wednesday, July 7, 2004 @ 5:34 am
The difference from lower-casing is that both the C and KC forms are supposed to end up with a string that *looks* the same. If you were shown two strings, one with E followed by ACUTE and one with E-WITH-ACUTE, you would be surprised that they do not match, even though they look identical on the screen. With internationalized URLs this becomes a big deal.
Normalization form C would convert the former to look like the latter. But Unicode has separate code points for a few symbols (like MICRO SIGN) that are expected to look the same as other letters (like small Greek MU). They *might* conceivably look different depending on what fonts you are using, but most sane people would consider them the same letter. Form C leaves these alone; form KC replaces the compatibility character with its ordinary equivalent (so I think MICRO gets changed to MU).
I believe Unicode has a whole other section discussing what it means to convert a string to lower-case, and how to mush strings up to make for string comparisons that are as unsurprising as possible…
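The two cases above can be checked with a short sketch, assuming Python's standard `unicodedata` module (the variable names are made up for illustration):

```python
import unicodedata

e_then_acute = "E\u0301"  # "E" followed by COMBINING ACUTE ACCENT
e_with_acute = "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
micro = "\u00b5"          # MICRO SIGN
mu = "\u03bc"             # GREEK SMALL LETTER MU

# Same appearance, different code points, so a raw comparison fails.
raw_equal = e_then_acute == e_with_acute

# Form C composes the pair into the single precomposed letter.
composed = unicodedata.normalize("NFC", e_then_acute)

# Form C leaves MICRO SIGN alone; form KC replaces it with MU.
micro_nfc = unicodedata.normalize("NFC", micro)
micro_nfkc = unicodedata.normalize("NFKC", micro)

print(raw_equal, composed == e_with_acute)
print(unicodedata.name(micro_nfc), "/", unicodedata.name(micro_nfkc))
```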
Comment by Damian Cugley — Wednesday, July 7, 2004 @ 6:03 am
“queries for free”
Comment by Sam Ruby — Thursday, July 8, 2004 @ 12:00 am