Python - Normalizing Unicode
Is there a standard way, in Python, to normalize a Unicode string so that it uses only the simplest Unicode entities that can represent it?
I mean something that would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'].
See where the problem is:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "a\u0301"  # displays as "á", but is 'a' plus a combining accent
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the characters and do manual replacements, but that is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form:
>>> unicodedata.normalize('NFC', u'\u0061\u0301')
u'\xe1'
>>> unicodedata.normalize('NFD', u'\u00e1')
u'a\u0301'
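NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters, that is, a base character followed by its combining marks. Note that the form name must be passed in uppercase; .normalize() rejects 'nfc' as an invalid normalization form.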
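Applied to the strings from the question, this can be sketched as follows. This is a minimal example that assumes Python 3 (where every str is Unicode and the u prefix is optional); the helper name `names` is made up for illustration.

```python
import unicodedata


def names(text):
    """Return the Unicode name of each code point in text (hypothetical helper)."""
    return [unicodedata.name(c) for c in text]


decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), names(decomposed))
print(len(composed), names(composed))

# The two spellings look identical on screen but compare unequal
# until both sides are normalized to the same form.
print(decomposed == "\u00e1")
print(unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", "\u00e1"))
```

Which form to normalize to matters less than normalizing both sides of a comparison to the same one.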
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I), but is present in the Unicode standard to remain compatible with encodings that treat them separately. Using either the NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form:
>>> unicodedata.normalize('NFC', u'\u2167')  # roman numeral VIII
u'\u2167'
>>> unicodedata.normalize('NFKC', u'\u2167')  # roman numeral VIII
u'VIII'
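The same comparison can be run on a few other compatibility characters. This is a small sketch assuming Python 3; the function name `canonical_and_compat` and the choice of sample characters are illustrative only.

```python
import unicodedata


def canonical_and_compat(ch):
    """Return (NFC, NFKC) normalizations of ch (hypothetical helper)."""
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch)


samples = [
    "\u2167",  # ROMAN NUMERAL EIGHT
    "\ufb01",  # LATIN SMALL LIGATURE FI
    "\u00b2",  # SUPERSCRIPT TWO
]

for ch in samples:
    nfc, nfkc = canonical_and_compat(ch)
    # NFC leaves these characters alone; NFKC replaces them with
    # their compatibility equivalents.
    print(unicodedata.name(ch), ascii(nfc), ascii(nfkc))
```

Because NFKC discards distinctions such as superscript versus plain digit, it tends to suit searching and comparison better than storing text that should display as written.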
Note that there is no guarantee that the composed and decomposed forms round-trip; normalizing a character sequence to NFD and then converting the result back to NFC does not always produce the original sequence. The Unicode standard maintains a list of exceptions; characters on this list can be decomposed, but their decomposed form is not composed back into the original character, for various reasons. Also see the documentation on the Composition Exclusion Table.
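As a rough illustration of one such exception, assuming Python 3: U+0958 (DEVANAGARI LETTER QA) is one of the characters listed in the Composition Exclusion Table. The helper name `codepoints` is made up for this sketch.

```python
import unicodedata


def codepoints(text):
    """Format each code point in text as U+XXXX (hypothetical helper)."""
    return ["U+%04X" % ord(c) for c in text]


qa = "\u0958"  # DEVANAGARI LETTER QA, a composition exclusion

nfd = unicodedata.normalize("NFD", qa)   # decomposes into base letter + nukta
nfc = unicodedata.normalize("NFC", nfd)  # is not composed back into U+0958

print(codepoints(qa))
print(codepoints(nfd))
print(codepoints(nfc))
print(nfc == qa)
```

For this character, even normalizing the original single code point straight to NFC yields the two-code-point sequence, so the precomposed code point never appears in normalized text.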