python - Normalizing Unicode -

July 15, 2014

Is there a standard way, in Python, to normalize a Unicode string so that it only comprises the simplest Unicode entities that can be used to represent it?

I mean something that would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] to ['LATIN SMALL LETTER A WITH ACUTE'].

Here is the problem:

>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']

But now:

>>> char = "a\u0301"  # renders as "á": 'a' followed by a combining accent
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']

I could, of course, iterate over all the chars and do manual replacements, etc., but that is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.

The unicodedata module offers a .normalize() function; you want to normalize to the NFC form:

>>> unicodedata.normalize('NFC', u'\u0061\u0301')
u'\xe1'
>>> unicodedata.normalize('NFD', u'\u00e1')
u'a\u0301'

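NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters, i.e. a base character followed by its combining marks. Note that the form name must be given in upper case.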
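As a minimal sketch of how this answers the original question, the snippet below composes the two-code-point sequence and lists the code point names before and after. The helper name `names` is made up for illustration and is not part of unicodedata; the code assumes Python 3.

```python
import unicodedata


def names(text):
    """Return the Unicode name of every code point in text (illustrative helper)."""
    return [unicodedata.name(c) for c in text]


decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), names(decomposed))
print(len(composed), names(composed))

# The two strings render the same but only compare equal
# once both are in the same normalization form.
print(decomposed == "\u00e1")
print(composed == "\u00e1")
```

Normalizing both sides of a comparison to the same form (NFC here) is usually what you want before testing strings for equality or using them as dictionary keys.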

The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 (ROMAN NUMERAL ONE) is really just the same thing as U+0049 (LATIN CAPITAL LETTER I), but it is present in the Unicode standard to remain compatible with encodings that treat them separately. Using either the NFKC or NFKD form will, in addition to composing or decomposing characters, also replace all 'compatibility' characters with their canonical form:

>>> unicodedata.normalize('NFC', u'\u2167')  # roman numeral VIII
u'\u2167'
>>> unicodedata.normalize('NFKC', u'\u2167') # roman numeral VIII
u'VIII'
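To see which characters are affected, a small sketch (Python 3 assumed) can compare NFC and NFKC side by side and print the decomposition mapping that unicodedata reports. The function name `compat_fold` and the sample characters are chosen here for illustration only.

```python
import unicodedata


def compat_fold(text):
    """Apply NFKC: compose, and replace compatibility characters (illustrative helper)."""
    return unicodedata.normalize("NFKC", text)


samples = ["\u2160", "\u2167", "\ufb01"]  # ROMAN NUMERAL ONE, ROMAN NUMERAL EIGHT, LIGATURE FI

for ch in samples:
    nfc = unicodedata.normalize("NFC", ch)
    # decomposition() returns the raw mapping; a "<compat>" tag marks a
    # compatibility (not canonical) decomposition.
    print(unicodedata.name(ch), ascii(nfc), ascii(compat_fold(ch)),
          unicodedata.decomposition(ch))
```

Because NFKC discards distinctions such as ligatures and Roman numeral characters, it suits searching and matching better than storing text you need to reproduce exactly.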

Note that there is no guarantee that composing and decomposing round-trip: decomposing a precomposed character to NFD form and then recomposing the result with NFC does not always yield the original character sequence. The Unicode standard maintains a list of exceptions; characters on this list can be decomposed, but NFC will not recompose them into the precomposed form, for various reasons. Also see the documentation on the Composition Exclusion Table.
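A short sketch of a round-trip check, assuming Python 3. The helper `round_trips` is hypothetical, and the two failing examples are characters I picked: U+0958 (DEVANAGARI LETTER QA) is listed in the composition exclusions, and U+212B (ANGSTROM SIGN) has a singleton decomposition to U+00C5.

```python
import unicodedata


def round_trips(ch):
    """True if decomposing ch (NFD) and recomposing (NFC) gives ch back."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.normalize("NFC", decomposed) == ch


print(round_trips("\u00e1"))  # LATIN SMALL LETTER A WITH ACUTE: recomposes
print(round_trips("\u0958"))  # DEVANAGARI LETTER QA: stays decomposed under NFC
print(round_trips("\u212b"))  # ANGSTROM SIGN: NFC yields U+00C5 instead
```

In practice this means you should pick one form and normalize to it consistently, not expect to recover the exact original code points afterwards.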
