Removing diacritics, underlines and other marks from unicode strings in Python

Posted at 05 Jan 2016
Tags: icu, python, unicode

Sometimes it is necessary to normalize a string by removing all kinds of diacritics (accents), underlines or other “marks” that can be attached to characters in unicode. This is important for example for full-text search or text mining. Transliteration to ASCII characters is not an option because this would for example also eliminate Greek, Russian or other characters. With the help of the PyICU library, the task can easily be achieved:

from icu import UnicodeString, Transliterator, UTransDirection
u = UnicodeString(s)
t = Transliterator.createInstance("NFD; [:M:] Remove; NFC", UTransDirection.FORWARD)
normalized = str(u)

After converting a Python string to an ICU UnicodeString object, we can apply a transliteration operation that is defined as "NFD; [:M:] Remove; NFC“. This operation means the Unicode string is at first decomposed (NFD), then the character class "marks” is removed (“[:M:] Remove”) and finally the string is re-composed again (NFC). At the end, the UnicodeString object is converted back to a Python str.

After defining a function we can use it as follows and see that it works (underlines may not be displayed correctly in your browser):

> 'cafe'
normalize_string('αυτῷ μνήμ̳η̳ς')
> 'αυτω μνημης'
If you spotted a mistake or want to comment on this post, please contact me: post -at- mkonrad -dot- net.
← “Python implementation of Linde-Buzo-Gray algorithm
View all posts
Finding out annual music favorites from Clementine music player” →