Sometimes it is necessary to normalize a string by removing all kinds of diacritics (accents), underlines, or other “marks” that can be attached to characters in Unicode. This matters, for example, for full-text search or text mining. Transliteration to ASCII is not an option because it would also eliminate Greek, Russian, or other non-Latin characters. With the help of the PyICU library, the task is easy to accomplish.
After converting a Python string to an ICU UnicodeString object, we can apply a transliteration operation defined as "NFD; [:M:] Remove; NFC". This operation first decomposes the Unicode string (NFD), then removes every character in the “marks” class ("[:M:] Remove"), and finally recomposes the string (NFC). At the end, the UnicodeString object is converted back to a Python str.
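The steps described above can be sketched as follows. This is a minimal sketch, not necessarily the exact listing intended here: the names remove_marks and MARK_REMOVAL_RULE are illustrative, and the guarded import is an addition so that the snippet still loads on systems where PyICU is not installed.

```python
try:
    import icu  # provided by the PyICU package
except ImportError:  # keep the snippet importable where PyICU is missing
    icu = None

# Compound transliterator ID: decompose, drop all marks, recompose.
MARK_REMOVAL_RULE = "NFD; [:M:] Remove; NFC"


def remove_marks(text: str) -> str:
    """Return text with all Unicode marks (general category M) removed."""
    if icu is None:
        raise RuntimeError("PyICU is required for remove_marks()")
    transliterator = icu.Transliterator.createInstance(MARK_REMOVAL_RULE)
    ustring = icu.UnicodeString(text)               # Python str -> ICU UnicodeString
    result = transliterator.transliterate(ustring)  # apply the compound rule
    return str(result)                              # ICU UnicodeString -> Python str
```

If the function is called often, the transliterator could be created once and reused instead of being rebuilt on every call.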
Wrapped in a function, the operation can be applied to a few sample strings to confirm that it works (combining underlines may not be displayed correctly in your browser).
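A quick check along these lines might look like the sketch below. To keep it runnable without PyICU, it uses a standard-library stand-in built on unicodedata rather than the ICU transliterator. The name strip_marks and the sample strings are assumptions chosen for illustration. The stand-in follows the same sequence of decomposing, removing marks, and recomposing.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Stdlib stand-in: NFD, drop category-M characters, then NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )
    return unicodedata.normalize("NFC", kept)


samples = [
    "Ünïcödé",         # Latin letters with accents
    "Ελληνικά",        # Greek with a tonos on the last letter
    "Москва",          # Cyrillic without decomposable letters
    "u\u0332n\u0332",  # letters followed by U+0332 COMBINING LOW LINE
]

for sample in samples:
    print(repr(sample), "->", repr(strip_marks(sample)))
```

Greek and Cyrillic base letters survive, which is the point of avoiding ASCII transliteration. One side effect to be aware of: letters that have a canonical decomposition lose their mark as well, so Cyrillic й (U+0439), which decomposes into и plus a combining breve, comes out as и.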