Character Sets and Encodings

Encodings are one of the most common causes of headaches in NLP applications. Data often arrives claiming to be in a particular encoding while actually containing byte sequences that are invalid in that encoding or, more frequently (and worse), with no indication of what the encoding might be. Complex scripts like Arabic and Devanagari pose their own challenges, since the logical characters may not match what is displayed on the screen: Arabic characters have different presentation forms for initial, medial and final positions, and Devanagari uses an extensive set of ligatures. Different encodings make different decisions about whether to encode graphemes or glyphs.
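Both problems can be illustrated with a short Python sketch. It is only a minimal illustration using the standard library; the helper name try_decode and the sample strings are hypothetical and chosen for this example. The first part shows bytes that are labelled UTF-8 but are not valid UTF-8. The second part shows that several Arabic presentation forms map back to one logical letter under NFKC normalization, and that a Devanagari conjunct usually drawn as a single ligature is stored as several code points.

```python
import unicodedata


def try_decode(data: bytes, claimed: str = "utf-8"):
    """Decode strictly; on failure, fall back to replacement characters.

    Hypothetical helper for this example. Returns (text, ok), where ok is
    False if the bytes were not valid in the claimed encoding.
    """
    try:
        return data.decode(claimed), True
    except UnicodeDecodeError:
        return data.decode(claimed, errors="replace"), False


# Bytes that are really Latin-1 but arrive labelled as UTF-8.
mislabelled = "café".encode("latin-1")  # b'caf\xe9'
text, ok = try_decode(mislabelled, "utf-8")
print(ok, repr(text))

# Arabic BEH: one logical letter (U+0628), several presentation-form code points.
beh_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]  # isolated, final, initial, medial
logical = {unicodedata.normalize("NFKC", form) for form in beh_forms}
print(logical == {"\u0628"})

# Devanagari "ksha": usually drawn as one ligature, stored as three code points
# (KA, VIRAMA, SSA).
ksha = "\u0915\u094D\u0937"
print(len(ksha), [unicodedata.name(c) for c in ksha])
```

A strict decode that fails is a useful signal that the declared encoding is wrong, but a strict decode that succeeds does not prove the label is right, since many byte sequences are valid in more than one encoding.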

Also see:

Unicode

Other Encodings

Working With Encodings in Python
