• Craig Anderson

Transliteration—helping users get data from Word or Excel into legacy systems


If you've spent much time copying and pasting data between systems, there's a good chance you've come across this character:


🀆

Affectionately known as the tofu block, this character normally means that there's some data which is trying to be displayed, but something doesn't know how to display it. There are numerous reasons this can happen and I'm not going to get into those details here, but there is a technique we can use to mitigate this problem—transliteration.

Transliteration is the process of converting characters which are likely to cause problems (such as “smart quotes” and accents on words like cafĂ©) into alternatives which are less likely to cause issues (so "straight quotes" and removing accents on cafe).

In My Data Chameleon, a product which prepares data for a legacy system run by the California Public Utilities Commission (CPUC) which doesn't accept any data outside of the limited ASCII character set, we use Unidecode to help users replace of remove characters which would cause problems when uploaded to the CPUC.

If you're familiar with Python, you can get up and running with this library very quickly:

>>> from unidecode import unidecode
>>> unidecode("“cafĂ©â€")
'"cafe"'

However, you should be careful—transliteration can effect the meaning of strings in ways that users might not expect:

>>> unidecode("マăƒȘă‚Șă‚«ăƒŒăƒˆ")
'mariokato'
>>> unidecode("this was đŸ’©, is now 💖")
'this was , is now '

In My Data Chameleon we let users know when they've entered non-ASCII characters which would cause problems with the CPUC, but we don't force them to correct every single character. We imagine that the vast majority of non-ASCII characters will be smart quotes and the like injected by tools like Excel, and just transliterating that data will make our users' experience much simpler.