How to sanitize a string with mixed encodings – UTF-8 and Latin1
Dealing with the predominant encodings in the western world (UTF-8 and Latin1, aka ISO-8859-1) isn’t that hard, in theory. But whenever I have to deal with different encodings, something strange always happens. Neither of the editors I use on the Mac (TextMate and Smultron) handles files with different encodings as well as UltraEdit does on Windows, where changing the encoding of an open file is as easy as choosing a menu entry. But I digress.
The really painful part of dealing with encodings is when you have text files (or database entries) where the encoding is mixed. Now, in theory, there is no such thing as a “mixed encoding”: a given text is encoded either in UTF-8 or in Latin1 (or in some other encoding), but never in a “mix”. In practice, however, there are several situations where you end up with text that is partly encoded in UTF-8 and partly in Latin1. Today, for example, I started to upgrade an old blog from WordPress 1.5 to the most recent version, 2.2.1. During this process I looked at the various trackbacks and learned that they were sometimes encoded in UTF-8 and sometimes in Latin1.
To sanitize those 700+ trackbacks I wrote a little PHP class called Latin1UTF8. The class has two methods: mixed_to_latin1($text) and mixed_to_utf8($text), which do what their names say. Just give them some text which may contain characters encoded in UTF-8 and/or Latin1 and you’ll get back a sanitized version.
macbook:~/Documents/source/php/utf8 sf$ /Applications/MAMP/bin/php5/bin/php Latin1UTF8.php
Original: Fischerländer. FischerlÃ¤nder.
Latin1: Fischerländer. Fischerländer.
UTF-8: FischerlÃ¤nder. FischerlÃ¤nder.
Please be aware that there is no error checking! UTF-8 characters that cannot be represented in Latin1 will come back as garbage. Latin1UTF8 may give the impression that the “garbage in, garbage out” rule no longer applies, but be sure to give my little class reasonable data or you’ll get some surprises.
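If you want to know in advance which characters would not survive a conversion to Latin1, a check like the following might help. This is a small Python sketch, not part of the PHP class; the function name `to_latin1_checked` is hypothetical. It relies on the fact that Latin1 covers exactly the code points 0 to 255.

```python
from typing import List, Tuple


def to_latin1_checked(text: str) -> Tuple[bytes, List[str]]:
    """Encode to Latin1 and also report the characters that did not fit."""
    # Latin1 can only represent code points U+0000 through U+00FF.
    lost = [ch for ch in text if ord(ch) > 0xFF]
    # Unrepresentable characters are replaced with "?" instead of raising.
    return text.encode("latin-1", errors="replace"), lost


if __name__ == "__main__":
    encoded, lost = to_latin1_checked("Preis: 5 €")
    print(encoded, lost)
```

The euro sign, for example, exists in UTF-8 but not in Latin1, so it ends up in the list of lost characters.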
Download class Latin1UTF8