How to sanitize a string with mixed encodings – UTF-8 and Latin1

Stefan - July 10, 2007

Dealing with the predominant encodings in the western world (UTF-8 and Latin1, aka ISO-8859-1) isn’t that hard, in theory. But whenever I have to deal with different encodings, something strange always happens. Neither of the editors I use on the Mac (TextMate and Smultron) handles files with different encodings as well as UltraEdit does on Windows, where changing the encoding of an open file is as easy as choosing a menu entry. But I digress.

The really painful part of dealing with encodings is when you have text files (or database entries) where the encoding is mixed. Now, in theory, there is no such thing as a “mixed encoding”: a given text is encoded either in UTF-8 or in Latin1 or in some other encoding, but never in a mix of them. In practice, however, there are several situations where you get text that is partly encoded in UTF-8 and partly in Latin1. Today, for example, I started to upgrade an old blog from WordPress 1.5 to the most recent version, 2.2.1. During this process I looked at the various trackbacks and found that some of them were encoded in UTF-8 and some in Latin1.

To sanitize those 700+ trackbacks I wrote a little PHP class called Latin1UTF8. The class has two methods: mixed_to_latin1($text) and mixed_to_utf8($text), which do what their names say. Just give them some text which may contain characters encoded in UTF-8 and/or Latin1 and you’ll get back a sanitized version.

macbook:~/Documents/source/php/utf8 sf$ /Applications/MAMP/bin/php5/bin/php Latin1UTF8.php
Original: Fischerländer. Fischerländer.
Latin1: Fischerländer. Fischerländer.
UTF-8: Fischerländer. Fischerländer.
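
For readers who want to see the idea rather than the PHP class itself, here is a minimal sketch in Python. It is not the Latin1UTF8 source. It only borrows the two method names, and the helper mixed_to_text is a hypothetical name introduced for this example. The assumption is that any byte run that forms a valid UTF-8 sequence was meant as UTF-8, and every other non-ASCII byte is a single Latin1 character.

```python
def mixed_to_text(data: bytes) -> str:
    """Decode bytes that mix UTF-8 and Latin1 into one Unicode string.

    Hypothetical helper: valid UTF-8 sequences win, any other
    non-ASCII byte is read as a single Latin1 character.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Plain ASCII is identical in both encodings.
            out.append(chr(byte))
            i += 1
            continue
        # A UTF-8 lead byte announces the length of its sequence.
        if 0xC2 <= byte <= 0xDF:
            length = 2
        elif 0xE0 <= byte <= 0xEF:
            length = 3
        elif 0xF0 <= byte <= 0xF4:
            length = 4
        else:
            length = 0  # cannot start a UTF-8 sequence
        if length:
            try:
                out.append(data[i:i + length].decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass  # not valid UTF-8 here, fall back to Latin1
        out.append(bytes([byte]).decode("latin-1"))
        i += 1
    return "".join(out)


def mixed_to_utf8(data: bytes) -> bytes:
    return mixed_to_text(data).encode("utf-8")


def mixed_to_latin1(data: bytes) -> bytes:
    # Unlike the class described in this post, this raises
    # UnicodeEncodeError for characters outside Latin1.
    return mixed_to_text(data).encode("latin-1")


if __name__ == "__main__":
    mixed = "Fischerländer. ".encode("utf-8") + "Fischerländer.".encode("latin-1")
    print(mixed_to_utf8(mixed).decode("utf-8"))
```

The heuristic is ambiguous by nature: two Latin1 characters whose bytes happen to form a valid UTF-8 sequence (for example Ã followed by ¤) will be read as one UTF-8 character. For ordinary western-language text that combination is rare, which is why the approach tends to work in practice.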

Please be aware that there is no error checking! UTF-8 characters that cannot be represented in Latin1 will come out as garbage. While Latin1UTF8 may give the impression that the garbage in, garbage out rule no longer applies, be sure to give my little class reasonable data or you’ll get some surprises.
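
To make the Latin1 limitation concrete: Latin1 only covers the code points U+0000 to U+00FF, so a character such as € (U+20AC) has no Latin1 byte at all. The following Python sketch shows one way such losses could be detected before converting. The function name to_latin1_checked is hypothetical and is not part of the Latin1UTF8 class, which does no checking of this kind.

```python
from typing import List, Tuple


def to_latin1_checked(text: str) -> Tuple[bytes, List[str]]:
    """Encode text as Latin1 and report what could not be represented.

    Characters above U+00FF are replaced with '?' in the output and
    returned in a sorted list so the caller can decide what to do.
    """
    lost = sorted({ch for ch in text if ord(ch) > 0xFF})
    return text.encode("latin-1", errors="replace"), lost


if __name__ == "__main__":
    encoded, lost = to_latin1_checked("Ñ costs 5 €")
    print(encoded, lost)
```

Here Ñ survives because it is part of Latin1 (byte 0xD1), while € is reported as lost and replaced.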

Filed under: PHP

7 comments:

Hi, can I do it in ASP? Can you help me?

Thanxz

Sanitizing WordPress UTF-8 – or Howto get rid of mixed Latin1 and UTF8 mysql exports…

Actually my attention was drawn to some weird characters in my WordPress blog, such as ü or ö as representations of ä and ö. So I had a look into my MySQL database and saw that it was still on latin1. On my way to cleaning it up I came across this explanation. But …

Thanks a bunch!

This doesn’t work with some characters like Ñ or €

Thank you, this really saved me some time. I had a MySQL database that was converted to latin1 and some funny characters popped up. This quickly solved my problem, without me needing to code a solution.

Thanks again

“This doesn’t work with some characters like Ñ or €”
I’m researching now, but PHP seems to not be able to convert chars 127-159, including the euro symbol. Not sure of answer. Author – test your code(!?)

Thanks! I was struggling with some badly coded software that used mixed Latin1 and UTF-8. Your little class works like a charm :)
