16/03/2009
Convert greek characters from latin1 mysql database fields to pure utf8
The Problem
To sum it up, the case is this: many many many web applications were programmed so that they used latin1 collation for their fields inside mysql databases. But most users now use utf8 from within their browsers. What happens is that utf8 characters are getting stored inside latin1 fields, which in return produces chaos! A huge web application that used that kind of madness was WordPress. Luckily (or not) WordPress now uses utf8 everywhere. I’ve known many many many people that got so frustrated when they tried to move from their old WordPress installation to a newer one because all their greek posts couldn’t be exported “easily”, I won’t say “properly” because there are always solutions to problems like this, but all the solutions were not straightforward at all, that they finally dumped the idea of moving the posts and started a new blog.
This is a HUGE problem for many greek (and not only) users and I hope I now have an elegant(?) solution to it.
The solution that I provide does not require any use of the mysqldump utility at all. Most solutions to the problem I’ve seen so far were more or less using the mysqldump utility like this:
$ mysqldump --default-character-set=latin1 --opt -u user -p dbname> latin1-dbname.sql
Since many people have their blogs on shared hosting or have very limited shell access, the previous solution is a no-go for them because it requires that they contact their hosting support, explain to them what they want and wait for their reply. If they are lucky they might get the .sql file…else…they are back to where they started.
My solution:
First of all it is based purely on this post: http://combatwombat.7doves.com/2008/10/26/mysql-latin1-to-utf8-issues .While that post does not mention greek characters at all, it gave me an idea of how it should be done.
In order to solve the problem using my solution you need a Linux or a MacosX host. This is because the solution is based on a bash script that needs the sed utility for character conversion. Both bash and sed are of course not included on a default windows installation. So if you are a windows only user you can either install those tools through cygwin and try if it works (never tested), ask a friend of yours that uses linux or macosX to help you, boot a linux live cd, or install linux 😀
What every hosting solution definitely has is access to mysql databases through phpmyadmin. Even if your hosting provider or control panel does not provide it for you, you can always install it manually. One of the easiest things to do on phpmyadmin is export a database. Just open phpmyadmin, select your database, click on export on the right, select some or all of the database tables you want, select “Save as File” and click on zipped. Click on “Go” and after a few seconds you will have your .sql.zip file sent to you. If you find that hard to do please ask a friend. Please don’t blame me for blowing up your mysql database if you don’t know how to handle these simple directions..
Let’s say that the db name was sample-db, then you should have gotten a file named: sample-db.sql.zip
Unzip it:
$ unzip sample-db.sql.zip
and then edit it with a text editor,vim for example:
$ vim sample-db.sql
If you are suffering from the problem mentioned before you will probably see things like:
To start the conversion you need to download the following script: greek-convert-latin1-to-utf8.sh or greek-convert-latin1-to-utf8.sh.gz (you need to extract the .gz)
The make it an executable: $ chmod +x greek-convert-latin1-to-utf8.sh
And then execute the script with the database as an input:
$ ./greek-convert-latin1-to-utf8.sh sample-db.sql
I'll work on sample-db.sql ...
sample-db.sql...done
Then you will have a new file named sample-db.sql.clean as output on the same dir you ran the script.
Open it and you should now see every post in pure utf8 greek like this:
As you can see 99% percent of the characters were correctly converted to proper greek utf8 ones. I don’t currently have the time to investigate why a few characters don’t get properly converted, but I’ll soon find a solution for that too 🙂
What’s now left to do is to import the sample-db.sql.clean file to your new hosting…you can do that of course through phpmyadmin…
The conversion table that I used is here:greek-replacement-latin1-to-utf8.ods
This might be a late solution, since the problem is quite old, but I am sure that there are many people still having headaches over issues like this. Enjoy 🙂
Downloads:
greek-convert-latin1-to-utf8.sh
greek-convert-latin1-to-utf8.sh.gz
Filed by kargig at 21:14 under Greek,Internet,Linux
Tags: bash, convert, Greek, latin1, mysql, sed, utf8, wordpress
22 Comments | 38,654 views
I recall I had a hard time with this issue in WordPress. I was not aware of the existence of that script (greek-convert-latin1-to-utf8.sh), but as I think about it right now I wouldn’t feel confident enough to use search&replace techniques in order to convert the database to utf8. It seems like a hack to me and it’s not 100% successful (as you have written above). The correct UTF-8 characters are right there, but stored in a latin1 table.
There was a document in the wordpress codex describing the required steps to successfully convert the database. After much effort, I had finally fixed the utf8 issue by using the UTF-8 Database Converter plugin. That’s a clean solution and requires the minimum effort. It worked 100% for me, so I highly recommend it.
Found that codex document. Be ware, this is what it states about the plugin I referred to:
I guess data corruption happens because of changes in the database structure in the recent wordpress releases.
You couldn’t know of the script back then because it didn’t exist until yesterday. I created it yesterday based on the url I mention in the post.
I didn’t want to be toooo technical in the post, but since you argue about the script’s efficiency let me explain how it works. The Greek UTF8 characters inside latin1 fields are stored as 2 bytes (utf8 is multi-byte), if you know which 2 bytes/characters these are, you can replace these duets with a proper single UTF8 one. I found out about these duets using the technique explained on the reference post. Then took those chars, that are also seen on the conversion table, and extended the original bash script. The same thing can of course be done by others for other languages too. The sed commands replace these “known” duets from the conversion table to the proper UTF8 characters. It’s not a hack. It’s a script.
The solution I provide works with all web apps and not just with wordpress (as a plugin). As it is not a plugin, it is much cleaner, in the sense that one can easily see what exactly is going on.
P.S. I’ve just taken a look at the old database the screenshots were from, and it also contains these corrupted chars, like “¬” inside some words. So the script’s efficiency is not 99% but 100% 🙂
P.S.2 If you find some case that it doesn’t work as expected please let me know.
Thanks for the clarification and the explanation of the character conversion to utf8. I must have overlooked some parts of your post and the posts you refer to. Probably search&replace is not that bad as I had initially thought 🙂
PS: I had also sent another comment with the link to the document on the wordpress codex page, but it has probably been caught by the spam plugin. This document is interesting and I think the procedure it outlines could be used to fix the databases of other web applications (besides WordPress). Some of the benefits of that procedure are that it is not bound to a specific language and it does not manipulate the data, since the data is already stored in utf8. I think you should definitely have a look.
I got you off the spam database…sorry..I have to retrain Spam Karma 2 again…
Thanks. I’ve always been using Akismet, but… it’s not any better. Causes false positives/negatives from time to time.
The essense of the problem with the encodings is that the problematic text is UTF-8, autoconverted from windows-1251/2, although the source encoding was windows-1253 (or iso-8859-7).
So, the canonical way to fix this is to convert the UTF-8 text back to windows-1252 and then, convert from windows-1253 to UTF-8.
You can perform these conversions with iconv, avoiding the use of ‘sed’.
For more, see
http://simos.info/blog/archives/469
@Simos: If I understand corrrectly, you are talking about another difficult case. I am strictly talking about UTF8 characters inside latin1_swedish_ci fields. The problem you describe is a bit older when CMSs used win1253/iso-8859-7 encodings. The source encoding I am talking about is always UTF8.
While on this subject, you may also want to read http://www.mysqlperformanceblog.com/2009/03/17/converting-character-sets/
Thanks a lot! (and for the email ;))
Is it possible I send someone my sql dump to do the conversion for me … and then simply send it back to me?
I am trying for months to do this but I cant do it … I dont have a Linux box.
Regards
I guess I could do that…send your database to the email posted on this page: http://www.void.gr/kargig/blog/contact/
paidia meta apo apeiro psaximo pano se afto to provlima, i lysi einai i exis:
https://sourceforge.net/project/showfiles.php?group_id=185125&package_id=215515&release_id=480184
trexete to programmataki,
kanete paste to keimeno sto kato dexia box kai epilegete to 3o tab iso-8859-7
kai patate to koumpi akrivos apo pano tou -> transform iso-8859-7
sto prwto pedio twra (pano dexia) vgazei ta ellinika kanonika etoima gia import sti vasi..
thymitheite na kanete ena donation, pragmatika to software einai swtiria..
manolol
Anyone know of a way to do this in windows?
I have a couple of exported .sql files, but I need to convert them. I do not run linux at home so I can’t use this .sh 🙁
Today was my day of trying to convert my latin1 IPB board from ISO-8859-7 to UTF-8.
So I exported my whole db using phpmyadmin, unziped the file, downloaded your script, run the script across my dbdump, but the new file (.clean) remains the same as the original one! 🙁
Heres a six screenshot set where you can see the process:
http://www.flickr.com/photos/gtasioulis/sets/72157622841270887/
Am I doing something wrong? Any ideas?
Sorry for the LONG delay but your comment was eaten by spamkarma and was waaaay too deep in the spam queue…
Which encoding did you use on phpmyadmin when you exported the file ? With which encoding are you able to see the data inside the (old) database (using phpmyadmin of course) correctly ?
Try to switch encoding when exporting the data
I dont understand how to do this. If its this easy, can I give someone 20-30 euros to do this for me please?
I appreciate for your script. I had an SQL file that needed conversion and your script worked miracles.
Much appreciated!
Yo mate, this works like a charm! Thanks a lot
Cheers
paidia meta apo apeiro psaximo pano se afto to provlima, i lysi einai i exis:
https://sourceforge.net/project/showfiles.php?group_id=185125&package_id=215515&release_id=480184
trexete to programmataki,
kanete paste to keimeno sto kato dexia box kai epilegete to 3o tab iso-8859-7
kai patate to koumpi akrivos apo pano tou -> transform iso-8859-7
sto prwto pedio twra (pano dexia) vgazei ta ellinika kanonika etoima gia import sti vasi..
thymitheite na kanete ena donation, pragmatika to software einai swtiria..
I ‘ve been searching and trying many different things last 3 days. Nothing worked but your script. Thank you
Works like charm! Thanks dude
Great clean solution