Jun 21, 2011
 

iconv – Today's post is by Juan Valencia, originally published on his blog (also available there in Spanish). I found it really interesting, along with his other in-depth articles on rsync, ssh and other commands.

When you receive and need to handle multiple text files that use characters not native to the English language, you may run into the problem of dealing with different character encodings. This is particularly noticeable on websites: if the browser tries to interpret a text file with an encoding that differs from the one the file actually uses, we see strange symbols where those characters were supposed to appear. But it is not limited to websites; any program made to work with languages other than English may present a similar problem if encodings are not handled appropriately.
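A quick way to see this effect from the command line (assuming a terminal that itself uses utf-8, and the iconv command introduced below) is to deliberately interpret a utf-8 string as iso-8859-1:

# "á" in utf-8 is two bytes; reading those bytes as iso-8859-1
# and converting them back to utf-8 produces the familiar garbage "Ã¡"
echo "á" | iconv -f iso-8859-1 -t utf-8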




In the case of HTML files, many people, and several programs by default, opt to replace these foreign characters with either HTML entities (e.g. &aacute; to place an á) or ISO Latin-1 numeric references (e.g. &#225; to place an á), but the truth is that nowadays every modern (and not so modern) browser can successfully handle encodings such as iso-8859-1 or utf-8. All that we have to do is choose an encoding, use that same encoding for all files to avoid conflicts, and tell the browser that we are using that encoding. Personally I prefer utf-8, as I consider it a much more flexible and complete character set, and unless it is otherwise required I have standardized on utf-8 in all my projects and in my systems in general.
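For web pages, a simple sanity check is to compare the charset the page declares with the encoding its bytes actually use; page.html here is just a placeholder name:

grep -i charset page.html         # the encoding the page claims to use (meta tag or http-equiv line)
file --mime-encoding page.html    # the encoding the bytes actually use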

To detect the encoding used within a file, we can use the command “file“. This command tries to autodetect the encoding that a file is using. If no special characters are detected inside the text file, “file” will report the encoding as us-ascii, and our editor can use whatever character encoding it is set to use by default. Of course, I set my editors to work with utf-8 by default.

file --mime-encoding file.txt

Once we know the encoding of the file, we can transform it to a different character encoding if necessary, by using:

iconv --from-code=iso-8859-1 --to-code=utf-8 file.txt > file.txt.utf8
mv file.txt.utf8 file.txt
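To confirm that the conversion worked, we can run file on the result again; it should now report utf-8 (or us-ascii if the file contains no special characters):

file --mime-encoding file.txt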

Changing the character encoding of multiple files

When we need to change the character encoding of one file, more often than not we have to change the character encoding of other files as well. To apply this operation to several files at once we can use:

for old in *.txt;
do
iconv --from-code=iso-8859-1 --to-code=utf-8 "$old" > "$old.utf8";
done

Once this is done, we can rename all the converted files back to the names they were generated from, in effect replacing the originals with the re-encoded versions:

for old in *.utf8;
do
cp "$old" "$(basename "$old" .utf8)";
done

basename gives us the name of the file minus the “.utf8” part. If everything is ok, we can remove the temporary files that we created.

rm *.utf8
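Whether before or after the cleanup, we can double-check the result by asking file about all the converted files at once; they should now be reported as utf-8 (or us-ascii for files containing no special characters):

file --mime-encoding *.txt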

Here ends Juan's article.

Some more iconv examples:

find . -name "*.php" -exec iconv -f ISO-8859-1 -t UTF-8 {} -o ../newdir_utf8/{} \;

Batch convert files to utf-8 taken from http://blog.ofirpicazo.com/linux/batch-convert-files-to-utf-8/
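Note that iconv will not create the destination directories by itself, so if the .php files are spread across subdirectories the target tree has to exist first. A variation along these lines (same hypothetical ../newdir_utf8 destination) creates each directory as needed:

find . -name "*.php" | while read -r src; do
    mkdir -p "../newdir_utf8/$(dirname "$src")"                # create the destination directory if missing
    iconv -f ISO-8859-1 -t UTF-8 -o "../newdir_utf8/$src" "$src"
done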

mysqldump --add-drop-table -uroot -p "DB_name"  | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql -uroot -p "DB_name"

Convert mysql database from latin1 to utf8
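If you prefer to keep a copy of the converted dump around in case something goes wrong (DB_name and dump_utf8.sql are just placeholders), the same pipeline can be split into two steps:

mysqldump --add-drop-table -uroot -p "DB_name" | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 > dump_utf8.sql
mysql -uroot -p "DB_name" < dump_utf8.sql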

