
UTF-8 and Unicode

by Harry Koopman

20 July 2006



Some answers to a number of unexpected problems I encountered when using UTF-8 as the charset in (X)HTML documents.

For simple HTML pages the following charset will do:
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">

If you want more, for example included text files or CMS solutions, UTF-8 is a better choice of charset.

On this page I list some problems I encountered. The first one gave me quite a headache before I found the cause.

FTP program not set to pass UTF-8 through by default

Do you work with Fetch (Macintosh) as your FTP application? Watch out. In Fetch version 4.03 the option 'Translate ISO characters' should be unchecked. By default it is checked. You can find this option in Preferences, under Miscellaneous.

If you do not uncheck 'Translate ISO characters', all UTF-8 documents you send to the server are corrupted. The lesson: use a modern FTP client and check its settings.
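To illustrate why such a translation is destructive, here is a minimal Python sketch. It assumes the option performs a byte-for-byte conversion from Mac Roman to ISO-8859-1, which is only my guess at what such a setting does; the function name translate_single_byte is hypothetical. The point is general: any single-byte charset translation applied to a UTF-8 file breaks its multi-byte sequences.

```python
def translate_single_byte(data: bytes,
                          source: str = "mac_roman",
                          target: str = "iso-8859-1") -> bytes:
    """Mimic a single-byte charset translation applied during upload.

    Assumption: the FTP client treats every byte as one character in
    `source` and rewrites it as the matching byte in `target`.
    Characters that do not exist in `target` become '?'.
    """
    return data.decode(source).encode(target, errors="replace")


if __name__ == "__main__":
    original = "é".encode("utf-8")           # two bytes in UTF-8
    damaged = translate_single_byte(original)
    print(original.hex(), "->", damaged.hex())
    try:
        damaged.decode("utf-8")
    except UnicodeDecodeError:
        print("the uploaded bytes are no longer valid UTF-8")
```

Plain ASCII text passes through such a translation unchanged, which is why the problem only shows up once a document contains accented or other non-ASCII characters.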

Make sure that your (X)HTML documents, and also any texts you include, are UTF-8 encoded.

This is not always the case automatically. For example, in BBEdit 7, save the text with 'Save As' and then choose 'UTF-8' in the options.

UTF-8 with Byte Order Mark?

You can write UTF-8 with or without a Byte Order Mark (BOM). If you work with PHP, write the UTF-8 file without a BOM, because it can cause problems. The BOM is written at the very beginning of the file, and in some circumstances PHP cannot handle this.
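If you are not sure how a file was saved, a quick check is to try decoding its bytes as UTF-8. The following is a minimal Python sketch, not a complete encoding detector; the names is_valid_utf8 and file_is_valid_utf8 are hypothetical.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_valid_utf8(path: str) -> bool:
    """Read a file as raw bytes and check whether it is valid UTF-8."""
    return is_valid_utf8(Path(path).read_bytes())


if __name__ == "__main__":
    print(is_valid_utf8("café".encode("utf-8")))       # saved as UTF-8
    print(is_valid_utf8("café".encode("iso-8859-1")))  # saved as ISO-8859-1
```

Note that a file containing only ASCII characters is valid UTF-8 as well, so this check only tells you something once the text contains accented or other non-ASCII characters.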

For example, a command like this:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
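typically fails with a 'headers already sent' warning if there are BOM bytes at the beginning of the file. PHP sends the BOM bytes, which come before the <?php tag, to the browser as output, and header() only works before any output has been sent.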
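The UTF-8 BOM consists of the three bytes EF BB BF at the very start of the file, so it is easy to detect and remove. Below is a minimal Python sketch of such a check; the function names are hypothetical, and in practice you would read the bytes from your PHP file and write the cleaned bytes back.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    php_source = codecs.BOM_UTF8 + b"<?php\nheader('Content-Type:text/html;charset=utf-8');\n?>"
    print(has_utf8_bom(php_source))                  # file saved with BOM
    print(has_utf8_bom(strip_utf8_bom(php_source)))  # after stripping
```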

Which encoding does your server send? Check the HTTP header.

We assume you have an external ISP (Internet Service Provider).
By default, the Apache server may already declare an encoding in the HTTP header.
http://www.rexswain.com/httpview.html
Via this URL you can inspect the HTTP header of a document. The Content-Type is defined in the HTTP header. If it is:
Content-Type: text/html
then all is well, because in that case it is up to you to define the content type in the head of your document, for example with:
<meta http-equiv="content-type" content="text/html;charset=utf-8">

If, however, the HTTP header says:
Content-Type: text/html; charset=iso-8859-1
then Apache is being too patronizing. Because Apache (through a setting in httpd.conf) defines the charset in the HTTP header, you cannot override the charset with the meta tag described above.
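The rule can be sketched in a few lines of Python: a charset in the HTTP header wins, and the meta tag only counts when the header names none. This is a simplified model under that assumption, not a full description of how browsers detect encodings. The function names are hypothetical, and fetch_content_type needs network access and a URL of your own.

```python
from email.message import Message
from typing import Optional
from urllib.request import Request, urlopen


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # None when no charset is given


def effective_charset(header_value: str,
                      meta_charset: Optional[str]) -> Optional[str]:
    """The HTTP header charset takes precedence over the meta tag."""
    return charset_from_content_type(header_value) or meta_charset


def fetch_content_type(url: str) -> Optional[str]:
    """Ask the server for the headers only and return Content-Type."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Content-Type")


if __name__ == "__main__":
    print(effective_charset("text/html", "utf-8"))
    print(effective_charset("text/html; charset=iso-8859-1", "utf-8"))
```

In the first case the meta tag decides; in the second the server's iso-8859-1 overrides it, which is exactly the situation described above.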

change charset with htaccess

In this scenario you should ask your ISP to change the setting in their httpd.conf. If that is a problem, an option within your own control is an .htaccess file with the following content:

AddCharset UTF-8 .html
AddType "text/html; charset=UTF-8" html

This way you can override the charset. The content as listed worked for me. You may be able to use only the AddCharset line or only the AddType line. You can verify the result by checking the Content-Type in the HTTP header.

eliminate charset with htaccess

Another solution is an .htaccess file in the root of your website with the content:

AddDefaultCharset Off

This undoes the charset definition by Apache and makes it possible for you to set the charset per page with the meta tag. So, put this in the head section of your (X)HTML:

<meta http-equiv="content-type" content="text/html;charset=utf-8">

to set the character set to UTF-8 for that page.

htaccess problems

Problems may occur, because an .htaccess file may have no effect, or may cause an internal server error (error 500). This depends on the settings at ISP level in the httpd.conf file, in particular on which directives the AllowOverride setting permits in .htaccess files.

If the server sends a default iso-8859-1 Content-Type header and you cannot change the content type with .htaccess, you have a problem. UTF-8 is then not possible (except with PHP, see below). You should sort this out with your ISP (or choose another one).

send header with PHP

By putting the following in a PHP document as the first command:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
you can force the UTF-8 encoding for that PHP document. But beware of the Byte Order Mark, as described earlier in this article.

20 July 2006, Harry Koopman, updated 13 December 2008

www.marsandmc.nl | harry koopman
rights reserved