Solving UTF-8 problems

UTF-8 in webpages

Some answers to a number of unexpected problems I encountered when using the Unicode charset UTF-8 in (x)html documents.

UTF-8

These days (2013) the best choice for a webpage character set is the Unicode charset UTF-8. Before that, ISO-8859-1 was the most common charset for Western languages. The advantage of UTF-8 is that it covers far more characters, so html-entities are seldom needed.

FTP-program default not set to pass through utf-8

Note: this experience dates from about 7 years ago. FTP programs have made progress since then.

Do you use Fetch (Macintosh) as your FTP application? Then pay attention to this: in Fetch version 4.03 the option 'Translate ISO characters' should be unchecked. By default it is checked. You can find this option in Preferences, under Miscellaneous.

If you do not uncheck 'Translate ISO characters', all UTF-8 documents you send to the server are corrupted. The lesson: use a modern FTP client and check its settings.

Start UTF-8 at the source

Make sure that your (x)html docs, and also any texts you include in them, are UTF-8 encoded. A text editor can store texts in different encodings. There is a good chance that, if you do nothing, a 'Western-ISO' encoding is used when saving a file. Be sure to use 'Save as' and check which encoding is used. Make sure it is UTF-8.
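One way to check what an editor actually wrote is to test whether the saved bytes decode as UTF-8. The following is a minimal Python sketch, not something the original workflow relied on; the function names is_valid_utf8 and latin1_to_utf8 are made up for this illustration. Note that a file containing only plain ASCII characters passes the check in either encoding, so the test only says something once accented or other non-ASCII characters are present.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that were saved as ISO-8859-1 into UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


text = "café"
saved_as_latin1 = text.encode("iso-8859-1")  # what a 'Western-ISO' save produces
saved_as_utf8 = text.encode("utf-8")         # what 'Save as UTF-8' produces

print(is_valid_utf8(saved_as_latin1))  # False
print(is_valid_utf8(saved_as_utf8))    # True
print(latin1_to_utf8(saved_as_latin1) == saved_as_utf8)  # True
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to is_valid_utf8.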

UTF-8 with Byte Order Mark ?

Text editors often let you save UTF-8 with or without a Byte Order Mark (BOM). For UTF-8 the BOM is the three bytes EF BB BF at the very start of the file. When using UTF-8, write your texts without it ("Save as UTF-8, No BOM").
If you work with PHP, the Byte Order Mark can cause problems when PHP has to send headers at the beginning of the document. PHP cannot send headers once any output, including the BOM bytes, has been sent.

For example, a command like this:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
fails with a PHP warning ("Cannot modify header information - headers already sent"), and the header is not set, if BOM bytes were sent before it.
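If you are not sure whether a file starts with a BOM, you can look at its first three bytes. Below is a small Python sketch under that assumption; has_utf8_bom and strip_utf8_bom are hypothetical helper names, and codecs.BOM_UTF8 is the standard-library constant for the bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 Byte Order Mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if there was one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A PHP file as an editor would write it with "UTF-8 with BOM".
sample = codecs.BOM_UTF8 + b"<?php header('Content-Type:text/html;charset=utf-8'); ?>"

print(has_utf8_bom(sample))                  # True
print(strip_utf8_bom(sample)[:5])            # b'<?php'
print(has_utf8_bom(strip_utf8_bom(sample)))  # False
```

To repair a real file, read it in binary mode, pass the bytes through strip_utf8_bom, and write the result back in binary mode.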

Does server send an encoding? Check the http-header

There are websites that can help you see the http-header that is sent by your webpage. One of them is:

www.rexswain.com/httpview.html

There are also browser extensions that will show you the http-header.

In the http-header the Content-Type of the document is defined (among other things). If this is (for a normal webpage)
Content-Type: text/html
then it's ok. In that case you can always set the charset of your document yourself, for example with the meta tag:
<meta http-equiv="content-type" content="text/html;charset=utf-8">
or in html5:
<meta charset="utf-8">

Sometimes however the Content-Type sent by the http-header is this one:
Content-Type: text/html; charset=iso-8859-1
In this case Apache (we presume Apache on the webserver) is too patronizing. Because Apache (through a setting in httpd.conf, a file on the server with Apache settings) defines the charset in the http-header with this extended Content-Type definition, you cannot override the charset with the meta tag described above.

I do not deliver hosting services, I only use them. From what I have read, the httpd.conf shipped with Apache 2.0 contains this extended Content-Type setting (AddDefaultCharset ISO-8859-1) by default, for some obscure reason, although Apache's built-in default for AddDefaultCharset is Off. ISPs (Internet Service Providers) should always change the Apache Content-Type setting so that it does not send a default encoding. But many ISPs do not change this unfortunate setting.

If the http-header sends an encoding (most often the character set ISO-8859-1) along with the document type, you are NOT able to define a charset with the meta tag. The http-header encoding will always override the meta tag encoding.
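The precedence described here can be written down as a small Python sketch. It is a simplified illustration, not how a browser is really implemented: the function names and the regular expressions are assumptions made for this example, and the fallback value stands in for whatever default a browser would pick.

```python
import re

# Simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_from_header(content_type: str):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: bytes):
    """Return the charset named in a meta tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: str, html: bytes, fallback: str = "iso-8859-1") -> str:
    """The header wins; the meta tag only counts if the header names no charset."""
    return charset_from_header(content_type) or charset_from_meta(html) or fallback


page = '<meta charset="utf-8"><p>café</p>'.encode("utf-8")

print(effective_charset("text/html", page))                      # utf-8
print(effective_charset("text/html; charset=iso-8859-1", page))  # iso-8859-1

# Decoding the UTF-8 bytes with the charset forced by the header
# turns the single character é into the two characters Ã and ©.
wrongly_decoded = page.decode("iso-8859-1")
print("caf\u00c3\u00a9" in wrongly_decoded)  # True
```

The last lines show the visible symptom: a page saved as UTF-8 but announced as ISO-8859-1 displays two odd characters for every accented letter.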

Htaccess: set UTF-8

If the webserver accepts .htaccess settings, you are able to replace the charset sent by Apache. This can be done with the following lines:

AddCharset UTF-8 .html
AddType "text/html; charset=UTF-8" html


This way you can overrule the charset. The lines as listed worked for me. It may be that you can use only the AddCharset line or only the AddType line. If you get Error 500, experiment with using just one of them.
You can verify the result by checking the Content-Type of the http-header.
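Instead of a website or a browser extension, a short script can also show what the server sends. This is a hedged Python sketch using only the standard library; fetch_content_type and charset_of are names invented here, and the URL in the usage comment is a placeholder, not a real address from this article.

```python
import urllib.request
from email.message import Message


def charset_of(content_type: str):
    """Return the charset in a Content-Type value (lower case), or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def fetch_content_type(url: str) -> str:
    """Ask the server for the headers only and return its Content-Type value."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type", "")


# No network access is needed to try the parsing part:
print(charset_of("text/html; charset=UTF-8"))  # utf-8
print(charset_of("text/html"))                 # None

# Usage against a live server (placeholder URL):
#   print(charset_of(fetch_content_type("http://www.example.com/page.html")))
```

If the result is still iso-8859-1 after uploading the .htaccess file, the setting had no effect. Some servers answer a HEAD request differently from a normal request, so a browser check remains a useful second opinion.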

Htaccess: eliminate charset

AddDefaultCharset Off

With this line in an .htaccess file (preferably in the root of your site) you remove the charset sent by Apache. In that case you always have to define your charset in your html with a meta tag, or by sending a PHP header (see the section on sending a header with PHP later in this document). In any case, it is always best to define a charset in every html document. (Why? If you store webpages on your computer, the pages are not served by the remote server, so you'd better not rely on a charset that is only sent by the remote server.)

Htaccess problems

Problems may occur, because an .htaccess file may have no effect, or may cause an internal server error (error 500). This depends on the settings at ISP level in the httpd.conf file.

If the server sends a default ISO-8859-1 Content-Type header and you cannot change it with .htaccess, you have a problem. UTF-8 is not possible then (except with PHP, see below). You should contact your ISP and ask them to change the setting in the httpd.conf, and if they don't want to do that, you'd better look for a better ISP...

ISPs are sometimes reluctant to change settings, because they don't want undesired effects for other users. But in the case of the Apache Content-Type definition, removing the charset ISO-8859-1 will never have negative results for their customers:
• if they did not set any charset in their html, browsers will typically fall back on a Western encoding like ISO-8859-1
• if they set a Western encoding like ISO-8859-1 in their html, nothing will change for them
• if they set UTF-8 in their html, UTF-8 will take effect. So the page will render according to the charset they intended. Mismatches are not to be expected: UTF-8 also deals with html-entities.

Send header with PHP

By putting this as the first command in a PHP document:
<?php
header("Content-Type:text/html;charset=utf-8");
?>
you can force the utf-8 encoding for that php document. It will overrule the http-header setting. If you must depend on this because of the problems mentioned above, you are bound to serve your pages as php. On pure html pages you still have the encoding problem.

2006, Harry Koopman, update 2013