Why autodetection sucks and why you should always declare what character encoding your document is using

Published: 2006-06-16 09:06:45

This neat trick was published on wincustomize.com:

Create a text file in Notepad (or another text editor; do not use Wordpad, Word or any other word processor).
Type this sentence exactly, without the quotes: “this app can break”.
Save the file, exit the text editor, and open the file in Notepad (by double-clicking, or by File→Open).
Notice that the text has transformed into “桴獩愠灰挠湡戠敲歡” (nonsensical Chinese text).
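
The effect can be reproduced outside Notepad. The Python sketch below assumes that the wrong guess is UTF-16 little-endian; that is an assumption made for this illustration, not something stated above. It takes the 18 bytes of the sentence and decodes them both ways.

```python
# The 18 bytes an editor writes to disk for the sentence when it saves
# in a single-byte encoding (ASCII is enough for these characters).
data = "this app can break".encode("ascii")

# What the author meant: one byte per character.
as_ascii = data.decode("ascii")

# Assumed wrong guess: UTF-16 little-endian. Every pair of bytes is
# fused into a single 16-bit code unit, low byte first.
as_utf16 = data.decode("utf-16-le")

print(as_ascii)
print(as_utf16)
# Show the code points so the byte pairing is visible.
print([f"U+{ord(ch):04X}" for ch in as_utf16])
```

The first two bytes, 0x74 ("t") and 0x68 ("h"), are read as the single code unit U+6874, which is a Chinese character. The bytes on disk never change; only the interpretation does.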

Why did it do that? Michael Kaplan has the full explanation, but in short, Notepad takes a stab at auto-detecting what character encoding the file was saved in, and fails horribly. The same thing happens all the time on the Web, which is why browsers have implemented various ways of guessing what the author meant. It often works well, but sometimes it fails. Perhaps not as completely as in the Notepad example above, but enough to make pages difficult or impossible to read.

The only solution to the problem is for Web authors to make sure they declare the character encoding of the documents, scripts and style sheets they create. The easiest way to do this is to make the server software add a charset parameter to the HTTP Content-Type header; Apache, for instance, can do this with the AddDefaultCharset configuration directive. If you cannot control the server, you can also declare the encoding inside the file: with a <meta> tag for HTML, an encoding declaration for XML, or a @charset at-rule for CSS. There is no way to declare the character encoding inside a piece of JavaScript or a plain text file, so for those you really, really should configure the server to send the correct information.
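
As a rough sketch of what "the server declares the encoding" means in practice, here is a tiny handler built on Python's standard http.server module instead of Apache. The names content_type and PlainTextHandler, the sample text and the choice of UTF-8 are all hypothetical and made up for this example. The point is only that the charset travels in the Content-Type header, so it also works for plain text, which cannot declare its own encoding.

```python
from http.server import BaseHTTPRequestHandler

CHARSET = "utf-8"            # assumed encoding for this sketch
BODY = "Blåbærsyltetøy\n"    # hypothetical sample text with non-ASCII letters


def content_type(mime_type, charset=CHARSET):
    """Build a Content-Type value that declares the character encoding."""
    return f"{mime_type}; charset={charset}"


class PlainTextHandler(BaseHTTPRequestHandler):
    """Serves one plain text document with an explicit charset."""

    def do_GET(self):
        # Encode with the same charset that the header declares.
        payload = BODY.encode(CHARSET)
        self.send_response(200)
        self.send_header("Content-Type", content_type("text/plain"))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it, pass the handler to http.server.HTTPServer, for example HTTPServer(("127.0.0.1", 8000), PlainTextHandler).serve_forever(), and inspect the response headers in a browser. The declared charset and the encoding actually used for the bytes must agree; declaring one and sending another is as bad as declaring nothing.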

Tags: browsers unicode encodings

Comments

Date: 2006-07-06 17:07:07
Name: burma

On the other side - when you browse 99% of the web without thinking about encoding - isn't it wonderful? :)

peter@softwolves.pp.se

This was originally posted on My Opera at http://my.opera.com/nafmo/blog/show.dml/300508
Please note that links may be outdated and any information included here may be obsolete.


Archived copy of A Swedish wolf in Norway
