OT: exercises in character encodings (was: Re: DOCTYPE)



By the way, if you want a nice introduction to the matter, here's a good start:

http://www.joelonsoftware.com/articles/Unicode.html

... and beware of onions ;) There is one thing on which I strongly disagree with the above site: the remark that character encodings are easy. They are not. Especially if you take some quirky behaviours of Windows and MySQL into account.

IMHO it _can_ be easy - you just have to do it consistently.
With UTF-8 you have to make sure that

* your data is stored as UTF-8 in the DB
* correctly transferred to your script (SET NAMES utf8)
* correctly transferred to the browser (header())

In short: UTF-8 all the way from the source to the reader.
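
To make "all the way" concrete, here is a minimal sketch of what happens when one hop in that chain disagrees about the encoding. It is plain Python rather than PHP/MySQL, and the variable names are made up for illustration; the hops stand in for the DB, the connection and the browser.

```python
# Hypothetical chain: DB bytes -> script -> browser.
text = "café"

# What a UTF-8 column physically holds for this string.
stored = text.encode("utf-8")            # b'caf\xc3\xa9'

# Consistent chain: every hop reads the bytes as UTF-8.
consistent = stored.decode("utf-8")

# Broken chain: one hop reads the same bytes as ISO-8859-1.
mojibake = stored.decode("iso-8859-1")

# The damage is reversible only as long as nothing re-encoded it wrongly.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(consistent)   # café
print(mojibake)     # cafÃ©
print(repaired)     # café
```

The point of the sketch is only that the bytes never change; what changes is the assumption each hop makes about them.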


That is the theory, yes.

But I have a few "exercises" if you like:
- Try to configure MySQL in a way that the server will assume client connections in utf-8 by default.
- You forgot e-mail. Try to send a utf-8 encoded e-mail to MS-Outlook.
- Nice one: have your PHP-enabled server announce that it accepts both iso-8859-1 and utf-8. Not hard, eh? Now create a form. Fill in some data and post it. Do this again, but now select the other encoding in your browser. Look at the submitted headers and the submitted data, and tell me how on earth the server is to know which one the browser used. Tried this with Firefox, IE and Safari...
- See above. Modify your form so that it tells the server the encoding (whether that is right or not). Tell me what happens.
- MySQL has so many encoding translator settings. Explain how you can see what is actually stored in the database.
- If you really want to make sure you are getting the characters right, you can use a unicode escape in some languages. How would you do this for MySQL if you know that your source file passes through at least 3 programs that use different encodings but do not say anything about it in their documentation?
- Try and find a nice database front-end program that can reliably render encodings. For Windows, Linux and Mac if you think it is too easy...
- Well, utf-8 is relatively easy, of course. How would you make MySQL accept utf-16?
- MySQL again: Try to convert latin-1 tables to utf-8. Now do the same if they have compound indexes.
- FPDF: Just create a nice PDF with utf-8 as its base. What does this do to your targeted consistency?
- Explain why Windows and IE render a euro sign in iso-8859-1 while that character is not even covered by it.
- An advanced one: what can a hacker abuse with character encodings? (No short answer, I'm afraid).
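
Two of these can be poked at without any server. The following is a rough sketch in plain Python (not PHP), with a hypothetical helper `can_encode`; it assumes the euro-sign behaviour comes from the bytes being read as windows-1252 rather than true iso-8859-1, which is a likely explanation but not the whole story.

```python
# Euro sign: byte 0x80 is a control character in ISO-8859-1,
# but the euro sign in Windows-1252.
euro_byte = b"\x80"
as_latin1 = euro_byte.decode("iso-8859-1")   # U+0080, not printable
as_cp1252 = euro_byte.decode("cp1252")       # '€'


def can_encode(text, encoding):
    """Hypothetical helper: True if text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("€", "iso-8859-1"))   # False
print(can_encode("€", "cp1252"))       # True

# Form post: the same bytes decode without error under both encodings,
# so the bytes alone cannot tell the server which one the browser used.
posted = b"caf\xc3\xa9"
readings = {enc: posted.decode(enc) for enc in ("utf-8", "iso-8859-1")}
for enc, value in readings.items():
    print(enc, "->", value)
```

Since iso-8859-1 assigns a character to every byte value, decoding with it never fails, which is exactly why "try it and see if it errors" is no help on the server side.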

Happy puzzling!