Re: How to upload form data containing special characters correctly?



On Mon, 04 Sep 2006 11:24:04 +0200, Wim Cossement <wcosseme@xxxxxxxxxxxxxx> wrote:

Hello,

I was wondering if there are a few good pages and/or examples on how to
process form data correctly for putting it in a MySQL DB.

Since I'm not used to using PHP a lot, I already found out that
addslashes() can be used to escape some characters, but I'm having some
more problems with for instance ä, å and µ (since the text is scientific).
Now some people also throw in htmlspecialchars() to convert those to
HTML entities, but some nest htmlspecialchars() in addslashes() and
others do the opposite.

Is there a good and error proof way of ensuring that what one puts in a
textarea gets stored and can be retrieved safe and sound?

Thanks in advance,

Wimmy



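I found some user comments in the PHP manual under htmlspecialchars();
I think these might help.

Also, if you need to save special characters, I suggest turning off magic quotes, which suppresses
the backslashes PHP normally adds, with set_magic_quotes_runtime(0);
(form input itself is governed by magic_quotes_gpc, which has to be switched off in php.ini or .htaccess).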
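
For the original question, one common arrangement is to store the text exactly as typed and to escape it only for the context it is going into. The following is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a modern PHP with PDO, a page and database connection that both use UTF-8, and a hypothetical table notes(body). The function names save_text() and for_html() are made up for illustration.

```php
<?php
// Hypothetical helper: store the raw text with a prepared statement,
// so no addslashes() or htmlspecialchars() is needed before the INSERT.
// The table "notes" and column "body" are assumptions for this sketch.
function save_text(PDO $pdo, string $text): void {
    $stmt = $pdo->prepare('INSERT INTO notes (body) VALUES (?)');
    $stmt->execute([$text]);
}

// Hypothetical helper: escape only when the text is written into HTML.
// Characters such as ä, å and µ pass through unchanged as UTF-8.
function for_html(string $text): string {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo for_html("5 \u{00B5}m <b> & 'x'"), "\n";
// 5 µm &lt;b&gt; &amp; &#039;x&#039;
```

With this split the nesting order of addslashes() and htmlspecialchars() no longer matters, because neither is applied to the stored value.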

After looking into the non-native encoding problem, I noticed that if, for example, the page encoding is
Cyrillic and I type Latin characters that are not part of that encoding (æ, the ae-ligature, for example),
the browser will send the named entity instead, &aelig; in this case.
Therefore, the only way I see to display multilingual text that is encoded with entities is:
<?php
echo str_replace('&amp;', '&', htmlspecialchars($txt));
?>
The regex for numeric entities will skip the Latin-1 textual entities.
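
To make the effect of that one-liner concrete, here is a small sketch. The wrapper name escape_keep_entities() is hypothetical, and the explicit ENT_COMPAT and 'UTF-8' arguments are assumptions added so the result does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function escape_keep_entities($txt) {
    // Escape <, >, " and & first, then undo the escaping of & only,
    // so entities sent by the browser (e.g. &aelig;) still render.
    return str_replace('&amp;', '&', htmlspecialchars($txt, ENT_COMPAT, 'UTF-8'));
}

echo escape_keep_entities('<b>&aelig;</b> &#1073;'), "\n";
// &lt;b&gt;&aelig;&lt;/b&gt; &#1073;
```

The trade-off is that a plain & typed by the user is also left unescaped, and any entity the user types by hand is rendered rather than shown literally.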







A sample function, in case anybody wants to turn HTML entities (and special characters) back into plain characters
(e.g. "&egrave;", "&lt;", etc.):
function html2specialchars($str){
    // Flip the character => entity table so it maps entity => character.
    $trans_table = array_flip(get_html_translation_table(HTML_ENTITIES));
    return strtr($str, $trans_table);
}
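
The flipped translation table only covers named entities. As a hedged alternative, the built-in html_entity_decode() (available since PHP 4.3) also handles numeric entities. The sketch below assumes the application works in UTF-8, and the name decode_entities() is hypothetical.

```php
<?php
// Decode named and numeric entities back to characters.
// Assumes the application works in UTF-8.
function decode_entities($str) {
    return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
}

echo decode_entities('caf&eacute; &lt;b&gt; &#39;x&#39; &#1073;'), "\n";
// café <b> 'x' б
```

Decoded text is raw again, so it needs escaping with htmlspecialchars() before it is written back into a page.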






Quite often, on HTML pages that are not encoded as UTF-8, when people write in an encoding other than the page's own,
some browsers (Internet Explorer for sure) will send the characters from the other charset as HTML entities,
such as &#1073; for the small Russian 'b' (б).
htmlspecialchars() will then break this entity, turning it into &amp;#1073;, since it changes every & to &amp;.
What I usually do is either turn &amp; back into & so the correct characters appear in the
output, or use a regex to restore all numeric character entities to their original form:
<?php
// treat this as pseudo-code, it hasn't been tested...
$result = preg_replace('/&amp;#(x[a-f0-9]+|[0-9]+);/i', '&#$1;', $source);
?>
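
As a quick check of that regex, here is a sketch that runs htmlspecialchars() first and then restores only the numeric entities. The helper name escape_keep_numeric_entities() is hypothetical, and the ENT_COMPAT and 'UTF-8' arguments are assumptions made so the result is the same across PHP versions.

```php
<?php
// Hypothetical helper combining htmlspecialchars() with the regex above.
function escape_keep_numeric_entities($source) {
    $escaped = htmlspecialchars($source, ENT_COMPAT, 'UTF-8');
    // Turn &amp;#1073; or &amp;#x431; back into &#1073; / &#x431;.
    return preg_replace('/&amp;#(x[a-f0-9]+|[0-9]+);/i', '&#$1;', $escaped);
}

echo escape_keep_numeric_entities('&#1073; &#x431; &aelig; <i>'), "\n";
// &#1073; &#x431; &amp;aelig; &lt;i&gt;
```

Named entities such as &aelig; stay double-escaped here, which is the behaviour the other comment describes as the regex skipping the Latin-1 textual entities.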





Why &#39;? The XML and XHTML DTDs define &apos; for this (HTML 4 does not).
See http://www.w3.org/TR/html/dtds.html#a_dtd_Special_characters
So for XHTML or XML output, better use this:
$text = htmlspecialchars($text, ENT_QUOTES);
$text = preg_replace('/&#0*39;/', '&apos;', $text);
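
Wrapped into a function, the two lines above might look like the sketch below. The name escape_with_apos() is hypothetical, and ENT_HTML401 (PHP 5.4 and later) and 'UTF-8' are assumptions added so the single quote reliably comes back in numeric form before the regex runs.

```php
<?php
// Hypothetical helper: escape for XHTML/XML output, using &apos; for '.
function escape_with_apos($text) {
    // ENT_HTML401 is spelled out so the single quote comes back as &#039;
    // regardless of the PHP version's default flags.
    $text = htmlspecialchars($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // &#39;, &#039;, &#0039; and so on all become &apos;
    return preg_replace('/&#0*39;/', '&apos;', $text);
}

echo escape_with_apos("it's <ok>"), "\n";
// it&apos;s &lt;ok&gt;
```

According to the manual, PHP 5.4 and later should emit &apos; directly when ENT_XML1 or ENT_XHTML is passed together with ENT_QUOTES, which would make the regex step unnecessary there.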
