Re: [PHP] First stupid post of the year. [SOLVED]



On Thu, 3 Jan 2008 12:39:36 -0500, tedd wrote:

At 4:24 PM +0100 1/3/08, Nisse Engström wrote:
On Wed, 2 Jan 2008 19:36:56 -0500, tedd wrote:

To find out, I did put the operation through FireFox and reversed the
POST/GET operations to get a look at the string -- it is:

%C2%A0%C2%A0%C2%A0Z%C2%A0%C2%A0%C2%A0 < where Z is the value passed.

Now, C2 (HEX) is a linefeed (194 DEC)

By the way, C2 is not a linefeed as far as I know.

And, A0 (HEX) is a non-breaking space (160 DEC;) which is a &nbsp;

Not quite. <A0> is non-breaking space in *some* character
encodings, such as the ISO-8859-... encodings. It may
be different in other encodings. In UTF-8, it is <C2 A0>,
which is exactly what you're seeing.

Well, considering that UTF-8 encompasses/includes all of the code
points found in ISO-8859, then I think that both encodings would
reference the same character. After all, if they didn't then what's
the point of Unicode?

Now, one can argue how many bytes are needed to represent a character
in what encoding, but that doesn't change the character. In the end,
I believe that <A0> is the same regardless of what charset or
encoding you're using.

You have a point here: the character is the same. In
Unicode it is called U+00A0. But Unicode alone does not
tell you how to represent the character in bytes. You
need an encoding for this.

Unicode specifies a few different encodings, called
transformation formats (the T and F in UTF). The actual
bytes representing U+00A0 are as follows:

UTF-32: <00 00 00 A0>
UTF-16BE: <00 A0>
UTF-16LE: <A0 00>
UTF-8: <C2 A0>

(where the <xx ...> syntax denotes *byte* sequences.
A byte sequence and a character are different things.)
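
For anyone who wants to check the byte sequences above, here is a
minimal sketch in PHP. It assumes PHP 7 or later with the mbstring
extension loaded; the helper name bytes_in() is made up for this
example and is not part of any library.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to $encoding and
// return the resulting bytes as upper-case hex.
function bytes_in(string $utf8, string $encoding): string
{
    return strtoupper(bin2hex(mb_convert_encoding($utf8, $encoding, 'UTF-8')));
}

$nbsp = "\xC2\xA0"; // U+00A0 written as its UTF-8 byte sequence

foreach (['UTF-32BE', 'UTF-16BE', 'UTF-16LE', 'UTF-8', 'ISO-8859-1'] as $enc) {
    printf("%-11s <%s>\n", $enc . ':', bytes_in($nbsp, $enc));
}
```

The hex is printed without spaces, so the UTF-32BE line shows
<000000A0> and the ISO-8859-1 line shows the single byte <A0>.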

The fact that the byte <A0> occurs in UTF-8 is just
an interesting, and easily confusing, coincidence.

In other encodings, the character U+00A0 may be
encoded differently. For example, in CP850 for DOS,
U+00A0 is encoded using the single byte <ff>.

- - -

In HTML, there are a few ways to encode U+00A0. If
you have specified a character encoding for the document,
you can use the encoded character directly. You can
specify the encoding in HTTP (preferable) using PHP:

header('Content-Type: text/html; charset=utf-8');

or .htaccess files (Apache 2):

AddDefaultCharset utf-8

Richard Lynch would tell you to also use a <meta> element:

<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">


If you don't want to, or can't, use the encoded character
directly, you can also use HTML character references, such
as `&nbsp;´, `&#160;´ or `&#x00a0;´. Numeric character
references *always* refer to Unicode characters, *regardless*
of the encoding used in the document. For example, if your
document is encoded in CP850, you would use `&#xa0;´ and not
`&#xff;´ to represent U+00A0.
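
A small PHP sketch of the point about numeric references, assuming
PHP 5.4 or later (for the ENT_HTML401 flag). html_entity_decode()
resolves each reference to the same Unicode character, and only the
output bytes depend on the target encoding passed as the third
argument. The variable names are just for illustration.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML401;

// All three references name U+00A0; decoded to UTF-8 they give <C2 A0>.
$named = html_entity_decode('&nbsp;', $flags, 'UTF-8');
$dec   = html_entity_decode('&#160;', $flags, 'UTF-8');
$hex   = html_entity_decode('&#xa0;', $flags, 'UTF-8');

// The same reference decoded to ISO-8859-1 gives the single byte <A0>.
$latin1 = html_entity_decode('&#xa0;', $flags, 'ISO-8859-1');

// &#xff; names U+00FF (y with diaeresis), not the CP850 byte for nbsp.
$notNbsp = html_entity_decode('&#xff;', $flags, 'UTF-8');

echo bin2hex($named), ' ', bin2hex($dec), ' ', bin2hex($hex), "\n";
echo bin2hex($latin1), "\n";
echo bin2hex($notNbsp), "\n";
```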


- - -

But let's go back to your problem again:

I just don't understand where C2 comes from or why it's there. I
would think that <00 A0> would be more appropriate.

When your document (web page) doesn't specify which
character encoding it is using, the browser will have to
guess. Many browsers will use cp1252 or similar. Others
might use UTF-8, or inspect the document and guess which
is more appropriate. Some browsers can be configured to
prefer a particular encoding.

When the form is submitted, the form control values
are encoded using whichever character encoding the
browser has settled on. If your browser has settled on
UTF-8, the `&nbsp;´ in your form will be sent as <C2 A0>,
because character references can only be used in the HTML
document. In URLs, the resulting bytes are percent-encoded
instead (e.g. %C2%A0).
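
To see the percent-encoding step in isolation, here is a short PHP
sketch. It only mimics what the browser does with the bytes; the
variable names are made up, and the ISO-8859-1 case is included as
an assumption about what a differently configured browser would send.

```php
<?php
// The value from the form: three nbsp, "Z", three nbsp, as UTF-8 bytes.
$nbsp  = "\xC2\xA0";
$value = str_repeat($nbsp, 3) . 'Z' . str_repeat($nbsp, 3);

// urlencode() percent-encodes each non-alphanumeric byte
// (except "-", "_" and "."; a space becomes "+").
$encoded = urlencode($value);
echo $encoded, "\n";

// A browser that had settled on ISO-8859-1 would send one byte per nbsp.
$latin1Value   = str_repeat("\xA0", 3) . 'Z' . str_repeat("\xA0", 3);
$latin1Encoded = urlencode($latin1Value);
echo $latin1Encoded, "\n";
```

The first line printed matches the string seen in the thread; the
second shows that the same form value produces different bytes on the
wire when the browser uses a single-byte encoding.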

And here's what is going wrong: Your server side
script is expecting the form submission to be encoded
in a single-byte encoding (such as cp1252 or iso-8859-1
or similar). The sequence %C2%A0 is interpreted as two
characters rather than one character.

Which two characters would that be then? Well that,
again, depends on which character encoding your script
expects from the form submission:

Encoding Characters
-------- ----------
iso-8859-1: U+00C2, U+00A0 (A-circumflex, nbsp)
cp850: U+252C, U+00E1 (box drawing character, a-acute)
cp1252: U+00C2, U+00A0 (A-circumflex, nbsp)
cp874: U+0E22, U+00A0 (Thai YO YAK, nbsp)
KSC5601: U+D63B (Hangul HIEUH-O-KIYEOKSIOS)
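
The first row of the table can be reproduced with a few lines of
PHP, assuming the mbstring extension is available. This is only a
sketch of the misinterpretation, using ISO-8859-1 as the single-byte
encoding; the variable names are illustrative.

```php
<?php
$bytes = "\xC2\xA0"; // what the browser sent for one nbsp

// Read as UTF-8 the two bytes are one character; read as
// ISO-8859-1 they are two.
$asUtf8   = mb_strlen($bytes, 'UTF-8');
$asLatin1 = mb_strlen($bytes, 'ISO-8859-1');
echo $asUtf8, ' ', $asLatin1, "\n";

// Misreading the bytes as ISO-8859-1 and converting them to UTF-8
// yields U+00C2 (bytes C3 82) followed by U+00A0 (bytes C2 A0).
$misread = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n";
```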

Therefore, if I simply use:

$submit = str_replace( chr(194), '', $submit );
$submit = str_replace( chr(160), '', $submit );

This is the solution.

Hardly.

If you mean my solution doesn't work, then you are mistaken -- it
works for me.

``This seems to work but I really have no idea what's
going on, so I'll just make random guesses´´

is very far from *the* solution in my mind. :-)
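
To illustrate the difference, here is a hedged sketch that removes
U+00A0 as a character rather than as two unrelated bytes. It assumes
the form submission really is UTF-8 (for instance because the page
was served with charset=utf-8) and PHP 7 or later with mbstring.
The function names strip_nbsp() and strip_bytes() are made up for
this example.

```php
<?php
// Hypothetical helper: remove U+00A0 from a string known to be UTF-8.
// In valid UTF-8 the byte pair C2 A0 can only mean U+00A0.
function strip_nbsp(string $utf8): string
{
    return str_replace("\xC2\xA0", '', $utf8);
}

// The byte-wise approach from earlier in the thread, for comparison.
function strip_bytes(string $s): string
{
    return str_replace(chr(160), '', str_replace(chr(194), '', $s));
}

$submit = "\xC2\xA0\xC2\xA0\xC2\xA0Z\xC2\xA0\xC2\xA0\xC2\xA0";
$other  = "\xC2\xA35 \xC3\xA0 la carte"; // pound sign (C2 A3), a-grave (C3 A0)

var_dump(strip_nbsp($submit) === 'Z');   // both approaches handle this input
var_dump(strip_bytes($submit) === 'Z');
var_dump(strip_nbsp($other) === $other); // no nbsp, so nothing is removed
var_dump(mb_check_encoding(strip_bytes($other), 'UTF-8')); // byte-wise result
```

Both functions return 'Z' for the string in question, but the
byte-wise version also deletes the C2 of the pound sign and the A0
of the a-grave, leaving a string that is no longer valid UTF-8.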

This entire encoding process is more involved than it looks, or so it
appears to me.

More reading in no particular order:

The Unicode Standard:
<http://unicode.org/>
Unicode character repertoire:
<http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
Unicode encodings:
<http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf>
Other encodings:
<http://www.unicode.org/Public/MAPPINGS/>
RFC 3629 (UTF-8):
<http://www.rfc-editor.org/rfc/rfc3629.txt>
HTML, Character sets and encodings:
<http://www.w3.org/TR/html401/charset.html>
HTML, Form submission:
<http://www.w3.org/TR/html401/interact/forms.html#h-17.13>
Jukka K. Korpela on Characters and Encodings:
<http://www.cs.tut.fi/~jkorpela/chars/index.html>
The late Alan J. Flavell on internationalization:
<http://web.archive.org/web/20060924054022/ppewww.ph.gla.ac.uk/~flavell/charset/internat.html>


/Nisse