Re: Binary-safe?



| >
| > | This to me implies that the function need not know what the bytes
| > | represent, it operates on the data as a raw byte stream.
| >
| > that's correct.
| >
|
| Thus, by this definition, wouldn't strpos NOT be binary safe, since it
| needs to know something about what is represented by the raw byte
| stream? In particular, that the byte 0x00 represents the end of a
| string.

not really. that function does not rely on character encoding for the bytes
being interpreted. look at strcoll...it does, and for that reason (data
needs interpreting) it is not considered 'safe'.


But (IMO) the function does rely on bytes being interpreted, in
particular the byte 0. I can see how strcmp and strcoll differ in their
implementations (since, in strcoll, a lower byte value doesn't
necessarily imply a higher alphabetic precedence, as it does for
ASCII), but I'm still a bit lost how one is binary-safe and the other
isn't.

| Going back to my example, say we pass in strpos('a','cat') with the
| strings encoded in UCS-2.
| So, in terms of bytes, strpos would be passed in 0x00 0x16 as the first
| parameter. Because the function imposes some meaning on specific bytes,
| in particular 0x00, the function would conclude that the first
| parameter was an empty string. Strpos can't blindly operate on the
| bytes it receives, it must interpret them to find the end of strings.

no...'00 16' (the letter 'a' in ucs-2) would be seen as ascii character 48
followed by another 48, followed by the asc char for a space followed by the
asc char for 1, etc. that's the literal string contents for 'a' in ucs-2. if
you searched that literal string for 'a', you would find nothing. if you
converted the string value of '00 16' from ucs-2 then you'd have the letter
'a'...and completely different search results. as for blindly searching for
\0, that's just not what is happening.


Hmm, I would still have to disagree with you.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 states that

"An ASCII or Latin-1 file can be transformed into a UCS-2 file by
simply inserting a 0x00 byte in front of every ASCII byte."

That's one example. Every other explanation of UCS-2 I have found agrees
with this. Note that the '0x' just says that the following number is
written in hexadecimal, thus 0x10 is actually 16 in decimal.

'a' in ASCII is 0x61 (or decimal 97).
'a' in UCS-2 is 0x0061 (as said above, we just inserted a 0x00 byte in
front of the ASCII byte)
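
To make the byte values concrete, here is a small sketch in Python (chosen only because it shows raw bytes easily; it says nothing about how PHP stores strings). It assumes big-endian byte order, as the quote above implies, and uses Python's "utf-16-be" codec as a stand-in for UCS-2, since the two coincide for characters in the Basic Multilingual Plane.

```python
# Assumption: "utf-16-be" stands in for big-endian UCS-2 here; the two
# encodings produce the same bytes for a character such as 'a'.
ascii_a = "a".encode("ascii")
ucs2_a = "a".encode("utf-16-be")

print(ascii_a.hex())   # 61
print(ucs2_a.hex())    # 0061
print(list(ucs2_a))    # [0, 97]
```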

What you're proposing reminds me of quoted-printable encoding. If 'a',
when encoded in UCS-2, were stored in memory as 0x30 30 36 31, which is
the ASCII encoding of the string '0061', then I could see how strpos
would not assume that an empty string was passed in as the first
parameter. But I don't think this is the case.

The same page (http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8) also
supports my view. It explains that using UCS-2 would break the existing
C string functions on Unix, because 0x00 has a special meaning in these
functions (in particular, it marks the end of a string/array). As you
have noted, C and PHP both view strings as arrays of byte values that
are null terminated. Thus PHP would have the same problems described
with using UCS-2 strings.
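
A rough sketch of the problem being described, again in Python. The helper c_style_length is made up for illustration: it mimics how a NUL-terminated reader such as C's strlen finds the end of a string, and it is not a claim about how PHP's strpos is actually implemented. The last line shows what a search does when it is given explicit lengths and treats 0x00 as an ordinary byte.

```python
def c_style_length(buf: bytes) -> int:
    """Hypothetical helper: length as a NUL-terminated reader would see it."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

# Assumption: "utf-16-be" stands in for big-endian UCS-2.
ucs2_cat = "cat".encode("utf-16-be")   # bytes 00 63 00 61 00 74
ucs2_a = "a".encode("utf-16-be")       # bytes 00 61

# A NUL-terminated reader stops at the very first byte.
print(c_style_length(ucs2_a))      # 0
# The real length of the data is two bytes.
print(len(ucs2_a))                 # 2
# A length-based byte search finds 00 61 inside 00 63 00 61 00 74.
print(ucs2_cat.find(ucs2_a))       # 2
```

The difference between the first and last results is the distinction at issue: whether the function takes the end of the data from a length or from a special byte value.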


what is it that you're trying to do? perhaps i can give an example that will
work and clear up your questions at the same time.

All I'm trying to do is understand what is meant by a function being
'binary-safe'. I initially got onto the topic because I need to write a
Chinese website in PHP, but I think my curiosity has diverted me from
my original course. A clear and concise definition of what a
binary-safe function is, along with a few examples of functions that
are and are not binary-safe, would be great.


your first example and this one, as far as strings go, are completely
different. they both, however, are interpreted one character at a time. in
this case, a 0 followed by an x, two more zeros, a 1, a 6, then two more
zeros. the string has no particular meaning. php does not know that it is a
particular encoded representation of data (such as ucs-2). you could likewise
represent 0x001600 in octal format and php would be equally unaware of the
string's particular meaning.


The 0x00 16 00 was supposed to be 0x00 61 00. With the '0x' I'm
indicating that the numbers I am writing are in hex. Looking at it
another way:

first_parameter[0] == 0
first_parameter[1] == 97 (in decimal)
first_parameter[2] == 0 (the null termination)


this is why you must somehow tell php that a string is to be interpreted a
certain way...such that the value would then become (or be seen as) the
letter 'a'. make sense?
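
A small sketch of that point in Python (not PHP; the codec names are Python's, and "utf-16-be" is assumed to stand in for big-endian UCS-2). The same two bytes give different text depending on which interpretation is requested.

```python
raw = bytes([0x00, 0x61])             # the two bytes under discussion

as_ucs2 = raw.decode("utf-16-be")     # interpreted as one 16-bit character
as_latin1 = raw.decode("latin-1")     # interpreted as two 8-bit characters

print(repr(as_ucs2))     # 'a'
print(repr(as_latin1))   # '\x00a'
print(len(as_ucs2), len(as_latin1))   # 1 2
```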

This makes sense. What doesn't is the definition(s) of binary-safe :)

Taras
