Re: Strange 'Â' character output when using simplexml load string



Greetings, bizt.
In reply to Your message dated Monday, April 7, 2008, 17:19:28,

I am converting an XML string using the simplexml_load_string function.
For some reason it is giving me a Â character dotted around the text.

 simplexml always outputs in UTF-8. Is your page's encoding UTF-8?

At a guess, ISO-8859-1 or perhaps ISO-8859-15.

In UTF-8, a "prefix" byte of 0xC2 is used to access the first half of the
"Latin-1 Supplement" block (U+0080 to U+00BF), which includes a lot of juicy
characters such as currency symbols, fractions, superscript 2 and 3, the
copyright and registered trademark symbols, and the non-breaking space.

However in ISO-8859-1 and -15, the byte 0xC2 represents an Â, so if UTF-8
is misinterpreted as one of those, then you get Â followed by some other
nonsense character.
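
To make the byte-level picture concrete, here is a minimal sketch using the
non-breaking space (U+00A0) as the example character. It assumes the iconv
extension is loaded; the variable names are only for illustration. Converting
"from ISO-8859-1" simulates a browser that wrongly reads the UTF-8 bytes as
Latin-1.

```php
<?php
// U+00A0 (non-breaking space) encoded as UTF-8 is the two bytes 0xC2 0xA0.
$nbsp = "\xC2\xA0";
echo bin2hex($nbsp), "\n";      // c2a0

// Pretend those two bytes are ISO-8859-1: each byte becomes a character
// of its own. The result is re-encoded as UTF-8 so it can be inspected.
$misread = iconv('ISO-8859-1', 'UTF-8', $nbsp);
echo bin2hex($misread), "\n";   // c382c2a0

// c3 82 is U+00C2 (A with circumflex), c2 a0 is U+00A0 (non-breaking space):
// the stray accented A, followed by what looks like an ordinary space.
```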

Probably the easiest solution would be to take the output from SimpleXML
and pass it through iconv():

        $xmlout = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $xmlout);

Note that UTF-8 is capable of representing a far greater range of
characters than ISO-8859-1/-15 are, so certain characters may not properly
survive conversion. (Using the '//TRANSLIT' option tells iconv to do its
best, and if, say, a particular accented character is not available in
ISO-8859-1, then to substitute an unaccented one in its place.)
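
If you want to know in advance whether a given string will survive, one
possible check is a round trip: convert to ISO-8859-15 and back, then compare
with the original. This is only a sketch; survives_latin9() is a made-up
helper name, and what '//TRANSLIT' substitutes depends on the iconv
implementation PHP was built against.

```php
<?php
// Hypothetical helper (not part of PHP): true if $utf8 converts to
// ISO-8859-15 and back to UTF-8 without any change.
function survives_latin9(string $utf8): bool
{
    $latin9 = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $utf8);
    if ($latin9 === false) {
        return false; // iconv() gave up, e.g. the input was not valid UTF-8
    }
    return iconv('ISO-8859-15', 'UTF-8', $latin9) === $utf8;
}

$plain = "caf\xC3\xA9 \xE2\x82\xAC";  // "cafe" with e-acute, then a euro sign
$fancy = "wait\xE2\x80\xA6";          // ends in U+2026, not in ISO-8859-15

var_dump(survives_latin9($plain));
var_dump(survives_latin9($fancy));
```

The first string uses only characters that exist in ISO-8859-15, so it should
come back unchanged; the second cannot, whatever substitute iconv picks.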

Hi, I've tried what you said, which worked for one of my pages, but when
I tried it on another I got the following:

Notice: iconv() [function.iconv]: Detected an illegal character in
input string in /home/public_html/search_apartments.php on line 67

I'm using the following to convert my XML string, which is fetched via
cURL:

$result = iconv('UTF-8', 'ISO-8859-15//TRANSLIT', $result);

Could it be that I'm not giving iconv() the correct input encoding for my
$result string? If so, is there a way for me to detect the input
encoding?

At a guess, Your "Â" is probably followed by a space, and the pair represents
a non-breaking space.

As for Your trouble with iconv on $result, I think You should deal with the
SOURCE BEFORE using simplexml_load_string, and check which encoding it uses.
If Your source is in, say, ISO-8859-15, then the UTF-8 output can't contain
any characters that can't be converted back to ISO-8859-15.
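
One rough way to look at the source before parsing it is sketched below.
guess_xml_encoding() is a hypothetical helper, not a PHP built-in, and the
final fallback is only a guess: bytes that are not valid UTF-8 could be in
any single-byte encoding.

```php
<?php
// Hypothetical helper: guess the encoding of a raw XML string.
function guess_xml_encoding(string $xml): string
{
    // 1. Trust an explicit encoding="..." in the XML declaration, if present.
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    // 2. No declaration: XML defaults to UTF-8. An empty pattern with the
    //    /u modifier matches only if the whole subject is valid UTF-8.
    if (preg_match('//u', $xml) === 1) {
        return 'UTF-8';
    }
    // 3. Not valid UTF-8 and nothing declared: assume Latin-1 (a guess only).
    return 'ISO-8859-1';
}

// The sample documents below are made up for illustration.
echo guess_xml_encoding('<?xml version="1.0" encoding="iso-8859-15"?><a/>'), "\n";
echo guess_xml_encoding("<a>\xC2\xA0</a>"), "\n"; // valid UTF-8 bytes
echo guess_xml_encoding("<a>\xE9</a>"), "\n";     // lone 0xE9 is not valid UTF-8
```

The result could then be passed to iconv() as the input encoding in place of
a hard-coded 'UTF-8'.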


--
Sincerely Yours, AnrDaemon <anrdaemon@xxxxxxxxxxx>
