Re: what does "serialization" mean?
RobertMaas_at_YahooGroups.Com
Date: 07/07/04
- Next message: Randy Howard: "Re: something like switch in c"
- Previous message: Isaac Gouy: "Re: OOP and memory management"
- In reply to: Corey Murtagh: "Re: what does "serialization" mean?"
- Next in thread: Christopher Barber: "Re: what does "serialization" mean?"
- Reply: Christopher Barber: "Re: what does "serialization" mean?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 07 Jul 2004 09:55:29 -0700
> From: Corey Murtagh <emonk@slingshot.no.uce>
> Both headers are of course UTF-8 compatible... specifically, neither
> uses any multi-byte character sequences or multi-character
> representations.
You might be missing the point. UTF-8 means that each unit is 8 bits
(one byte), and specifics about what these 8 bits represent. UTF-16
means that each unit is 16 bits (two bytes) etc. If you look at just
the printed representation of some text, which appears to be the subset
of characters common to ASCII, UTF-8, and UTF-16, you can't tell how many
bytes were used to represent each character you see. You'd have to look
at the binary dump of the data, or "know" a priori how many bytes each
character uses.
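
To make that concrete, here is a minimal Python sketch. The sample
string "Hello" is just a hypothetical stand-in for any text in the
common subset, and the big-endian codecs without a byte order mark are
my choice for the illustration. The same visible text comes out as
three different byte sequences:

```python
text = "Hello"  # hypothetical sample, all characters in the ASCII range

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")  # big-endian, no byte order mark
as_utf32 = text.encode("utf-32-be")  # big-endian, no byte order mark

for name, data in [("UTF-8", as_utf8), ("UTF-16", as_utf16), ("UTF-32", as_utf32)]:
    dump = " ".join(f"{b:02x}" for b in data)
    print(f"{name}: {len(data)} bytes: {dump}")

# All three decode back to the same printed text.
same_text = (
    as_utf8.decode("utf-8")
    == as_utf16.decode("utf-16-be")
    == as_utf32.decode("utf-32-be")
)
print(same_text)
```

Only the binary dump distinguishes the three; the decoded text is
identical in each case.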
I made the assumption that because I could read the text via Google
Groups and lynx, the data as posted was using one byte per character.
But if NUL bytes are simply invisible and don't show up via lynx, then
it's possible the poster deliberately put NUL bytes between each pair
of adjacent visible characters and we weren't the wiser. I could ask
Google Groups to show me the original format of the message, and
download that to a file, and inspect it with some program that shows
NUL bytes explicitly, but if Google Groups deleted NULs that were in
the original message, then I wouldn't prove anything.
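
The kind of inspection program I mean could be as small as the
following Python sketch. The helper names count_nuls and hex_dump are
made up for this example, and the sample bytes are generated in the
script rather than taken from the actual posting:

```python
def count_nuls(data: bytes) -> int:
    """Return how many NUL (0x00) bytes the data contains."""
    return data.count(0)


def hex_dump(data: bytes, width: int = 16) -> str:
    """Format data as offset, hex bytes, and printable characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show non-printable bytes, including NUL, as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)


# Hypothetical sample: "<?xml" with a NUL byte before each character.
sample = "<?xml".encode("utf-16-be")
print(count_nuls(sample))
print(hex_dump(sample))
```

To check a downloaded file instead, read it with open(path, "rb") so
that nothing is translated or dropped on the way in.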
> Of course one could argue that both headers are also ASCII compliant
> in that they use only characters that exist in the 0..127 range.
> While this is also true, it should be noted that UTF-8 is basically a
> superset of ASCII, and that the UTF-8 characters in the range 0..127
> are /largely/ congruent with the ASCII characters in that range. Not
> completely, but largely.
Well let's say there's a large-majority subset of that 1..127 range
where ASCII and UTF-8 are congruent, and a few odd characters within
that range where ASCII and UTF-8 disagree as to meaning. But the two
lines we were talking about were among that large-majority subset where
ASCII and UTF-8 are congruent, providing that the data was consuming
one byte per character! If that data was actually using two bytes per
character, low-order the same as UTF-8 or ASCII and high-order
all-bits-zero, then it wasn't compatible with UTF-8 or ASCII at all; it
was only compatible with UTF-16. And if it actually used four bytes per
character, low-order compatible with UTF-8 or ASCII and the other three
all-bits-zero, then it was actually UTF-32!! When a program is reading
in a data file from some unknown source, it's really important to know
whether to input one byte or two bytes or four bytes per character! So
my first question remains: What input mode is used initially to read
the XML-version-characterset header line? If the congruent intersection
of ASCII and UTF-8 is used, then my second question remains: at what
point does the data switch from that single-byte-per-character format
to the declared UTF-16 (two bytes per code unit which is usually a full
Unicode character) or UTF-32 (four bytes per code unit which is always
a full Unicode character)?
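
One way a reader could settle the first question is to look at the
first few raw bytes before decoding anything. The sketch below assumes
the byte patterns described in the XML 1.0 specification's
non-normative appendix on encoding autodetection. The function name
guess_xml_encoding and the wording of the declaration are my own
assumptions, and a real parser handles more cases than this:

```python
def guess_xml_encoding(head: bytes) -> str:
    """Guess the encoding family of an XML file from its first bytes."""
    signatures = [
        # Byte order marks. The UTF-32 marks come first because
        # FF FE is also the start of FF FE 00 00.
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xfe\xff", "UTF-16BE"),
        (b"\xff\xfe", "UTF-16LE"),
        (b"\xef\xbb\xbf", "UTF-8"),
        # No mark: look at how the leading "<?" is laid out.
        (b"\x00\x00\x00<", "UTF-32BE"),
        (b"<\x00\x00\x00", "UTF-32LE"),
        (b"\x00<\x00?", "UTF-16BE"),
        (b"<\x00?\x00", "UTF-16LE"),
        (b"<?xm", "UTF-8 or other ASCII-compatible"),
    ]
    for prefix, name in signatures:
        if head.startswith(prefix):
            return name
    return "unknown"


decl = '<?xml version="1.0" encoding="UTF-16"?>'  # assumed wording
for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be"):
    print(codec, "->", guess_xml_encoding(decl.encode(codec)[:4]))
```

Under this scheme the declaration line is itself stored at the
code-unit width of the declared encoding, so the reader picks the width
from the first bytes and never has to switch widths partway through.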
I've created a plain-text Web page that actually contains that XML
declaration in three formats, one in UTF-8 format as posted here, one
in UTF-16 format, and one in backwards UTF-16 format.
http://www.rawbw.com/~rem/utf.txt
I set up a link to that file and viewed it in lynx, where the NULs
don't show up on screen. I then used lynx to download through the same
link, and the NUL characters were there in the resultant downloaded
file, so I know our HTTP server retains them. So can anybody
tell me which of the three formats in that file is correct for that
UTF-16 declaration?
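
For anyone who wants to reproduce the experiment locally, here is a
Python sketch that writes three such forms to a file and checks that
the NUL bytes survive a binary round trip. The declaration wording, the
newline separators, and the file name utf_sample.txt are assumptions
for the example, not a description of the actual utf.txt. Which of the
two UTF-16 byte orders counts as "backwards" is left open:

```python
import os
import tempfile

decl = '<?xml version="1.0" encoding="UTF-16"?>'  # assumed wording

forms = [
    decl.encode("utf-8"),      # one byte per character
    decl.encode("utf-16-be"),  # two bytes, high-order byte first
    decl.encode("utf-16-le"),  # two bytes, low-order byte first
]
payload = b"\n".join(forms) + b"\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf_sample.txt")
    # Binary mode keeps NUL bytes and line endings untouched.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        round_trip = f.read()

nul_count = round_trip.count(0)
print(round_trip == payload, nul_count)
```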