Re: Strings and binary data



aaronfude@xxxxxxxxx wrote:

Are strings designed to hold binary data? For example, can I read an
arbitrary file into a String and then print (or, I guess, write?) the
String to another file, are those files guaranteed to be identical?

No. Strings are designed to hold textual data, and that /always/ is subject to
some form of transformation (possibly an identity transformation) when it is
converted to or from binary.

You can use instances of java.lang.String (or char[]) to hold arbitrary
unsigned 16-bit data, but you should be careful not to let the "system" try to
interpret it as text (which it will do if you try to write a String out). In
general, unless you happen to have a specific need for unsigned 16-bit
quantities, it's better to stick with a pure binary representation (such as a
byte[] array) in such cases.
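
A minimal sketch of why the round trip is not guaranteed, assuming the bytes
are pushed through a String with an explicit charset. The class name, the
helper viaString and the two sample bytes are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryRoundTrip {
    // Decodes raw bytes into a String and encodes them straight back,
    // using the same charset in both directions.
    static byte[] viaString(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two bytes that are not valid UTF-8.
        byte[] data = { (byte) 0xFF, (byte) 0xFE };

        // UTF-8 replaces the invalid bytes while decoding, so the data changes.
        byte[] utf8 = viaString(data, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(data, utf8));    // prints false

        // ISO-8859-1 maps every byte value to a char, so here the
        // transformation happens to be an identity.
        byte[] latin1 = viaString(data, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(data, latin1));  // prints true
    }
}
```

For copying a file unchanged, staying with bytes throughout (for example
java.nio.file.Files.readAllBytes followed by Files.write) avoids the question
entirely.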


On a somewhat related subject, what is an "encoding"? Meaning, when
does it enter into consideration? I always thought of files as just
a collection of bytes.

Files /are/ just collections of bytes ;-)

But Strings, char[] arrays, and even char values, are not. They are
representations of textual data. The textual data has a meaning above and
beyond what is in its representation. Think of it like this, say we start with
a word:
Snark

If we want to manipulate that in a Java program, then we'll probably use a
String object:
"Snark"

(Internally that is represented in the computer's memory by a sequence of
unsigned 16-bit integer values:
0x0053 0x006E 0x0061 0x0072 0x006B
but that is not important to you for most purposes -- what matters are the
characters in the string, not how they are represented physically.)
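
If you want to see those values for yourself, something like the following
should do. It is only a sketch; the class and method names are my own:

```java
class CharValues {
    // Formats each 16-bit char of the String as a 4-digit hex value.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 0x0053 0x006E 0x0061 0x0072 0x006B
        System.out.println(hexUnits("Snark"));
    }
}
```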

Now suppose we want to put that word, Snark, into a file. It's an abstraction,
not something that's made of bits and bytes, so we have to /represent/ it as
such before we can put those bytes into a file. This applies especially in
Java (which distinguishes between String and binary data better than many
languages). When you put the word into a file you have to choose a
representation -- that's to say a mapping from abstract characters (or
whatever) to actual bytes. One widely used representation is ASCII which
assigns byte values to a small set of the characters used in English.
Another, which covers roughly the same range, but uses different numerical
values is EBCDIC. These mappings are called character encodings, character
sets (charsets), or sometimes "code pages". The big daddy of character
encodings is Unicode (of which more below).

Let's say that we want to represent the word Snark as ASCII in a file. The
corresponding bytes would be:
0x53 0x6E 0x61 0x72 0x6B
If we wanted to use EBCDIC then the bytes would be different (I can't be
bothered to look up what they would be). If we wanted to use the Unicode format
called UTF-16, then we have two variants, one using Intel byte order
(little-endian):
0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B 0x00
the other using "network byte order" (big-endian)
0x00 0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B
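
The byte sequences above can be reproduced with String.getBytes and the
constants in java.nio.charset.StandardCharsets. A small sketch (the class and
the helper names hex and encode are mine, not part of any library):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeSnark {
    // Formats a byte array as space-separated 2-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the bytes as hex.
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        // prints 0x53 0x6E 0x61 0x72 0x6B
        System.out.println(encode("Snark", StandardCharsets.US_ASCII));
        // prints 0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B 0x00
        System.out.println(encode("Snark", StandardCharsets.UTF_16LE));
        // prints 0x00 0x53 0x00 0x6E 0x00 0x61 0x00 0x72 0x00 0x6B
        System.out.println(encode("Snark", StandardCharsets.UTF_16BE));
    }
}
```

Note that the plain "UTF-16" charset would also write a byte-order mark in
front of the text; the LE and BE variants used here do not.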

The "Snark" example doesn't really show what's going on. So let's rename
Snarks:
Snørk

(On this machine that's using an o-with-a-slash-through-it instead of the 'a'.
I hope that's how it looks where you are reading it, if not then just pretend
it does...)

The corresponding Java String object would contain the integer values:
0x0053 0x006E 0x00F8 0x0072 0x006B

Now if I want to write our new word to a file, then I may have a problem. I
can't use the ASCII representation, because ASCII doesn't have a mapping for
the slashed-o character! So I have to use a different mapping. My machine is
set up, as it happens, to use a mapping called 'windows-1252' (which is one of
the Microsoft code pages), it is almost identical to the ISO charset called
ISO-8859-1. In either of those our word would be represented as:
0x53 0x6E 0xF8 0x72 0x6B

(The similarity to Java's internal representation is not a coincidence: the
first 256 Unicode code points are the same as ISO-8859-1.)
But if I were using a different machine, one which used a different
representation by default, then the word would be represented by different
bytes. For instance, if this machine were set up to expect me to be writing
Polish (Windows code page 1250, or ISO-8859-2) then I'd be in trouble again,
because neither of those code pages has a representation of that character
(they use the number 0xF8 to represent the letter r-with-a-caron instead). So
if someone in Poland attempted to read the file that I wrote in code page
windows-1252, then they wouldn't see the right characters if their machine was
using the Windows-1250 encoding.
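
Here is a rough sketch of that mismatch. It uses ISO-8859-1 and ISO-8859-2
rather than the two Windows code pages, on the assumption that the JVM ships
ISO-8859-2 (only a handful of charsets are guaranteed to be present, and
ISO-8859-2 is not one of them). The class and method names are invented:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharset {
    // The word with the slashed o, written with an escape so that the
    // encoding of this source file does not matter.
    static final String WORD = "Sn\u00F8rk";

    // Encodes text with one charset and decodes the bytes with another,
    // as happens when writer and reader disagree about the encoding.
    static String misread(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    // Reports whether plain ASCII has a mapping for every character.
    static boolean asciiCanEncode(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(asciiCanEncode(WORD));  // prints false

        // Assumption: ISO-8859-2 is available in this JVM.
        Charset latin2 = Charset.forName("ISO-8859-2");
        String seen = misread(WORD, StandardCharsets.ISO_8859_1, latin2);

        // The third character is now r-with-a-caron (U+0159).
        System.out.println(Integer.toHexString(seen.charAt(2)));  // prints 159
    }
}
```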

So there are two problems with code pages and charsets generally. One is that
they don't contain the same characters, and the other is that they may not map
the same characters to the same numbers. That's why you always have to specify
a charset when you are converting between binary data and textual data in Java
(even if you don't specify one explicitly, the system will be using a
default -- which might or might not be correct).
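
In code, naming the charset usually means passing it to the Reader or Writer
that sits between your text and the bytes. A minimal sketch, using in-memory
streams in place of real files to keep it short (the class and method names
are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Text -> bytes, with the charset named by the caller.
    static byte[] writeText(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and so flushed) before the bytes are taken.
        return out.toByteArray();
    }

    // Bytes -> text, again with the charset named by the caller.
    static String readText(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // What the system would pick if no charset were given.
        System.out.println(Charset.defaultCharset());

        byte[] bytes = writeText("Sn\u00F8rk", StandardCharsets.UTF_8);
        String back = readText(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals("Sn\u00F8rk"));  // prints true
    }
}
```

The same idea applies to files: the constructors and factory methods that take
a Charset argument are the ones to prefer over those that silently fall back to
the default.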

This is where Unicode comes in. It provides a fixed mapping that is supposed
to be complete (for some given meaning of "complete") and universal. So the
problems of knowing which charset to use just go away. But there are still two
problems: one is that not everybody uses Unicode, so you will almost certainly
have to deal with text files containing ISO-8859-1 data (for instance), as well
as nice reliable Unicode -- in fact they are so common that Unicode can't even
be made the default :-( The second problem is that there are a /lot/ of
Unicode characters defined, too many to fit into 8 bits, or even into 16. So
Unicode defines several physical representations of the abstract numbers, which
have various tradeoffs between complexity and space. In the physical encoding
known as "UTF-8", for instance, which attempts to provide a compact
representation of mostly English text, our word would be written to file as:
0x53 0x6E 0xC3 0xB8 0x72 0x6B

In the encoding known as UTF-16, which is optimised for text which mostly needs
16 bits per character, there are two variants, big-endian and little-endian.
The little endian representation is:
0x53 0x00 0x6E 0x00 0xF8 0x00 0x72 0x00 0x6B 0x00
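
A quick way to check both byte sequences, and to see the space tradeoff, is
sketched below. The class and the hex helper are my own invention; only
getBytes and StandardCharsets come from the JDK:

```java
import java.nio.charset.StandardCharsets;

class Utf8VsUtf16 {
    // Formats a byte array as space-separated 2-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "Sn\u00F8rk";  // the word with the slashed o

        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = word.getBytes(StandardCharsets.UTF_16LE);

        // prints 0x53 0x6E 0xC3 0xB8 0x72 0x6B
        System.out.println(hex(utf8));
        // prints 0x53 0x00 0x6E 0x00 0xF8 0x00 0x72 0x00 0x6B 0x00
        System.out.println(hex(utf16le));

        // Five characters: 6 bytes in UTF-8, 10 bytes in UTF-16.
        System.out.println(utf8.length + " vs " + utf16le.length);  // prints 6 vs 10
    }
}
```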

(BTW. The variations I have shown are all quite similar -- that's because most
character encodings tend to be similar for English characters. The further
away from English you get, the more the various encodings diverge.)

Finally we come to the bottom line. Even with Unicode, you always have to have
a mapping between text and binary. If you get the mapping wrong then you are
in trouble. Don't, if you can possibly help it, manipulate binary data as
text, or textual data as binary.

-- chris

