Re: Stream and Encoding Confusion

"Rhino" <no.offline.contact.please@xxxxxxxxxx> wrote in message
news:a0Kgg.326$Wy.24191@xxxxxxxxxxxxxxxxxxxxxxxx
A friend and I are having a friendly competition that is causing me some
conceptual confusion. I am hoping someone can help me clarify things a
little.

We are each writing programs to read an input file and count the number of
each distinct character in the file; he is writing his program in
Perl and I am writing mine in Java. The main output of the program will be
a simple list that says the program found so many of each character; we
want to report the letters of the alphabet as well as accented letters,
punctuation, and whitespace characters, including carriage returns and
linefeeds. We have two input files at the moment, a text file and an MP3
file. There is no money or serious rivalry involved; we are simply curious
about how each will look if properly written. We also wonder how the
performance will compare, although that is quite unimportant to both of
us.

I have a couple of areas of confusion:
a. character streams vs. byte streams
b. the issue of encoding.

Since I'd like to be able to read any type of file in any language,
including text files, MP3s, and many others, should I always be treating
the input file as a character stream or do I need to somehow detect which
ones are best read as character streams and which are best read as byte
streams? If I need to treat the two types differently, how do I detect
which type the input file is? I would rather not rely on the user knowing
whether a file that he wants to give the program is best suited to being
treated as a character stream or a byte stream. I've read the conceptual
information about this in the Java Tutorial and find that it really
doesn't address this issue clearly.

I'm also somewhat concerned about encoding. I honestly don't understand
exactly how encoding works and apologize if this is a dumb question but
this seemed like a good place to get someone to point me to a proper
discussion of this issue. Do I need to know how a file is encoded before I
open it and decide which kind of stream it is? Or is there some way to
determine what encoding the file is using by simply examining the file?
Again, I want to be able to read a file and count the characters without
the provider of the file having to tell me what encoding it uses since the
provider, quite likely, wouldn't know.

This issue was addressed not long ago here:

http://groups.google.com/group/comp.lang.java.programmer/browse_thread/thread/1d2a1d6bb48b681/08095f861a95f75a?lnk=st&q=%22recognising+file+type%22+group%3Acomp.lang.java.programmer&rnum=1&hl=en#08095f861a95f75a

You can find more about character encoding here

http://mindprod.com/jgloss/encoding.html

In summary, there is no way to perfectly distinguish character data from
non-character data, yet you must tell them apart in order to use the
right kind of stream. All data is binary data. By convention (common
agreement) some binary patterns are used to represent text, characters
(including digits), numbers of all types (not digits), application data
structures, etc. In particular, the conventions for characters are given
names that identify the encoding--the mapping of byte values or code point
numbers to specific logical characters. You can always read binary data,
but you must know the encoding in order to make any sense out of it.
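To make that concrete, here is a minimal sketch (the class and method names are mine, purely for illustration) that decodes the same two bytes under two different encodings. The bytes never change; only the interpretation does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentText {
    // Interpret the given bytes as text under the given charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // The two bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9 (e with acute).
        byte[] data = { (byte) 0xC3, (byte) 0xA9 };

        String asUtf8 = decode(data, StandardCharsets.UTF_8);
        String asLatin1 = decode(data, StandardCharsets.ISO_8859_1);

        // Under UTF-8 the two bytes form one character.
        System.out.println("UTF-8 length:      " + asUtf8.length());
        // Under ISO-8859-1 every byte is its own character (U+00C3, U+00A9).
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

A character counter run over this data would report one accented letter in the first case and two unrelated symbols in the second, and nothing in the bytes says which answer is right.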

What makes this problem troublesome is that the identity of the encoding is
not in the data itself. Well, it is for some (e.g. XML has an encoding
declaration and some kinds of application data files like GIF start with a
specific 6-byte signature), but because it's not there for all of them there
is no reliable way to distinguish whether you have (for example), an MP3 file
or some other unknown type of data file. Often you're better off just
checking the file extension, although that won't tell you the text encoding.
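For the formats that do carry a marker, checking the first few bytes is straightforward. The following is only a sketch: the class name, the labels it returns, and the short list of signatures are my own choices, and a match is evidence rather than proof. GIF files begin with the ASCII bytes GIF87a or GIF89a, an MP3 that carries an ID3v2 tag begins with ID3, and Unicode text files sometimes begin with a byte order mark (BOM).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class SignatureSniffer {
    // True if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Guess a file type from its leading bytes; the labels are illustrative only.
    static String sniff(byte[] head) {
        if (startsWith(head, ascii("GIF87a")) || startsWith(head, ascii("GIF89a"))) {
            return "GIF image";
        }
        if (startsWith(head, ascii("ID3"))) {
            return "probably MP3 (ID3v2 tag)";
        }
        if (startsWith(head, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
            return "probably UTF-8 text (BOM)";
        }
        if (startsWith(head, new byte[] { (byte) 0xFE, (byte) 0xFF })) {
            return "probably UTF-16BE text (BOM)";
        }
        if (startsWith(head, new byte[] { (byte) 0xFF, (byte) 0xFE })) {
            return "probably UTF-16LE text (BOM)";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n = Math.max(in.read(buf), 0); // read() returns -1 for an empty file
            System.out.println(sniff(Arrays.copyOf(buf, n)));
        }
    }
}
```

Note how weak the evidence can be: the bytes FF FE also begin the UTF-32LE byte order mark, and plenty of text files and MP3s carry no marker at all, in which case this returns "unknown".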

But with some smart decisions, you can often guess reasonably (if
imperfectly) at the format. Check out the links above to see how this might
work.
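One such guess, sketched here under my own naming, is to ask whether the bytes form valid UTF-8. Random binary data rarely does, so a strict decode that succeeds is a reasonable hint that you are looking at text. It is still only a hint: pure ASCII passes, and so would a binary file that happens to contain only valid sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Guess {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] binary = { (byte) 0xFF, (byte) 0xFB, (byte) 0x90, 0x00 };
        System.out.println(looksLikeUtf8(text));
        System.out.println(looksLikeUtf8(binary));
    }
}
```

The byte 0xFF can never appear in well-formed UTF-8, which is why the second array is rejected.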

The bottom line is that you have to know the encoding in order to read the
file. To simplify your problem, you could, for example, limit yourselves to
one of the standard encodings (UTF-8, UTF-16) and work just with text. Or
designate one or two kinds of specific file types, e.g. MP3. No one has a
general-purpose interpreter that will give the correct answer to this
question for every file everywhere. And if someone did, I could make it give
the wrong answer by cooking up a data file in some new format whose contents
happen to look like an existing format.
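If you do settle on "text files in UTF-8", the counting itself is short. This is a minimal sketch rather than a finished program; the class and method names are made up for the example, and it assumes the file fits in memory. It counts Unicode code points rather than Java chars, so characters outside the 16-bit range are counted once, not twice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class CharCounter {
    // Count how often each Unicode code point occurs in the text.
    static Map<Integer, Long> countCodePoints(String text) {
        Map<Integer, Long> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> counts.merge(cp, 1L, Long::sum));
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file named on the command line is UTF-8 text.
        // new String(...) silently replaces malformed bytes with U+FFFD,
        // so a binary file yields a large count for that code point.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(raw, StandardCharsets.UTF_8);

        for (Map.Entry<Integer, Long> e : countCodePoints(text).entrySet()) {
            // Print the code point in hex so that CR, LF and tab stay readable.
            System.out.printf("U+%04X %d%n", e.getKey(), e.getValue());
        }
    }
}
```

For a file you have decided to treat as binary, such as the MP3, the equivalent is a 256-entry histogram of byte values, with no decoding step at all.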

Cheers,
Matt Humphrey matth@xxxxxxxxxxxxxx http://www.iviz.com/

