Re: Binary or Ascii Text?



"osmium" <r124c4u102@xxxxxxxxxxx> writes:
"Claude Yih" writes:

The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are
present
in text files. So if you have quite a few of the other 25 or so codes, it
is
probably not a text file - but it's only an educated guess, no real proof.


Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F? In that case, my method will make no
sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.

The quoting is completely messed up. The paragraph starting with "The
best you can do" was written by osmium, the next three paragraphs
where written by Claude Yih, and the last paragraph, starting with "It
doesn't work", was written by osmium (who usually gets this stuff
right).

--
Keith Thompson (The_Other_Keith) kst-u@xxxxxxx <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
.