Re: Binary or Ascii Text?
- From: Keith Thompson <kst-u@xxxxxxx>
- Date: Fri, 31 Mar 2006 20:44:15 GMT
"Claude Yih" <wing0630@xxxxxxxxx> writes:
osmium writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.
Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.
Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.
I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)
As mentioned above, you should also check for control characters.
However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F。 In that case, my method will make no
sense.
Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).
And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.
Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.
If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
--
Keith Thompson (The_Other_Keith) kst-u@xxxxxxx <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
.
- References:
- Binary or Ascii Text?
- From: Claude Yih
- Re: Binary or Ascii Text?
- From: Claude Yih
- Binary or Ascii Text?
- Prev by Date: Re: What will happen if main called in side main function?
- Previous by thread: Re: Binary or Ascii Text?
- Next by thread: Re: Binary or Ascii Text?
- Index(es):
Relevant Pages
|