Re: Binary or Ascii Text?



"Claude Yih" <wing0630@xxxxxxxxx> writes:
osmium writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes, and only a few of them (CR, LF, FF, HT (tab), ...) are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.

Well, as a matter of fact, I just got an idea for handling that problem,
but I don't know if it is feasible.

We know that ASCII text uses only 7 bits of each byte, so the high bit
is always 0. So I wonder if I could write a program that reads a
fixed-length prefix of a given file (for example, the first 1024 bytes),
stores it in an unsigned char array, and checks whether any element is
greater than 0x7F. If so, the file can be judged to be a binary file.

I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)

As mentioned above, you should also check for control characters.
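
A minimal sketch of the combined check (high bit plus control codes),
assuming an ASCII-based system. The function names and the 5% tolerance
for stray control codes are made up for illustration, and this is not
a description of what "file" actually does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if buf[0..len) looks like 7-bit ASCII
 * text, 0 otherwise. This is a heuristic, not a proof. */
int looks_like_ascii_text(const unsigned char *buf, size_t len)
{
    size_t i, suspicious = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c > 0x7F)
            return 0;           /* high bit set: not 7-bit ASCII */
        if (c < 0x20 || c == 0x7F) {
            /* Control code. Allow HT, LF, FF, CR; count the rest. */
            if (c != 0x09 && c != 0x0A && c != 0x0C && c != 0x0D)
                suspicious++;
        }
    }
    /* Assumed threshold: tolerate up to 5% unexpected control codes. */
    return suspicious * 20 <= len;
}

/* Hypothetical wrapper: examines only the first 1024 bytes of a file.
 * Returns 1 (looks like text), 0 (looks binary), or -1 (cannot open). */
int file_looks_like_ascii_text(const char *path)
{
    unsigned char buf[1024];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode: we want raw bytes */

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    return looks_like_ascii_text(buf, n);
}
```

An empty file is reported as text here; whether that is the right
answer depends on the program.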

However, the disadvantage of the above method is that it cannot handle
multi-byte characters. Take Japanese text in UTF-8, for example: a
Japanese character may be encoded as three bytes, each of which is
greater than 0x7F. In that case, my method breaks down.

Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).
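
For the UTF-8 case specifically, one partial workaround is to accept
bytes above 0x7F only when they form well-formed UTF-8 sequences.
Here is a sketch, with a made-up function name; it says nothing about
other encodings, and ISO-8859-1 text containing accented characters
will usually fail the test even though it is perfectly good text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. A sequence cut off at the end of the buffer counts as
 * invalid, which matters if only a fixed-length prefix is examined. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra, k;
        unsigned long cp, min;

        if (c <= 0x7F) {                    /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* 2-byte sequence */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* 3-byte sequence */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* 4-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;                       /* truncated sequence */
        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min)
            return 0;                       /* overlong encoding */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                       /* out of range or surrogate */
        i += extra + 1;
    }
    return 1;
}
```

Pure ASCII input passes this check too, so it does not replace the
control-character test; it only relaxes the "nothing above 0x7F" rule.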

And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.

Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.

If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
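
As an illustration of one such convention, here is a sketch that
rewrites each non-printable byte as \xHH. The function name and the
choice of notation are only assumptions for the example; note that
isprint() depends on the current locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies in[0..len) to out, replacing every byte
 * that is not printable (per isprint) with the four characters \xHH.
 * Newlines and tabs pass through unchanged. The output is always
 * NUL-terminated; copying stops early if out is too small.
 * Returns the number of characters written, not counting the NUL. */
size_t escape_nonprintable(const unsigned char *in, size_t len,
                           char *out, size_t outsize)
{
    size_t i, used = 0;

    assert(out != NULL && outsize > 0);
    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (isprint(c) || c == '\n' || c == '\t') {
            if (outsize - used < 2)     /* need room for char + NUL */
                break;
            out[used++] = (char)c;
        } else {
            if (outsize - used < 5)     /* need room for \xHH + NUL */
                break;
            sprintf(out + used, "\\x%02X", (unsigned)c);
            used += 4;
        }
    }
    out[used] = '\0';
    return used;
}
```

A caller could pass each input line through this before printing it,
and apply wrapping or truncation to the escaped result.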

--
Keith Thompson (The_Other_Keith) kst-u@xxxxxxx <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.