Re: Recognising file type (ascii/binary)




"Matt Humphrey" <matth@xxxxxxxxxxxxxx> wrote in message
news:XpSdneq4Oe9H0PzeRVn-gw@xxxxxxxxxxxxxxx
>
> "Bruce Lee" <blah@xxxxxxxxxxxxxxxxxxxx> wrote in message
> news:P068f.23301$Ih5.7913@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>> Is there any easy way to get Java to determine whether a file is a binary
>> file or plain text ascii file?
>
> Files are simply sequences of (binary) bytes--there's no way to tell
> whether it's supposed to contain only bytes that represent printable ascii
> (or unicode) or any particular binary pattern. You can read the file to
> find out--if you find values that signify unlikely or non-printable
> characters you can deem the file binary or corrupt. Similarly, there are
> heuristics (based on convention) for guessing the "type" of the file based
> on the first few bytes, but there's no guarantee these are correct either.
> (And files with 2-byte UNICODE characters can really confuse things.)
>
> Of course, you could require that text files end in "txt" or
> something--it's no worse than any of the above and significantly easier.

Matt Humphrey is completely correct. However, in addition to the heuristic
of looking for unprintable characters, another trick is to check whether the
newline sequence is consistent. It should always be either "\n" (UNIX-like
systems), "\r" (classic Mac OS) or "\r\n" (Windows-like systems). If the file
switches between these, it probably isn't a valid ASCII text file on any of
those three platforms.
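
For what it's worth, here is a rough sketch of that newline check. The class
and method names (NewlineCheck, hasConsistentNewlines) are made up for
illustration, and it assumes the file is small enough to read into memory in
one go. Treat it as a heuristic, not a definitive test.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not from the thread): reports whether the data
// sticks to a single newline convention.
class NewlineCheck {

    /** Returns true if the data uses at most one of "\n", "\r" or "\r\n". */
    static boolean hasConsistentNewlines(byte[] data) {
        boolean sawLf = false, sawCr = false, sawCrLf = false;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    sawCrLf = true;
                    i++; // skip the '\n' that belongs to this "\r\n" pair
                } else {
                    sawCr = true; // a lone '\r'
                }
            } else if (data[i] == '\n') {
                sawLf = true; // a lone '\n'
            }
        }
        int styles = (sawLf ? 1 : 0) + (sawCr ? 1 : 0) + (sawCrLf ? 1 : 0);
        return styles <= 1; // zero newlines also counts as consistent
    }

    public static void main(String[] args) throws IOException {
        // Assumes the file path is passed as the first argument.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasConsistentNewlines(data) ? "consistent" : "mixed");
    }
}
```

A file with no newlines at all passes this check, so it only makes sense in
combination with the unprintable-character test.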

You could also treat files containing 2-byte UNICODE characters as
"non-ASCII" and lump them in with the "binary files" category.
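
If you go that route, the unprintable-character test becomes a simple byte
scan. Below is a minimal sketch; the names (AsciiCheck, looksLikeAscii) are
hypothetical, and the set of bytes accepted as "text" (printable ASCII plus
tab, LF and CR) is my own assumption, so adjust it if your files legitimately
contain form feeds or similar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not from the thread): a byte-level guess at
// "plain ASCII text" versus "binary".
class AsciiCheck {

    /** Returns true if every byte is printable ASCII, tab, LF or CR. */
    static boolean looksLikeAscii(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF; // Java bytes are signed; view as 0..255
            boolean printable = v >= 0x20 && v <= 0x7E;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if (!printable && !whitespace) {
                return false; // NUL, other control bytes, or anything >= 0x7F
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Assumes the file path is passed as the first argument.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksLikeAscii(data) ? "ascii" : "binary");
    }
}
```

Text encoded as UTF-16 fails this test because of its zero bytes, and UTF-8
text with any non-ASCII character fails because of bytes above 0x7F, so both
end up in the "binary" bucket as described above.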

- Oliver

