Re: Binary v. Text, why is it faster?



Arctic Fidelity wrote:
> I have constantly seen and heard that reading binary data is faster than
> reading textual data. I have always presumed this to be a fact. But now I
> am at the point where I would like to understand why.
>
> I was trying to think about it, and it has rather confused me. To my
> understanding, reading a text file is reading in the bytes which
> correspond to, for example, ASCII character codes. But if we are dealing
> with a 1-byte character encoding, how is it slower to read in 'a' rather
> than some binary representation of that?
>
> And in addition to this, what is the actual difference between binary and
> textual files? I had always thought that a binary file was simply a file
> composed of any combination of bytes, whereas a text file was a file
> composed of a limited subset of the bytes available to a binary file. Am I
> misunderstanding something here?

Actually, there is no difference. But since text files are often
interpreted as such, it is wise to limit the contents to the proper
subset. For example:

dir tt1.txt
02/06/2006 07:05p 10 tt1.txt
1 File(s) 10 bytes

The file tt1.txt contains 10 bytes, but we only see 7 if we type it:

type tt1.txt
abcdefg

because "type" expects the file to be text and not binary and
interprets some of the contents instead of printing them. A dump
reveals why we only see 7:

DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 1A FF FF FF FF FF FF abcdefg...

Bytes 8, 9 and 10 are carriage return (0D), line feed (0A) and EOF (1A).

The EOF character is not strictly required, since the OS knows
there are exactly 10 bytes (the FFs are sector padding bytes not
part of the file).
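
For anyone without a dump utility handy, here is a rough Python
sketch that prints the same kind of view. The function name hex_dump
and the TT1 constant are my own stand-ins, not part of DUMP.EXE, and
unlike the dump above it only shows bytes that belong to the file
(no sector padding):

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII ('.' otherwise)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on short rows.
        lines.append(f"{offset:>4} {hex_part:<{width * 3 - 1}} {text_part}")
    return "\n".join(lines)


# Assumption: the 10 bytes of tt1.txt, copied from the dump above.
TT1 = b"abcdefg\r\n\x1a"

print(len(TT1))
print(hex_dump(TT1))
```

The three dots at the end of the ASCII column are the CR, LF and EOF
bytes, none of which is a printable character.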

But watch what happens when I concatenate two copies together:

copy tt1.txt+tt1.txt tt2.txt
tt1.txt
tt1.txt
1 file(s) copied.
dir tt2.txt
02/06/2006 07:19p 19 tt2.txt
1 File(s) 19 bytes

10 bytes + 10 bytes = 19 bytes ??

A dump reveals what happened:

DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 61 62 63 64 65 66 67
abcdefg..abcdefg
16 0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF FF
....

The terminating EOF of the first copy of tt1.txt was dropped
as part of the concatenation. The OS expects only one (if any)
EOF character per file and it better be the last one.
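
The byte counts can be reproduced with a small Python model. This is
only a sketch of the behaviour visible in the dumps, not the actual
implementation of the copy command; the names text_part and
concat_text_mode are made up for illustration:

```python
EOF_MARK = b"\x1a"  # Ctrl-Z, the EOF character seen in the dumps


def text_part(data: bytes) -> bytes:
    """Return the bytes before the first EOF character (everything if none)."""
    return data.split(EOF_MARK, 1)[0]


def concat_text_mode(*files: bytes) -> bytes:
    """Join the text part of each file, then terminate with a single EOF."""
    return b"".join(text_part(f) for f in files) + EOF_MARK


# Assumption: the 10 bytes of tt1.txt, copied from the first dump.
TT1 = b"abcdefg\r\n\x1a"

tt2 = concat_text_mode(TT1, TT1)
print(len(tt2))
```

Each copy contributes 9 bytes of text, plus one EOF at the very end,
which accounts for the 19 bytes.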

I could simply insert the original EOF back into the file

dir tt3.txt
02/06/2006 07:25p 20 TT3.TXT
1 File(s) 20 bytes
DUMP.EXE version 8-MAR-91
Block # 0 0
0 61 62 63 64 65 66 67 0D 0A 1A 61 62 63 64 65 66
abcdefg...abcdef
16 67 0D 0A 1A FF FF FF FF FF FF FF FF FF FF FF FF
g...

But the OS won't like it:

type tt3.txt
abcdefg

Even though the file is now 20 bytes long, "type" won't go past
the first EOF character.
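
A reader that honours the EOF character can be sketched in a few
lines of Python. The function type_text is a hypothetical stand-in
for what "type" appears to do here, nothing more:

```python
def type_text(data: bytes) -> str:
    """Return the text up to, but not including, the first EOF (0x1A) byte."""
    cut = data.find(b"\x1a")
    visible = data if cut == -1 else data[:cut]
    return visible.decode("ascii")


# Assumption: the 20 bytes of tt3.txt, copied from the dump above.
TT3 = b"abcdefg\r\n\x1a" + b"abcdefg\r\n\x1a"

print(len(TT3))
print(repr(type_text(TT3)))
```

The second "abcdefg" is in the file, but a reader like this never
reaches it.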

The copy command has a binary option that will concatenate
without trying to interpret the contents:

copy /b tt1.txt+tt1.txt tt4.txt
tt1.txt
tt1.txt
1 file(s) copied.
dir tt4.txt
02/06/2006 07:29p 20 tt4.txt
1 File(s) 20 bytes

But that doesn't help the "type" command.

type tt4.txt
abcdefg
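
In Python terms, a binary concatenation is just joining the bytes
untouched. A minimal sketch (concat_binary is my own name for it, and
TT1 is again the 10 bytes from the first dump):

```python
def concat_binary(*files: bytes) -> bytes:
    """Join files byte for byte, with no interpretation of the contents."""
    return b"".join(files)


# Assumption: the 10 bytes of tt1.txt, copied from the first dump.
TT1 = b"abcdefg\r\n\x1a"

tt4 = concat_binary(TT1, TT1)
# Offsets of every EOF (0x1A) byte in the result.
eof_positions = [i for i, b in enumerate(tt4) if b == 0x1A]

print(len(tt4), eof_positions)
```

All 20 bytes survive, but so does the EOF at offset 9, and that is
where a text-mode reader stops.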

These kinds of problems can also occur if you use FTP to send
a binary file in text mode.
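
The usual culprit in a text-mode transfer is line-ending translation.
The sketch below assumes one possible translation (CR LF rewritten to
LF on the receiving side); real clients and servers differ, and the
payload bytes are invented purely for illustration:

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Model of a text-mode receiver that rewrites CR LF pairs as LF."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Model of the opposite translation, LF rewritten as CR LF."""
    return data.replace(b"\n", b"\r\n")


# Hypothetical binary payload that happens to contain 0D 0A and a lone 0A.
payload = bytes([0x00, 0x0D, 0x0A, 0x01, 0x0A])

mangled = crlf_to_lf(payload)
restored = lf_to_crlf(mangled)

print(len(payload), len(mangled), len(restored))
print(restored == payload)
```

The translation shortens the data, and translating back does not
recover the original, because the lone 0A is indistinguishable from
one that used to be part of a CR LF pair.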

So, generally, assuming the content is ok, it's best to never
let a program think a binary file is a text file.


> I guess I just don't see how reading in AF would be slower just because AF
> appears in a text file instead of a "binary" file?
>
> - Arctic
>
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
