Re: Scanf and number formats

From: Walter Roberson (roberson_at_ibd.nrc-cnrc.gc.ca)
Date: 03/14/05


Date: 14 Mar 2005 17:25:03 GMT

In article <d14c1v$ogl$1@news-int.gatech.edu>,
Vig <gtg121p@mail.gatech.edu> wrote:
:Also, I cannot directly replace an e with a d
:because Scientific notation is usually written as 0.123456e+01 while d is
:1.23456d0 (I am not completely sure, which is why I want C to handle it all
:for me :) )

On output, with C's e format, the argument

  is converted to the style [-]d.ddde+dd, where there is one digit
  before the decimal-point character (which is nonzero if the
  argument is nonzero)

On input, any number of digits is accepted before the decimal point,
and the sign after the 'e' is optional. Thus, 0.123456e+01
and 1.23456e0 are equivalent [except perhaps in the last bit or two
when one is at the limit of precision.]
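
A small sketch of the input and output behaviour described above. The
helper names parse_real and format_e are made up for this example; only
strtod() and snprintf() come from the standard library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a decimal floating-point literal with strtod().
   Returns 0.0 if nothing could be converted.
   (parse_real is a hypothetical helper, not a library function.) */
double parse_real(const char *text)
{
    char *end;
    double value = strtod(text, &end);

    return (end == text) ? 0.0 : value;
}

/* Format value with the e conversion into out.
   Returns whatever snprintf() returns (the length that was needed). */
int format_e(char *out, size_t outsize, double value)
{
    return snprintf(out, outsize, "%e", value);
}
```

Calling parse_real("0.123456e+01") and parse_real("1.23456e0") should give
the same value to within the last bit or two, and format_e() always writes
exactly one digit before the decimal point.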

:Almost everything we read from files are numbers. Currently, it is scanned
:with a %lf unless otherwise specified. If we are to handle the problem of
:the 'd' that would mean almost multiplying our time for reading even good
:files without d's by 3.

No, that doesn't follow. The time required to read data from a file is
largely dominated by the disk I/O rate... modified by operating
system predictive reads, direct I/O or not, DMA block size, SCSI
Command Tag Queuing (CTQ), ability of the OS to flip a DMA page
directly into user space without having to copy it, and so on.

When you use scanf(), then unless you have specifically turned off
buffering, the C I/O library will usually [though the standard does not
promise it] fill a block from the I/O subsystem (or I/O cache),
putting the block into your memory space; the block size is often
8 KB. Once the block has been read in, scanf() is really just
reading the data from memory, as if it were using getc() to fetch
each character. [It has to be that way because you are allowed
to mix getc() and scanf(), so they both have to read from the
same input buffer, and it usually isn't worth duplicating the
logic.] getc() is usually a macro that works with the FILE
structure.
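
To illustrate the point about mixing getc() and scanf() on one stream,
here is a minimal sketch. The function name read_tagged_value and the
"tag character followed by a number" input layout are assumptions made up
for the example, not anything from the original question.

```c
#include <ctype.h>
#include <stdio.h>

/* Read one non-blank tag character with getc(), then a number with
   fscanf(), both from the same stream and hence the same buffer.
   Returns 1 on success, 0 at end of input or on a malformed number.
   (read_tagged_value is a hypothetical helper for this sketch.) */
int read_tagged_value(FILE *in, int *tag, double *value)
{
    int c;

    /* Skip white space one character at a time. */
    do {
        c = getc(in);
    } while (c != EOF && isspace(c));
    if (c == EOF)
        return 0;

    *tag = c;
    /* fscanf() continues from wherever getc() stopped. */
    return fscanf(in, "%lf", value) == 1;
}
```

Given the input "a 1.5" followed by "b 2.25e1", two calls would return
the pairs ('a', 1.5) and ('b', 22.5); neither call loses characters that
the other one has already buffered.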

The slow part of reading is getting the data from disk to your
program the first time; once there, you could examine the data a
number of times before the next batch was ready. For example, if your
disk subsystem is Fast Wide SCSI-2, your disk might be limited to
20 megabytes per second; on a 2 GHz CPU, you could run 100
cycles per character and still keep up with the disk.
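
The arithmetic behind that figure, as a tiny sketch (cycles_per_byte is a
name invented here; it assumes one character per byte and ignores
everything else the CPU has to do):

```c
/* CPU cycles available per input byte when the disk delivers
   bytes_per_second and the CPU runs at clock_hz.
   (cycles_per_byte is a hypothetical helper for this sketch.) */
double cycles_per_byte(double clock_hz, double bytes_per_second)
{
    return clock_hz / bytes_per_second;
}
```

With the numbers quoted above, cycles_per_byte(2e9, 20e6) comes out at 100.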

If you are sufficiently starved for CPU resources that
doing a quick scan-and-replace over the buffer is slowing you
down, then you should probably already have done a bunch
of work on custom I/O (e.g., using "real time" partitions,
using a raw partition instead of a block device, using
scatter-gather buffering, using any available O/S
facilities to bypass caching; ensuring your input data
is always a multiple of an I/O page and always reading
in full blocks instead of going through the per-character
end-of-buffer checks imposed by getc().) You should not
presume that a simple scan over the buffer will prove
to be the limiting speed factor on your program: it
probably won't.
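
For what it is worth, a quick scan-and-replace of that kind might look
like the sketch below. The function names fix_d_exponents and
parse_numbers are invented for the example, and it assumes the buffer
holds nothing but numeric fields and white space: any 'd' or 'D' in
other text (or in a hexadecimal constant) would be clobbered as well.

```c
#include <stddef.h>
#include <stdlib.h>

/* Replace Fortran-style exponent markers ('d' or 'D') with 'e', in place.
   ASSUMPTION: buf contains only numeric fields and separators.
   Returns the number of characters replaced. */
size_t fix_d_exponents(char *buf, size_t len)
{
    size_t replaced = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == 'd' || buf[i] == 'D') {
            buf[i] = 'e';
            replaced++;
        }
    }
    return replaced;
}

/* Parse up to max white-space-separated numbers from a NUL-terminated
   buffer using strtod(). Returns how many values were stored in out. */
size_t parse_numbers(const char *buf, double *out, size_t max)
{
    size_t n = 0;
    const char *p = buf;

    while (n < max) {
        char *end;
        double v = strtod(p, &end);

        if (end == p)        /* nothing more could be converted */
            break;
        out[n++] = v;
        p = end;
    }
    return n;
}
```

The replacement pass is one comparison per character, so it costs only a
few cycles per byte; files that contain no d's at all pay the same small
price and nothing more.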

Speaking of limiting speed factors: consider having a
pre-pass program that does nothing other than reading in
the data and converting it to binary and storing the
binary as a file with fixed length records. Such a program
could probably run asynchronously with whatever calculation
you are doing -- and if you are reading the input file
multiple times in different programs, you will have
saved having to convert the ASCII multiple times.
You will also get about a 3:1 reduction in file size by converting
the input to binary, depending on how many digits the text form carries.
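
A minimal sketch of such a pre-pass, under some assumptions of mine: the
input is white-space-separated numbers, each record is a single raw
double, and the binary file is read back on the same kind of machine
that wrote it (raw doubles are not portable across differing byte orders
or floating-point formats). The names text_to_binary and read_record are
hypothetical.

```c
#include <stdio.h>

/* Read white-space-separated numbers from a text stream and write each
   one to out as a raw double (one fixed-length record per value).
   Returns the number of records written, or -1 on a write error. */
long text_to_binary(FILE *in, FILE *out)
{
    long count = 0;
    double value;

    while (fscanf(in, "%lf", &value) == 1) {
        if (fwrite(&value, sizeof value, 1, out) != 1)
            return -1;
        count++;
    }
    return count;
}

/* Fetch record number index (counting from 0) from the binary file.
   Returns 1 on success, 0 if the record could not be read. */
int read_record(FILE *bin, long index, double *value)
{
    if (fseek(bin, index * (long)sizeof *value, SEEK_SET) != 0)
        return 0;
    return fread(value, sizeof *value, 1, bin) == 1;
}
```

Because every record has the same length, any later program can seek
straight to record N without parsing the N-1 values in front of it.
Both streams should be opened in binary mode on systems where that matters.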

-- 
Any sufficiently old bug becomes a feature.

