Re: teaching a child - console or GUI

From: J French (erewhon_at_nowhere.com)
Date: 07/29/04


Date: Thu, 29 Jul 2004 08:17:55 +0000 (UTC)

On Wed, 28 Jul 2004 15:23:12 +0000 (UTC), Marco van de Voort
<marcov@stack.nl> wrote:

>On 2004-07-28, J French <erewhon@nowhere.com> wrote:
>
>>>>
>>>> What bit of it does not perform
>>>
>>>Unfinished sentence on my part, I think. I think I meant that the
>>>application doesn't perform int 21h's, or any other legacy technique.
>>
>> I understood you meant 'perform' in the sense of work well enough
>> - eg: too slow

>Definitely not :-)
Uh ...

>>>> - or rather what is it that makes it slow if you keep most of the data
>>>> on disk ?
>>>
>>>Filters/queries over 6 million objects (in 8 entities/tables) must be in an
>>>acceptable time. Multiple users might run queries at the same time, but not
>>>many (say 1-4 users)
>>
>> I can see that, however one can make a system 'learn'
>> eg: a simple query can store its results in a BitMap that it slaps to
>> disk under a file name that is ... well the query
>
>Sure. But the queries here are not likely to reoccur before the next update.

Right

>> It does not need to be bitmaps, sparse results (eg: all 2003 records)
>> can be just a list of 4 byte pointers
>
>This could be done for us too. Keep in mem even. However my own situation
>doesn't benefit from this.

Not even if you can 'add' known data sets together ?
eg: All 2004 transactions of Corporate clients sorted by Alpha
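
A minimal sketch of 'adding' cached result sets together, in Python rather than Delphi. It assumes each stored query result is a bitmap with one bit per record number, and that a pre-built alphabetical sort order exists; the record numbers and names here are made up for illustration.

```python
# Hypothetical sketch: each cached query result is a bitmap, one bit per record.
# A Python int is used as an arbitrarily long bitmap.

def make_bitmap(record_ids):
    """Set bit i for every record number i in record_ids."""
    bits = 0
    for rid in record_ids:
        bits |= 1 << rid
    return bits

def bitmap_to_ids(bits):
    """Return the record numbers whose bit is set, in ascending order."""
    ids = []
    rid = 0
    while bits:
        if bits & 1:
            ids.append(rid)
        bits >>= 1
        rid += 1
    return ids

# Made-up cached results for two simple queries.
year_2004 = make_bitmap([1, 3, 4, 7, 9])
corporate = make_bitmap([0, 3, 7, 8, 9])

# Combining known sets is a single AND (or OR) over the bitmaps.
both = year_2004 & corporate
either = year_2004 | corporate

# Made-up pre-sorted order: record numbers in alphabetical order.
alpha_order = [9, 0, 7, 4, 3, 1, 8]
sorted_hits = [rid for rid in alpha_order if (both >> rid) & 1]
```

The sort costs nothing at query time in this sketch, because the alphabetical order was prepared in advance and is only filtered by the combined bitmap.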

>> A place I once worked used to thrive on selling databases to financial
>> institutions (they still do) and we developed a whole load of ways of
>> accelerating searching and sorting
>
>Why bother? Taking an RDBMS should make things easier, not more difficult.

I am not proposing an RDBMS
- I'm not convinced they make things faster
- mostly they just save time through working on the server rather than
passing gigs of raw data through the network

>>>Also note that 6 million objects are a lot more lines in a RDBMS, because you
>>>have all kind of coupling tables for 1 to many relations.
>>
>> Are these objects very complex, or are they really a bunch of pointers
>
>pointers, strings, dates,

Right ...

>>>The use of indexes is limited, or you really need a lot of indexes.
>>
>> At its simplest an 'index' is only a list of sorted 4 byte pointers
>> One can hold a lot of those on disk ...
>
>And loading them in a system under load (with constant disk io) is worse
>than the _real_ querytime in our system.

surprisingly little, because one is doing very few large disk reads
rather than thousands of small disk reads
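
As an illustration of an index being 'a list of sorted 4 byte pointers' that is loaded with one large read, here is a small Python sketch. The file name, the little-endian layout and the sample records are assumptions for the example, not details of either system discussed here.

```python
import os
import struct
import tempfile

def write_index(path, record_numbers):
    """Store the index as consecutive 4-byte little-endian unsigned ints."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dI" % len(record_numbers), *record_numbers))

def read_index(path):
    """Load the whole index with a single large read, then unpack it."""
    with open(path, "rb") as f:
        raw = f.read()
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

# Made-up records; the index lists record numbers in alphabetical order.
records = ["pear", "apple", "fig", "cherry"]
alpha_index = sorted(range(len(records)), key=lambda i: records[i])

path = os.path.join(tempfile.mkdtemp(), "alpha.idx")  # hypothetical file name
write_index(path, alpha_index)
loaded = read_index(path)
index_bytes = os.path.getsize(path)  # 4 bytes per record
```

At 4 bytes per entry, an index over 6 million records is about 24 MB per sort order.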

>Keep in mind that we keep the total mem<->disk bandwidth free. (except
>for some really minor logging/spooling). In a RDBMS this bandwidth is already
>under stress.

Yes - almost a diskless system

>Implementing tricks to make a RDBMS compete with an in-mem solution is not
>smart, since similar tricks benefit the in-mem solution too (and usually
>more)

Sure they do - however since memory is finite ....

>>>No, 5 years max. And even that is unlikely. They probably could do with 2,3
>>>years real data, and global stats from the other 2,3 years. But it is not
>>>worthwhile to code that.
>>
>> They could just pull up the 5 years minus system from a DVD
>> I also 'purge' my files on my main system, in the knowledge that the
>> clients have numerous archive data sets
>
>Yes, of course the data is retained, but it is no longer online. Moreover,
>the main app is single .exe, and the CPU power needed is +/- 2GHz. Memory
>req is now 800 MB, but that grows 300MB/year.

Sounds pretty much like home user kit !

>Having a temp machine for some special purpose (e.g. analysis by a trainee)
>only requires adding some memory to some old server or workstation and copying
>data + server .exe on it.

>>>However that 300MB/year figure and the five year figure is the
>>>current situation.
>>
>> Any chance of it going wild ?
>
>Not in the coming 1/2 years. The system is built to scale to 64-bit. Even
>if it contained all of Holland, 16GB would do the trick.

Right....

>Moreover the data is quite partitionable, so a cluster solution is also
>possible (though not preferred)

You mean cluster of PCs - yes I also wondered about that
Personally, if going down that route, I would have one for preparing
an ordered list of selected records, and another for pumping the data
back to the client

>>>We didn't know the exact sizes and date ranges yet when we made the
>>>decision. At the time we were afraid of hitting the Delphi limit of 2-3 GB.
>>>It was more 780MB/year in the initial version, and application memory
>>>on top of that. Improved indexing, and packing some data decreased the
>>>size of the data.
>>
>> eg: trashing white space and tokenizing longer fields I guess
>
>Pretty much half of it yes, whitespace, tokenizing, string2datetime etc.

Right, I wonder whether you have looked into replacing the string
system - I should imagine the data is pretty repetitive

>Rethinking the container system was the other half.

>> It sounds as if you have some raw files that you crunch and slap into
>> RAM - rather like building a CD 'database'

>We start from .DBFs of the old system. What is a CD database according to
>you?

To me a CD database is a collection of R/O files that have been
heavily pre-processed so that one has numerous sort orders stored as
lists of pointers on disk, extract files of frequent search fields in
a normalized format .... basically any trick to make searching and
sorting a matter of adding/removing/merging pre-formed sets of data
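
A rough Python sketch of what 'merging pre-formed sets' can look like when the sets are sparse, sorted lists of record numbers rather than bitmaps. The function names and the sample lists are invented for the example.

```python
import heapq

def intersect_sorted(a, b):
    """Record numbers present in both sorted lists, found in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union_sorted(a, b):
    """Record numbers present in either sorted list, without duplicates."""
    out = []
    for x in heapq.merge(a, b):
        if not out or out[-1] != x:
            out.append(x)
    return out

# Made-up pre-formed sets, each already sorted by record number.
set_2004 = [1, 3, 4, 7, 9]
set_corporate = [0, 3, 7, 8, 9]

common = intersect_sorted(set_2004, set_corporate)
combined = union_sorted(set_2004, set_corporate)
```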

>>>> Yes, well as we get older, we get craftier
>>>> Maybe it is time to look at it again
>>>
>>>I had to be convinced too. (by my colleague). But now I have done a few
>>>projects with the mentality, I wonder why I never saw it myself.
>>
>> You mean that you had to be convinced of the 'CD in RAM' approach ?
>> It kind of makes sense, but it /is/ possible to emulate that with very
>> little speed hit by making the App think that it is just using RAM,
>> while it is really using a large 'window' onto a file
>
>You mean memory mapping ?

By you - not by the machine
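
One way to read 'a large window onto a file', managed by the application instead of by the operating system, is sketched below in Python. The class name, the window size and the reload policy are all assumptions made for illustration; a real system would tune these.

```python
import os
import tempfile

class FileWindow:
    """Application-managed window: callers read by offset as if from RAM."""

    def __init__(self, path, window_size):
        self._f = open(path, "rb")
        self._size = window_size
        self._start = 0      # file offset of the first byte held in the buffer
        self._buf = b""
        self.reloads = 0     # how often the window had to be moved

    def read(self, offset, length):
        end = offset + length
        # Move the window only when the request falls outside it.
        if offset < self._start or end > self._start + len(self._buf):
            self._f.seek(offset)
            self._buf = self._f.read(max(self._size, length))
            self._start = offset
            self.reloads += 1
        rel = offset - self._start
        return self._buf[rel:rel + length]

    def close(self):
        self._f.close()

# Made-up data file of 256 bytes, viewed through a 64-byte window.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)))

win = FileWindow(path, window_size=64)
first = win.read(10, 4)     # loads the window starting at offset 10
second = win.read(20, 4)    # served from the same window, no disk access
third = win.read(200, 2)    # outside the window, so it is moved
win.close()
```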

>> After all, your current system must be paging memory in and out, even
>> if you have a heavily chipped up 'data server'
>
>Nope. Mem costs Eur 250/GB. We simply bought 2 GB. That's less than a
>programmer costs a week.

I sincerely hope so

>>>We are still thinking in a DOS way about memory I think. Memory as precious
>>>resource. It is a commodity now.
>>
>> You mean you are, or I am, or both of us ?
>
>Programmers in general.

I'm not so sure from looking at the horrors some coders come up with,
but even so, I do prefer to be mean with memory.

>> I agree it is a commodity, but if anything, that is the problem
>> It stops people looking at the underlying data structure
>
>The underlying datastructure is what is in memory. Trying to stuff it
>in a RDBMS, and then making it more complex is what is unnatural.

I really was not advocating a conventional RDBMS

>> I can, for example, see that if your data is what I think it is, then
>> you could have multiple copies of the relevant stuff sorted in
>> different order.
>
>Sure. But there are a lot of crazy optimisations that one could do. The
>fundamental question remains. Why would I ?

It could improve performance several hundred fold
Just using a BChop on a sorted list is many times faster than
sequentially scanning a list
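
To put a number on that, here is a small Python sketch of a binary chop that counts its probes. The key list is made up; the point is only that a million sorted keys need at most 20 probes, where a sequential scan needs half a million comparisons on average.

```python
def bchop(keys, target):
    """Binary chop over a sorted list.

    Returns (position, probes); position is -1 when target is absent.
    """
    lo, hi = 0, len(keys)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(keys) and keys[lo] == target
    return (lo if found else -1), probes

# Made-up sorted key list: one million even numbers.
keys = list(range(0, 2_000_000, 2))

pos, probes = bchop(keys, 1_234_566)
missing_pos, missing_probes = bchop(keys, 7)
```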

>I only have to make sure that mutations are journaled and flushed. I don't
>need complex transactional support.

Yes, I had figured that

>> I really would not use a 3rd party RDBMS for anything (unless
>> seriously well paid) however building ones own filing system is not
>> particularly hard, and is rather interesting
>>
>> Your system has rather caught my interest, probably because it sounds
>> similar to problems I've worked on in the past.
>>
>> I really do believe that the key to speed is algorithms, not RAM
>
>That's a common mistake. It is an equation, and algorithms is a variable
>in that. Language, compiler speed, hardware are all variables too.

True - but the wrong algorithm can have a dramatic effect

>If your algorithms totally suck, it is the limiting factor sure. But it
>makes no sense to build an own custom system, while one week's wages can
>pay for the hardware to run it.

Yes - but once you have the hardware it seems to make sense to get
things going faster

>Of course a fundamental difference is if you sell the same product again and
>again, or, like me, use a single instance only for an inhouse job.

Hmm... perhaps ... not sure
- after all speed /is/ a major factor for you

I was interested in several things you mentioned, the Strings are
rather interesting - I'm assuming AnsiStrings here, not what are
effectively arrays of Chars.

From digesting data in the past I have generally found that 'String
Fields' tend to be very repetitive, and that one is often better off
just having a 4 byte pointer into a 'Lexicon'
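
A minimal Python sketch of the 'Lexicon' idea: each distinct string is stored once, and the records hold only a small integer token that fits in 4 bytes. The class, its method names and the sample field values are hypothetical.

```python
import struct

class Lexicon:
    """Stores each distinct string once and hands out integer tokens."""

    def __init__(self):
        self._ids = {}       # string -> token
        self._strings = []   # token -> string

    def intern(self, s):
        tok = self._ids.get(s)
        if tok is None:
            tok = len(self._strings)
            self._ids[s] = tok
            self._strings.append(s)
        return tok

    def lookup(self, tok):
        return self._strings[tok]

# Made-up, repetitive string field taken from five records.
cities = ["Amsterdam", "Utrecht", "Amsterdam", "Amsterdam", "Utrecht"]

lex = Lexicon()
tokens = [lex.intern(c) for c in cities]

# Each record now carries a 4-byte token instead of the string itself.
packed = struct.pack("<%dI" % len(tokens), *tokens)
restored = [lex.lookup(t) for t in tokens]
```

Comparing two fields for equality also becomes an integer compare in this scheme, since equal strings always share a token.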

Another thing I think I mentioned earlier is that 'RAM' devices are
getting very large and very cheap, one could literally stick a few
Memory Stick devices into some USB ports and get a vast amount of very
fast 'near RAM'

Sounds an interesting system
- thanks for putting up with my curiosity


