Re: Nearest Common Ancestor Report (XDb1's $1000 Challenge)

From: Hugo Kornelis (hugo_at_pe_NO_rFact.in_SPAM_fo)
Date: 05/25/04


Date: Tue, 25 May 2004 01:25:49 +0200

On Mon, 24 May 2004 02:33:10 GMT, Nick Landsberg wrote:

Hi Nick,

(snip)
>
>Thanks Hugo,
>
>The reason why I posed the query is not so much to
>make work for you, but to get a flavor of how the
>two technolgies compare /as they are now/.

Your remark made me curious to find out how my model would hold against
more data. I created a script to generate lots of random data, ensuring
that the hierarchy would remain connected, strictly top-down yet
distributed as random as possible. I think I succeeded (generation script
available, in case anyone is interested).

The setup I used for these tests is:
* SQL Server 2000 SP3a, running as a service on my desktop computer, with
only one active application (Query Analyzer).
* 1.3 GHz Athlon processor / 256 MB RAM.
* Two 40GB 7200 rpm IDE Harddisks. The LOG file for my test DB is on one
of them. The DATA file for my test DB, plus the DATA and LOG file for
tempdb (used by SQL Server to store intermediate result sets) is on the
other harddisk. My Windows page file is on that hard disk as well (though
on another partition).
* I made sure that both the test database (DATA and LOG!!) and tempdb were
large enough to hold all intermittant data and the result. SQL Server will
automatically allocate extra disk space if a database needs more room, but
repeatedly allocating extra disk space is a performance killer (for the
first execution only!!), so I decided to preallocate the space.
* For all tests, I first recreated the tables and procedure, filled the
tables with the original data and then added extra rows in the things and
hierarchies tables.
* The results below show: number of rows in the things and hierarchies
table (indicating input volume), number of rows in ancestors and
NCancestor tables (indicating size of intermediate and output table) and
time elapsed for one execution of the stored procedure.
* I did some minor tweaking to the procedure after carrying out these
tests but before posting Neo the code (see other message). I think that
the improvement is only marginal - at least not enough to warrant
repeating all tests again.

Here are the results I gathered:

things hierarchies ancestors NCancestor elapsed
----------- ----------- ----------- ----------- --------------
18 31 81 153 00:00:00:017
108 210 1307 5778 00:00:00:513
208 505 4566 21528 00:00:02:407
408 1101 16938 83028 00:00:14:377
608 1700 36619 184528 00:00:51:877
1008 2900 94318 507528 00:05:03:577

And two larger tests in a very a-typical setup (running SQL Server in
single-user mode - normally only used for maintenance, but apparantly also
quicker)

things hierarchies ancestors NCancestor elapsed
----------- ----------- ----------- ----------- --------------
1000 2883 94255 499500 00:04:00:250
2500 7381 464258 3123750 01:02:13:580

As you can see, elapsed time grows exponentially as the size of the input
increases.

Can the above performance be improved upon? Yes, of course. If I had
unlimited money to spare, I'd start with the following setup:

* Have SQL Server run on a dedicated server. Enable only the services that
SQL Server needs, shut down everything else.
* Fastest processor currently available. Though SQL Server can use
multiple processors, I don't think they would help for this particular
query (but if money is not an issue, I'd still try, just to be sure).
* As much RAM as the server will hold. Over 2GB is only used by SQL Server
Enterprise Edition, so get a license for that version as well.
* At least five seperate I/O devices: two stripe sets (for test DB Data
and tempdb Data) (the more physical disks in the stripe set, the better),
and three non-striped disks (one for test DB Log, one for tempdb Log and
one for operating system, including pagefile).
* I'd hire a tuning expert. Though I do know a few tricks, I'm definitely
not an expert on fine-tuning SQL Server.
* And I'd buy Joe Celko's latest book on trees and hierarchies in SQL. His
alternative approach I hinted at in an earlier message will not do
unfortunately - I expect it to be blindingly fast, but it can only handle
hierarchies where each thing has at most one parent. But maybe he's got
some other neat tricks in his book that I might want to steal.

Another thing I'd consider is moving some work from the stored procedure
to an insert trigger. I didn't do this for the challenge, of course, but
if this were to become a real production system, I'd definitely try to
find out if insert/update/delete performance for typical daily data entry
tasks would be acceptable if I move some preparations for this report to
the trigger.

But I'm also sure that there's no way that I would ever get this procedure
to perform adequately (unless I'd find a completely different approach in
Celko's book). All performance gains that better hardware can buy are
linear; the performance hit for extra input data is exponential. That
means that there's no way to keep the performance aceptable as the data
continues to grow.

Is this a surprise to me? No, it isn't. Frankly, I've never considered a
relational database as a good tool for storing and traversing trees. There
are many other tools out there more suited for that task.

I never entered this discussion to prove that the relational model is well
suited for tree-structures, because frankly, I think it isn't.I only
entered this discussion because Neo set a challenge and I decided to take
it on - and succeeded.

>
>If Neo's techniques prove out, they may
>be interesting for commercial work in several
>years. One never knows. As I alluded to upthread,
>these discussion remind me a lot about the
>CODASYL vs. RDBMS discussions of roughly 20-25
>years ago. Assuming that Neo's ideas pan out,
>I would not presume that they would supplant
>traditional RDBMS's, just as RDBMS's have
>not completely supplanted hierarchical and
>network models. There is more than one way
>to skin a cat, and you use the best tools
>available depending on the size of the cat :)

Agreed. There is no "one size fits all" in our industry.

>
>You and Neo have had an interesting discussion.
>And, what is rare on usenet, haven't let the
>discussion degenerate into a flame war.
>I congratulate you both!

Thanks, Nick!

Best, Hugo

-- 
(Remove _NO_ and _SPAM_ to get my e-mail address)


Relevant Pages

  • Re: SQL Server 2000 - Hierarchical data - Versus - Post Relational Databases
    ... SQL Server doesn't handle hierarchies out of the box. ... The database is storing hierarchical-based data from Exchange Server, ... Messages are stored in Folders, and so the Folders Table represents another ...
    (microsoft.public.sqlserver.programming)
  • Re: multiple hierarchies of a dim in the same cube?
    ... multiple hierarchies can be defined for the attributes of a dimension. ... >From SQL Server 2005 BOL: ... attributes are arranged in hierarchies that provide the ... hierarchically organize the members of a dimension. ...
    (microsoft.public.sqlserver.olap)
  • Re: Tracking Record Lineage, Family, etc.
    ... It is indeed a common issue, as you are modeling a hierarchy. ... There are several models commonly used for representing hierarchies, ... "Joe Celok's Trees and Hierarchies in SQL for Smarties". ... My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis ...
    (comp.databases.ms-sqlserver)
  • Re: can access do Hierarchical queries
    ... Dim con As Object ... Show Expanding Hierarchies by Using SQL Server ...
    (microsoft.public.access.queries)
  • Re: CONTAINS performance
    ... That said, and with the query plan, I can start to give you more ... relational join in the context of the free-text optimization, ... SQL Server tables. ...
    (microsoft.public.sqlserver.fulltext)