Re: DBMS and lisp, etc.

From: Will Hartung (willh_at_msoft.com)
Date: 05/21/04


Date: Fri, 21 May 2004 14:08:09 -0700


"Chris C apel" <chris@ibanktech.net> wrote in message
news:69d0f981.0405201630.5d97a155@posting.google.com...

> That's a great point. I'm imagining a sort of data access layer where
> I have around one function for every different time I access my data,
> tailored to that specific instance. Otherwise, it's too easy to write
> a bunch of data access functions that do small pieces and then try to
> combine multiple calls to these methods. That's, as you say, a *bad
> thing* when dealing with RDBMS's. But with (almost) one function for
> each data access point, I force myself to think about how I can get my
> data over the line in the most efficient way possible. Is this a good
> design principle for DAL's in general?

First, ANY DAL is better than none. It's one thing to have SQL bundled in
your code, but it's another to have it bundled in your logic. Specifically,
having at least 1 level of abstraction between how you get your data, and
where you use your data is, I think, important, as it gives you that much
more of an opportunity to refactor things later compared to having the
access directly in your logic.

i.e.

(let ((orders (sql-results (format nil "SELECT * FROM ORDERTABLE WHERE
CUSTCODE='~A`" custcode)))
   ....)

is worse than

(let ((orders (get-orders-for-cust custcode)))
  ...)

Even if get-orders-for-cust is a simple wrapper.

In Days Of Yore, with ISAM style databases, this was the typical paradigm
for processing.

use OrderCustomerIndex
Find Order 'CUST001'
Fetch Order
while(order.custcode = 'CUST001')
    use OrderDetailOrderIndex
    Find OrderDetail order.orderno
    Fetch OrderDetail
    while(orderdetail.orderno = order.orderno)
        use ItemItemCodeIndex
        Find Item orderdetail.itemcode
        Fetch Item
        ...
        Fetch Next OrderDetail
    end while
    Fetch Next Order
end while

So, if you had 10 orders with 5 lines items each, this code would "hit" the
DB library 100 times (10 for each order, once for the OrderDetail and once
for Item data).

When DB calls are "cheap", this kind of thing isn't that expensive (since
every DBMS (Relational and Object) on the planet essentially has an ISAM
(B-Tree) core, they do this kind of thing all day long).

So, if you wrote your code with a DAL, you could easily come up with
something like this:

(defun process-orders (cust-code)
  (let ((orders (find-orders cust-code)))
    (dolist (order orders)
      (let ((details (find-order-details (order-no order))))
        (dolist (detail details)
          (let ((item (find-item (detail-item-code detail))))
            (do-stuff order detail item)))))))

But this is a trap, and this is where THE problem occurs with most any
relational mapping process. Naively implemented with SQL, again for 10
orders with 5 line items each, you end up with SIXTY(!!!) SQL hits to the
database. (1 query for the initial orders, 1 query for each order for its
detail, 1 query for each detail for the item data). This is Bad. Round trips
to the DB are SLLOOWWW, and this will KILL your performance.

Instead, you want to suck in the entire graph with a single query:

SELECT orders.*, orderdetails.*, items.*
FROM orders, orderdetails, items
WHERE orders.custcode = ?
AND orderdetails.orderno = orders.orderno
AND items.itemcode = orderdetails.itemcode

Then, you use that to build your graph so you can do this:

(defun process-orders (cust-code)
  (let ((orders (find-orders cust-code)))
    (dolist (order orders)
      (let ((details (order-details order)))
        (dolist (detail details)
          (let ((item (detail-item detail)))
            (do-stuff order detail item)))))))

And THAT's all well and good, but say you wanted to get the total of all
canceled orders:

(defun canceled-orders-total (cust-code)
  (let ((orders (find-orders cust-code))
        (total 0))
    (dolist (order orders)
      (when (equal (order-status order) "C")
        (setf total (+ total (order-total order)))))
    total))

This is fine, but you just sucked the bulk of ALL of the orders, including
their detail and item information, to get at a simple header field.

Now, certainly this could have been done with a simple query as well:

SELECT sum(total) FROM orders WHERE status = 'C' AND custcode=?

The point is that balancing the granularity of what your DAL does, and how
it does it, needs to be determined up front, at least in a rough edged way.
This is what I mean by being aware that an RDBMS is going to be your
eventual persistent store. If you don't assume that, you may very well end
up with a lot of code like the first example that will immediately suck as
soon as you upgrade to the SQL database.

But it's all a trade off. For example, the canceled-orders-total function.
Sure it slurps in far more information than it needs, but on the other hand,
it's called once a night, at 2am for a status report, so..who cares? Cue the
"Curse of early optimization" thread.

True story, we had a client that would run a Account Receivables Aging
report as part of their End Of Day processing. They watch it run for 45
minutes, then wait for the 150 page printout, grab it off the printer, go to
the LAST page and grab the total outstanding dollar amount, then toss the
report. That's an extreme example of this kind of balance. All they wanted
was that last number (which was a 10 second query to the DB, vs 45 minutes
and a small rainforest). It's also the kind of thing that runs great with 10
rows in the DB, but is just horrible in production. This was simple misuse
of the report, but I can guarantee you that if this was run nightly, and
then consulted in the morning (vs a barrier to closing the daily
operations), they'd have never complained about this, and we would have
never given them a better option.

That's the game of balancing abstraction, functionality and performance in
the world of data persistence.

> > You don't need a "generic OO -> Relational" mapping system. Just a layer
> > that separates your data from the back end, and unless you have 1000
tables,
> > you can write this yourself, by hand. By hand you can specialize your
> > queries with a lot less complexity than using an OR mapper.
>
> Agreed. I do have a question about data representation, though. While
> it's tempting to just have references to related objects as sublists
> in my data structure, in the DBMS they're going to be keyed and ID'd.
> Is it OK, practical, and good practice to hide this in the data layer,
> with the application code treating the references as transparent? I
> would tend to think so, but, as I've hinted, either SLIME or SBCL in
> combination with TBNL (and maybe Emacs, too) seem to have problems
> with my circular references (regardless of *print-circle*, it tries at
> random to print out my structure at the SLIME REPL). The alternative
> would be to use a hash table and ID generator for my "tables".

I like Object Graphs that are as pre-populated as necessary for the task,
and part of the DAL is to manage those relationships for me when the data is
persisted.

Depending on the system and the platform, object sharing of data with the
same identity may or may not be an issue.

If I have two complete Order objects, both for the same Customer, but
fetched at a different time:

(let ((orderA (find-order 1234))
      (orderB (find-order 3456)))
  ...)

The question is how important is this concern:
(eq (order-customer orderA) (order-customer orderB)) vs
(equal (order-customer orderA) (order-customer orderB))

Most folks don't really have an issue with the two Customers in this case
having the same data, but not sharing identity. On the other hand, some
systems want to make sure that if someone updates the Customer Address,
every instance is updated as well. Depends on the application of course.

If you use actual CLOS objects vs plain lists, you can control how your
items print out, and how much detail you'll see for any particular instance.
There are also means to not print entire huge lists as well.

> This train of thought leads me to believe it would be difficult to
> write a /generic/ data access layer that would be significantly less
> complicated than the equivalent SQL in any case, except for
> marshalling issues, and still perform acceptably under most real-world
> conditions.
>
> And all that seems like a shame, because data access seems, on its
> face, to be something that could be macro-ized and library-ized to a
> huge extent.

Yes, you'd like to think so. To a certain level, it can be macro'd and
library'd to make it easier, but at that point you're still dealing with
SQL, and not necessarily your internal application datastructures.

Managing that interface between application data and the database is the
trick, and you'd like to think that IT could be macro'd and library'd
easily. But the dark side is the nature of SQL databases really throws a
wrench in the works.

A simple example from our current application.

We store a lot of tagged data in our DB, identified by a string Alias. The
detail is that we keep the entire history of that data in the DB. So, if you
update a value 5 times, there's 5 rows in the database.

Typically we need "the latest" value for several of these aliases. As a
generic SQL query, getting "the latest" is, essentially, a horrible query:

SELECT * FROM Table WHERE alias = 'Alais' and date = (Select max(date) from
table where alias = 'Alias')

But, we also tend to need "the previous value" as well (which would be the
one before the latest). Also, in one part of are system, we'll want several
hundred of these values. Which pretty much means:

SELECT * FROM Table WHERE alias = 'Alias' ORDER BY date DESC

You'll note that this in fact potentially reads ALL of the historical data,
even though in fact we only want the second row, particularly if DATE is not
indexed properly. Not a problem with 5 rows. A real problem with 5000. So,
somewhere between 5 and 5000 Happiness converts to Blood Boiling Rage.

With our current data load, it's is basically "cheaper" for us to load up
essentially the entire history for each data item, and toss most of it away
in the program.

e.g. SELECT * FROM Table where alias in ('Alias1','Alias2', ..., 'Alias100')

Even though we may well be only interested in 10% of the final data, it is
cheaper for us to throw data away than to send multiple queries to the
database. You can imagine how unintuitive that seems, but when a large part
of the cost of getting the data is the query itself, moreso than the actual
data it returns, you can see how this complicates the problem significantly.
As we get more and more data, though, we'll need to be smarter about this.

So, you have O/R tools that spend a lot of time on transactions, caching,
pre-fetching, lazy loading, backfilling writes, stale data, etc. Add in the
need to Join tables, and the problem space becomes very large. Dealing with
managing the granularity of the data access to the back end (as coarse as
possible) with the granularity of the data access at the front end (as fine
as possible) is quite a challenge.

> For instance, in my day job, I find myself dealing with a
> huge program (for what it is) whose job is simply to manipulate the
> contents of five simple tables. Though there's probably a better way,
> (especially in C#, with its datasets and SqlBuilders and whatnot) I
> find myself writing a function that manually generates the SQL for
> each changed row. And it's broken, too, because the INSERTs need to go
> in order of table A, B, C, D, E, while the deletes need to go in E, D,
> C, B, A, something I didn't realize until today. That means iterating
> twice over each table-cache. I don't know that /I/ could do much
> better if I were faced with the same task in Lisp. But faced with such
> large amounts of repetition, one really does have to wonder how much
> of it would be necessary with an optimal database library.

The key here is to realize that you consider things like iterating over your
table-cache as "expensive". Compared to a hit against the DB, odds are
pretty good, they're not even on the same orders of magnitude. Iterate away,
and make it perfect for the DB.

> > Finally, you'll need some way to filter and order your data. As you can
> > imagine, it's fairly easy to simply use hash tables to get basic
references,
> > but one of the features of the relational model is being able to query
and
> > order on arbitrary data elements, including those that are not indexed
at
> > all. Now, you may not feel you'll need to do that at all, but it's a
shame
> > to toss out a real benefit of the DBMS and design around it at this
point.
> > Kind of depends on your data model.
>
> So you're saying what here? That my data access layer should include a
> way to order the data that translates well into SQL?

If you want to leverage that capability from the DB, filtering and
sequencing, absolutely. Mind, that doesn't necessarily mean you need to
expose the filtering and ordering to you DAL clients, you can use simple
application specific options instead.

(find-orders-for-cust-code cust-code :canceled t)

vs

(find-orders-for-cust-code cust-code :with-filter '(= order.status "C"))

The first lets the DAL interpret "canceled = T" however it wishes, whereas
the second is more "generic", and perhaps exposes more about the data model
that may well change later.

The key is that the DB doesn't have to marshal the data in order to filter
and order it, whereas you do. Since the cost is typically in the query
itself and the marshalling, you may as well let the DB do what it does best
before you incur that cost. Doing stuff in your code that the DB does better
and cheaper seems excessive, but it certainly makes your queries more
complicated. Odds are for many queries, the runtime is going to be similar
with or without filtering and ordering (of course there are exceptions, but
that's the dark art of SQL optimizations), so since you "get it for free",
no reason for you to do it yourself.

Its a real puzzle. On the one hand I say that "the DB is evil. Treat it like
City Hall. A nice thing to have, but don't go there if you don't have to",
but then I assert that you should let the DB do what it does best.

If your DB is small enough (and will ALWAYS be small enough) that it can fit
into RAM, is cheap enough to load on startup, and you simply need to persist
the changes (ala the "Prevalance" concept), then you can approach your data
model in a much more "native" fashion (lists of maps of lists). The downside
being that as soon as you run out of RAM, all of your models break, and
break hard. That's why the assumption about using an RDBMS needs to be made
up front, and kept in your mind throughout the entire project. If you're
NEVER going to do that, the data will never be that big, and basically
you'll never have to really worry about querying the database dynamically on
the fly (vs, say, simply using it as a persistent store incrementally and
then loading from it at startup), then you can ignore the way an RDBMS
works. But if having more data than RAM is a very real possibility, design
up front with this in mind.

So, after all that rambling, the goal of the DAL is to enable you as a data
user to have as fine and simple access to the data as practical.

I'd love to be able to do stuff like this all day long:

(defun update-order-total (orderno)
  (let ((order (find-order orderno)))
    (when order
      (setf (order-total order) 0)
      (dolist (detail (order-details order))
        (incf (order-total order) (detail-line-total detail))))))

If that was all in memory, that's all I would need to do and I'd be done.
But think through all that is going on here, in this simple example, and
toss in persistence, locking, transaction boundaries, minimizing database
access yet ensure data integrity and cache coherency, and this problem
becomes MUCH more difficult.

How much of the order do I really load in get-order? When do I load the
order-details? How do I manage the change to order-total? What if this order
is locked? If this loop fails half way through, what is the actual value of
the order-total? Was it stored in the DB?

Consider the (incf (order-total order) (detail-line-total detail))
expression. I can easily see a data layer hitting a db, or other persistent
change mechanism, for each change to order-total. And we'd all agree that
would be terrible (so, Don't Do That, use a temp variable, whatever). The
point being that there is nothing wrong with this code, but depending on how
your DAL works, it may be good, bad or simply indifferent code.

The trick is finding the happy medium for your project, something that lets
you start now and move forward later. As long as you keep the "limitations"
of an RDBMS in the back of your mind while developing, you shouldn't have
too much of a problem when you actually do switch over. But if you disregard
those limitations, thinking you'll "burn that bridge when you come to it",
there will be more than that bridge ablaze in the end I fear.

Regards,

Will Hartung
(willh@msoft.com)



Relevant Pages

  • Re: complex filter and calculations in access
    ... switch to SQL view. ... query by switching to datasheet view, ... of your database using the from address in this post. ... pre-op infections yes/no ...
    (microsoft.public.access.queries)
  • Re: A little more meat this week
    ... implementation with a query processor capable of returning a result set ... then we can call it a result bag (SQL ... I'm looking at the data model and not database tools at this ... for a s/w developer and a s/w developer simplifies for the end user). ...
    (comp.databases.pick)
  • Re: complex filter and calculations in access
    ... when using a subquery on the same table as the main query, ... switch to SQL view. ... of your database using the from address in this post. ...
    (microsoft.public.access.queries)
  • Re: A little more meat this week
    ... said data includes lists." ... implementation with a query processor capable of returning a result set ... then we can call it a result bag (SQL ... I'm looking at the data model and not database tools at this ...
    (comp.databases.pick)
  • Re: dbdebunk Quote of Week comment
    ... > a lot of really bad SQL programmers. ... But SQL does not have a pointer data type or the ... > being told to design a database. ... But why is little Cindy Lou Who employee ...
    (comp.databases.theory)