Re: Large Database System
- From: user923005 <dcorbit@xxxxxxxxx>
- Date: Mon, 22 Oct 2007 12:32:52 -0700
On Oct 20, 10:53 pm, raidv...@xxxxxxxxx wrote:
Hi there,
Thank you for taking the time to answer the post.
Say that last sentence out loud in front of a group of DBAs and I
guess you will get a little bit of mirth. This statement alone is
proof that your project will fail.
We are using the BOINC distributed architecture for now, which seems
to already be working on hundreds of thousands of machines. We want to
add database capabilities to the data files that are being processed.
Every database system (even simple
keysets like the Sleepycat database) needs administration.
Because of the sheer number of machines involved in computations we
need to avoid that. If it will not be possible we'll have to stick
with XML until we find something better.
I do not understand why you don't want to store your thousands of
files on a single database server and then let the machines check out
problems from the database server. It seems a much less complicated
solution to me. The administration is now confined to a single
machine.
I imagine it like this:
The data is loaded into a single, whopping database on a very hefty
machine (Ultra320 SCSI disks, 20 GB RAM, 4 cores or more). There is a
"problem" table that stores the data set information and also the
status of the problem (e.g. 'verified', 'solved', 'checked out',
'unsolved'). The users connect to the database and check out unsolved
problems until all are solved and then check out solved problems until
all of them are verified.
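The check-out scheme described above can be sketched roughly as follows. This is only an illustration, not a design: the post names a "problem" table and its four statuses, but the column names (dataset, worker) and the helper functions are assumptions. It uses Python's built-in sqlite3 module purely so the example runs on its own; the proposal itself is a single server-class database that many machines connect to.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical layout: only the table name "problem" and the status
    # values come from the post; the other columns are assumptions.
    conn.execute(
        """CREATE TABLE problem (
               id      INTEGER PRIMARY KEY,
               dataset TEXT NOT NULL,
               status  TEXT NOT NULL DEFAULT 'unsolved',
               worker  TEXT
           )"""
    )

def check_out(conn, worker):
    """Claim one unsolved problem for `worker`; return its id, or None."""
    with conn:  # commits on success, rolls back on an exception
        row = conn.execute(
            "SELECT id FROM problem WHERE status = 'unsolved' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status test in the WHERE clause guards against a second
        # worker having claimed the same row in the meantime.
        updated = conn.execute(
            "UPDATE problem SET status = 'checked out', worker = ? "
            "WHERE id = ? AND status = 'unsolved'",
            (worker, row[0]),
        )
        return row[0] if updated.rowcount == 1 else None

def mark_solved(conn, problem_id):
    """Move a checked-out problem to 'solved'."""
    with conn:
        conn.execute(
            "UPDATE problem SET status = 'solved' "
            "WHERE id = ? AND status = 'checked out'",
            (problem_id,),
        )
```

The verification pass would work the same way, checking out 'solved' rows and marking them 'verified'. All the bookkeeping lives in one place, which is the point of the suggestion.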
Listen, you are going to have tens of thousands of points of failure
in your system. Is that what you really want? If you have (for
instance) 20,000 machines getting a big pile of data shoved down their
throat, you pretty much have a guarantee that a few hundred are going
to be out of space and that once a month a disk drive is going to fail
somewhere.
That is not a concern. When you are going to have the same data
replicated on 1 to 10 machines, reliability is no longer an
issue.
What if you get 4 different answers? What if the data is damaged on
three of them? Reliability is always an issue. The more complicated
the system, the more difficult it will become to verify validity of
your answers.
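To make the question concrete, here is a minimal sketch of one way replicated answers might be reconciled. Neither poster proposes this; the reconcile function and its quorum parameter are assumptions introduced only for illustration. It accepts an answer only when enough replicas agree, and reports nothing otherwise.

```python
from collections import Counter

def reconcile(answers, quorum):
    """Return the answer reported by at least `quorum` replicas, else None.

    `answers` is a list of hashable results, one per replica.
    `quorum` is a hypothetical policy knob, not something from the thread.
    """
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count >= quorum else None
```

With four replicas returning four different answers, no value reaches a quorum of 2 and the function returns None, so the problem has to be handed out again. Replication alone does not settle which answer is right.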
Do you know what happens to performance when you put thousands of
active files on a machine? Pretend that you are a disk head and
imagine the jostling you are going to receive.
I haven't been able to provide more details about the project but most
of the data will be historic in nature. Once a calculation is
performed that data will be stored and in most cases will no longer be
active. It will still be needed though. So having thousands of files
on a machine is not so bad. This is not a classic database
application and that is why it probably seems strange, that features
like reliability which should be on top, are listed as last and are
not a concern.
Data reliability is always a concern. If you cannot verify the
reliability of the data, then nobody should trust your answers.
They are the size that they are for a reason. It's not fat that gets
trimmed off to scale things down, it's muscle.
You do know that SQLite is a single user database?
That is exactly what we need. Data will be sent over the Internet to
other machines which will also use a single user database.
How will you coordinate who is working on what steps of the problem?
The right thing to do is go to SourceForge and execute a few
searches. The pedagogic answer is to refer you to the newsgroup
news:comp.sources.wanted, but it's a ghost town.
I have looked over there but I should probably search again.
Thank you.
I suspect that you have no idea what you are doing. Do you have any
concept about what is going to happen when your problem scales to
10GB?
If things go bad at 10 GB we can just go with 1 GB or if 1 GB is not
good we can go with 100 MB. We can always increase the number of files
and distribute the data on more machines. The ideal solution is to
have the data in large compact files.
Get a consultant who understands the problem space or you'll be
sorry. By the way, this is definitely not the right forum for your
post -- which does not exactly make it appear that you have anything
on the ball. (Really a newsgroup post in general is the wrong
approach here).
I guess that FastDB or GigaBase might be suitable (WARNING! One
writer at a time). I also guess that at some point you are going to
severely need the capabilities that you do not think you need.
http://www.garret.ru/~knizhnik/databases.html
Finding out about these two databases is a step forward and it seems
that it was worth posting here. Again, I appreciate your time.
We will look closer at these.
Another possibility is QDBM: http://sourceforge.net/projects/qdbm/
I guess that you will like this one but also that it is the wrong
choice.
I have looked at it before. It appears to be quite new and there are
not many people using it, and we do not want to go down a narrow road
that is less traveled.
I don't know anything about your project but I think you need to
rethink your big picture of how you are going to solve it.
Since single user data access is what you are after, FastDB might be
interesting. If you compile it for 64 bit UNIX you can have files of
arbitrary size, and they are memory mapped so access should be very
fast. I have done experiments with FastDB and its performance is
quite good. You can use it as a simple file source but it also has
advanced capabilities. The footprint is very small.
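For readers unfamiliar with memory-mapped access, here is a minimal sketch of the general idea. It does not use FastDB or its API; it uses Python's standard mmap module, and the fixed-size record layout and the read_record helper are assumptions made for the example.

```python
import mmap

def read_record(path, index, record_size):
    """Return record number `index` from a file of fixed-size records.

    The file is mapped into the address space and the record is read
    as a slice, so the operating system pages in only what is touched.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = index * record_size
            if index < 0 or start + record_size > len(m):
                raise IndexError("record index out of range")
            return m[start:start + record_size]
```

A mapping has to fit in the process's address space, which is why a 64-bit build matters for very large files.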
I think we should move the discussion to news:comp.programming, and so
I have set the follow-ups.