Re: Large Scale PHP Application Design Questions
- From: "C. (http://symcbean.blogspot.com/)" <colin.mckinnon@xxxxxxxxx>
- Date: Sun, 19 Oct 2008 03:53:10 -0700 (PDT)
On 17 Oct, 17:02, Erwin Moller
<Since_humans_read_this_I_am_spammed_too_m...@xxxxxxxxxxxxxxxx> wrote:
Michael Vilain schreef:
In article <48f8a41c$0$195$e4fe5...@xxxxxxxxxxxxxx>,
Erwin Moller
<Since_humans_read_this_I_am_spammed_too_m...@xxxxxxxxxxxxxxxx> wrote:
C. (http://symcbean.blogspot.com/) schreef:
When doing your database design - don't use autoincrement ids - thisWhat?
closes the door on a lot of routes for scaling your application beyond
a single DBMS (use a sequence number and repository identifier (e.g.
DB host name) instead if there is no intrinsic key).
I don't think that is good advise.
IMHO it is ALWAYS a good idea to use autoincrement PRIMARY KEYS on every
table when designing relational databases.
We're getting a bit OT here, but....
Most data has an implicit primary key. I find it easier when using
concrete examples, although what is a primary key in one context does
not always apply to all contexts - given this caveat, lets say that
we're writing a web based app to replace this newsgroup (as if!). As
part of the design process we decide to not allow assertions of
identity or anonymous posts. A good primary key for identifying the
users would be their email address: its information we would need to
store about the users, is effectively unique and each user would
require one to be able to access the system. Potentially a person can
have more (or less) than one email address and an email address can
refer to a group of people as well as an individual - but within the
scope of the application its valid.
We don't gain anything by adding another identifier for the user.
Potentially we lose something if we do not require this attribute of
the relation to be unique - people could create multiple accounts
using the same email address - we're trying to keep track of people
after all - not all accounts. Of course, you could have auto-increment
as the PK and a unique index on the email address - but this increases
the amount of storage and effort required to change data.
It becomes more messy when you start looking at referential
constraints. Even though MySQL does not enforce these (I believe the
relevant keywords are parsed but last time I looked it doesn't
actually enforce them - so this might be available in future even if
it is not available today) it provides a model for maintaining a
coherent data set.
I'm not saying that there are no uses for such synthetic keys - just
that it should never be the default option - when you choose to use a
surrogate key, it should be for a valid reason.
I could go on, but google does a much better job - just search for
"surrogate key relational database".
I do not see how this closes the door if you scale your app up to
multiple databases.
Care top explain?
Vertical scalability (bigger, faster) always has a limit -
particularly with relational databases where there are always some
operations which must be atomic. Horizontal scalability (adding more
nodes) is effectively limitless. So for good scalability, you need a
data model which will scale horizontally i.e. you need to be able to
add more DBMS servers doing the same job. With the MySQL master/slave
model, all updates need to be done on the one box. This is the first
bottleneck. The next problem is that of split-brain behaviour in
clusters - you won't get very far in reading about clusters without
learning about the problems of fencing. This is particularly apposite
to the problem of web development because its a lot easier and a lot
cheaper to have servers in two different data centers with 3 9's
uptime than to find a datacenter with 6 9s uptime to put them both in.
But back to auto-increment ids. If a tuple is unique on one DB server,
then because we want a cluster which is fault tolerant, that tuple
must be unique across all members of the cluster. Two MySQL DBMS
servers with the same schemas will generate the same id for two
different tuples. Trying to sort this out when the data is replicated
is virtually impossible - a messy solution is to consolidate to a
different schema from that where the surrogate key was generated -
i.e. the nodes of the cluster are not equivalent, and replication is
asymmetric. This is particularly a problem with MySQL ISAM and DBD
tables because an auto-increment column must be the primary key
(rather than just a component of it).
Consider the following two schemas running on, say 5 database cluster
nodes:
CREATE TABLE orders (
id MEDIUMINT NOT NULL AUTO_INCREMENT,
cust_ref VARCHAR(40),
....
FOREIGN KEY (cust_ref) REFERENCES cust (ref)
PRIMARY KEY (id)
);
CREATE TABLE orders (
id MEDIUMINT NOT NULL,
cluter_node MEDIUMINT NOT NULL,
cust_ref VARCHAR(40)
....
FOREIGN KEY (cust_ref) REFERENCES cust (ref)
PRIMARY KEY (id, cluster_node)
);
(for simplicity, assume this table is insert only - never updated/
deleted)
The latter allows data to be consolidated across all servers - as each
node can work out what the last order it received from any other node
is, it can simply copy over the data:
locally we can identify the last data we got from remote servers by:
.....SELECT cluster_node, MAX(id) FROM orders GROUP BY
cluster_node....
and we can easily fetch this data from the the remote node:
.....SELECT * FROM orders WHERE id>$MAX_ID_LOADED_FOR_NODE AND
cluster_node=$NODE
(this is easily extended to accomodate updates, but not deletes by
using a timestamp field).
The former schema provides no way of replicating/synchronising data
without collisions. While you could try getting around this by seeding
the autoincrement id to different values on different nodes, there is
no way to prevent overruns (and hence collisions) and it really
becomes tricky to spot where the gaps are (node A might have ids 1-20,
node B has 100-119, how does node C tell which data it already has
from node A, when we've already copied B's data to A).
Note that if we use a local sequence number it is also possible to
identify tuples which have been omitted from the synchronisation:
SELECT a.cluster_node, a.id+1 AS MISSING
FROM orders a LEFT JOIN orders b ON (a.cluster_node
= b.cluster_node AND a.id=b.id+1)
WHERE b.id IS NULL
Most cluster capable databases will support cluster wide sequence
numbers - but they introduce additional overheads and complications
and don't solve the replication problem (you get the same issue as pre-
seeding, but these implementations allocate ranges of ids from a
central source - and usually go horribly wrong when the sequence runs
out and either the allocating node is unavailable or the cluster is
split).
Schreef (formerly known as C.)
.
- Follow-Ups:
- Re: Large Scale PHP Application Design Questions
- From: Jerry Stuckle
- Re: Large Scale PHP Application Design Questions
- References:
- Large Scale PHP Application Design Questions
- From: Jesse Burns
- Re: Large Scale PHP Application Design Questions
- From: C. (http://symcbean.blogspot.com/)
- Re: Large Scale PHP Application Design Questions
- From: Erwin Moller
- Re: Large Scale PHP Application Design Questions
- From: Erwin Moller
- Large Scale PHP Application Design Questions
- Prev by Date: Pretty Good Free PHP and MySQL Development Site
- Next by Date: Re: swfupload class problem
- Previous by thread: Re: Large Scale PHP Application Design Questions
- Next by thread: Re: Large Scale PHP Application Design Questions
- Index(es):
Relevant Pages
|
Loading