Re: "t" stands for the type true and for the boolean expression true ..
- From: "John Thingstad" <john.thingstad@xxxxxxxxx>
- Date: Thu, 27 Oct 2005 11:30:04 +0200
On Wed, 26 Oct 2005 20:43:56 +0200, Duane Rettig <duane@xxxxxxxxx> wrote:
Bulent Murtezaoglu <bm@xxxxxxx> writes:
"DR" == Duane Rettig <duane@xxxxxxxxx> writes:[...] DR> Most of the genome analysis projects I know about do indeed DR> use characters rather than symbols, and they store large DR> strings of DNA material into ... strings! [...]
Hmpf. I assume, then, they use 64 bit machines with >4gig RAM?
Yes.
It seems the approach Paul Dietz outlined might have considerable performance advantages. Maybe the ability to address bytes directly as opposed to shifting and masking is a factor? I am not familiar with the field though, perhaps someone would tell us why the seemingly obvious optimization is not used.
Yes, see also my response to Christophe on this issue - it is desirable to maximize both speed and space in this kind of problem. Going sub-octet certainly optimizes the size side, but it may de-optimize the speed size in at least two ways (both the less efficient access of shift-and-mask pre- or post- processing, and the degradation that comes from not having built-in functions which work one natural word at a time).
I am hardly an expert, but the 'obvious solution' to me is to store it in a 64 bit unsigned integer array
if you have a 64 bit computer. I would use two bit's to represent a base-pair.
Now to compare large sequences (something they seem to do a lot.) you can use integer comparison.
This compares 32 pairs in one operation.. Considering you don't have to load and store each symbol as
well the speedup should be dramatic. (50 times?)
Further there is the size of the array. Stored in this matter a 3.3 Gph would fit
in a gigabyte. In the event you had to compare two such sequences you would need twice that.
With a list of 64 bit pointers to symbols you would need 100 Gb. Imagine the time spent reading this amount of data from disk!
Even with 8 bit character strings (if your 64 bit system still supports them!)
it would take 3.3 Gb pr sequence or 6.6 Gb total probably forcing
you to swap from disc.
I normally don't waste time on bit fiddling but in this case the benefits seem
enormous. Extracting bits from a integer gives a trivial overhead compared to
having to load large chunks of data from disk during a compare or simular operation.
I figure I must have missed something. But without seeing a complete list
of desired operations, average sequence size, etc. etc it it is hard to see what.
Is there someone her with experince (or knowlege of) with bioinformathics that could
tell me what I missed?
-- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/ .
- Follow-Ups:
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Björn Lindberg
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Christophe Rhodes
- Re: "t" stands for the type true and for the boolean expression true ..
- References:
- "t" stands for the type true and for the boolean expression true ..
- From: (typep 'nil '(satisfies identity)) => ?
- Re: "t" stands for the type true and for the boolean expression true ..
- From: eric
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Alan Crowe
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Marco Antoniotti
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Rob Warnock
- Re: "t" stands for the type true and for the boolean expression true ..
- From: Duane Rettig
- "t" stands for the type true and for the boolean expression true ..
- Prev by Date: Re: Newbie with some GUI questions
- Next by Date: Re: "t" stands for the type true and for the boolean expression true ..
- Previous by thread: Re: "t" stands for the type true and for the boolean expression true ..
- Next by thread: Re: "t" stands for the type true and for the boolean expression true ..
- Index(es):
Relevant Pages
|
|