Re: "t" stands for the type true and for the boolean expression true ..



On Wed, 26 Oct 2005 20:43:56 +0200, Duane Rettig <duane@xxxxxxxxx> wrote:

Bulent Murtezaoglu <bm@xxxxxxx> writes:

"DR" == Duane Rettig <duane@xxxxxxxxx> writes:
[...]
    DR> Most of the genome analysis projects I know about do indeed
    DR> use characters rather than symbols, and they store large
    DR> strings of DNA material into ... strings! [...]

Hmpf. I assume, then, they use 64 bit machines with >4gig RAM?

Yes.

It
seems the approach Paul Dietz outlined might have considerable
performance advantages.  Maybe the ability to address bytes directly
as opposed to shifting and masking is a factor?  I am not familiar
with the field though, perhaps someone would tell us why the seemingly
obvious optimization is not used.

Yes, see also my response to Christophe on this issue - it is desirable to maximize both speed and space in this kind of problem. Going sub-octet certainly optimizes the size side, but it may de-optimize the speed size in at least two ways (both the less efficient access of shift-and-mask pre- or post- processing, and the degradation that comes from not having built-in functions which work one natural word at a time).


I am hardly an expert, but the 'obvious solution' to me is to store it in a 64 bit unsigned integer array
if you have a 64 bit computer. I would use two bit's to represent a base-pair.
Now to compare large sequences (something they seem to do a lot.) you can use integer comparison.
This compares 32 pairs in one operation.. Considering you don't have to load and store each symbol as
well the speedup should be dramatic. (50 times?)


Further there is the size of the array. Stored in this matter a 3.3 Gph would fit
in a gigabyte. In the event you had to compare two such sequences you would need twice that.


With a list of 64 bit pointers to symbols you would need 100 Gb.
Imagine the time spent reading this amount of data from disk!

Even with 8 bit character strings (if your 64 bit system still supports them!)
it would take 3.3 Gb pr sequence or 6.6 Gb total probably forcing
you to swap from disc.


I normally don't waste time on bit fiddling but in this case the benefits seem
enormous. Extracting bits from a integer gives a trivial overhead compared to
having to load large chunks of data from disk during a compare or simular operation.


I figure I must have missed something. But without seeing a complete list
of desired operations, average sequence size, etc. etc it it is hard to see what.


Is there someone her with experince (or knowlege of) with bioinformathics that could
tell me what I missed?


--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
.



Relevant Pages

  • Re: Segmented Binary Diff
    ... > If any matching of sequences would break stuff, ... If you compare pairwise first then I think you will indeed have to throw ... Iterate over all of the edit scripts simultaniously ... The trick is to make sure you compare offsets corresponding to the same ...
    (comp.programming)
  • Re: "t" stands for the type true and for the boolean expression true ..
    ... >> I am hardly an expert, but the 'obvious solution' to me is to store it ... In the event you had to compare two such sequences you ... > The comparisons involved are rarely base by base. ...
    (comp.lang.lisp)
  • Re: abundance of irrationals
    ... >> If you compare ... > partitioning of the rationals or a collection of infinite ... > sequences of rationals (Cauchy sequences whose differences are null ... > Given either the partition form or the family of sequences form, ...
    (sci.math)
  • Re: "t" stands for the type true and for the boolean expression true ..
    ... Considering you don't have to load and store each symbol ... two sequences to be compared start at the same base-pair offset modulo ... can compare them using integer operations. ...
    (comp.lang.lisp)
  • Re: print values out of a hash - filling the hash
    ... > I want to compare two fasta-files, more precisely the IDs of two sets ... Here you seem to say that you want to compare the sequences and not the ... If you want to associate all sequences with their corresponding IDs then ...
    (perl.beginners)