Re: Data mining algorithm
- From: "Rob Thorpe" <robert.thorpe@xxxxxxxxxxxx>
- Date: 15 Apr 2006 15:51:48 -0700
Aki Tuomi wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Rob Thorpe kirjoitti:
Aki Tuomi wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I was wondering about creating a data mining algorithm that can seek up
the "best" descriptive strings for data. Specifically DNA data.
The idea behind here is to look for the longest most repetitive
sequences. So that the longest possible repetition is not the best,
instead it's the one that is longest and most repetitions. I have been
thinking about using the basic association lookup method, but somehow I
just feel that there might be a better way to do this... Any ideas?
That sounds like the first step of a compression algorithm. In a
compression algo the strings with more repetitions have small symbols
associated with them.
Look up the huffman algorithm and the Burrows-Wheeler transform.
I am somewhat familiar with the basic concepts in compression
algorithms, and I am not sure whether this is the same thing. The idea
here is to see what kind of "patterns" emerge in the data. Take DNA data
for example, I might be interested to see that a certain "pattern"
(global pattern) can be seen in the data. A human can do this with a
glance, if the data set is sufficiently small. Usually it is too large
for this though.
So, I will look up those resources you pointed out, they will, no doubt,
provide valuable information about this. Yet I somehow feel that they
are not exactly what I want. But we'll see...
Finding patterns and compression are very closely linked.
Consider a series of bases:
ATTGAACGGGTTTATGACCC
A distingushing characteristic of this series is, as you point out, the
most common run of a set of bases.
One way to find this set is by rotating the data so each character in
turn is at the front.
ATTGAACGGGTTTATGACCC
CATTGAACGGGTTTATGACC
CCATTGAACGGGTTTATGAC
CCCATTGAACGGGTTTATGA
ACCCATTGAACGGGTTTATG
GACCCATTGAACGGGTTTAT
TGACCCATTGAACGGGTTTA
ATGACCCATTGAACGGGTTT
TATGACCCATTGAACGGGTT
TTATGACCCATTGAACGGGT
TTTATGACCCATTGAACGGG
GTTTATGACCCATTGAACGG
GGTTTATGACCCATTGAACG
GGGTTTATGACCCATTGAAC
CGGGTTTATGACCCATTGAA
ACGGGTTTATGACCCATTGA
AACGGGTTTATGACCCATTG
GAACGGGTTTATGACCCATT
TGAACGGGTTTATGACCCAT
TTGAACGGGTTTATGACCCA
Now you can find the repeating sequences (runs) by scanning from top to
bottom looking for repeats of the same string. But you can do one
better, you can sort the block from top to bottom
AACGGGTTTATGACCCATTG
ACCCATTGAACGGGTTTATG
ACGGGTTTATGACCCATTGA
ATGACCCATTGAACGGGTTT
ATTGAACGGGTTTATGACCC
CATTGAACGGGTTTATGACC
....
This eventually separates out the runs so they occur in blocks and
their grouping can easily be seen. The commonality of a run is the
number of rows it fills in the above table. (Its length is the number
of columns).
Another interesting metric of this table is the contents of the left
hand column. When written out it is very repetative at the beginning.
eg AAAAAC.. above.
In a similar data set this column will show similar repetativeness at
the beginning. This level of repetativeness represents the same
information as the # of rows for a run mentioned above.
In terms of compression the rightmost column is important because, even
after the sort, it contains enough information to reconstruct the
original string. This is operation is the Burrows-Wheeler transform. I
expect writings on the Burrows-Wheeler transform would be useful in
doing some of the above efficiently.
.
- References:
- Data mining algorithm
- From: Aki Tuomi
- Re: Data mining algorithm
- From: Rob Thorpe
- Re: Data mining algorithm
- From: Aki Tuomi
- Data mining algorithm
- Prev by Date: Re: free c compilers
- Next by Date: Re: Help!! Standalone, browser, client/server? Visual Studio or Java?
- Previous by thread: Re: Data mining algorithm
- Next by thread: Re: Data mining algorithm
- Index(es):
Relevant Pages
|