Re: Efficient Integer Comparison Algorithm



On 05.03.2008 16:49, JThangiah wrote:
Hi Robert,
Thank you so much for your detailed reply. I am so glad I found
some help.
I have the answers for your questions.

1. Nature of data:
I am doing this for linking two Bio-Informatics databases. One is my
local system which uses Oracle. The other database is external and I
only get the data in a file format.
If you want, you can look at the file here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

It is nearly 430MB and following are the fields of interest to me:
tax_id Integer
GeneID Integer
start_position_on_the_genomic_accession Integer
end_position_on_the_genomic_accession Integer


The corresponding data in my local database has the following fields:
Taxon Integer
EXTFEATUREID String
LEFTEND Integer
RIGHTEND Integer

If the leftend and rightend here is the same as the start_position and
end_position in the external database, then I know that both represent
the same Gene.

These are just positive numbers and only one pair of co-ordinates.

PreCondition:
For a particular genome(taxon), if I sort both set of records in
ascending order of start position, the indexes in both should be
identical.

I am not sure what you mean by "indexes". I certainly cannot see the termin "index" mentioned above.

So, I cannot sort the entire DB that way. It should only be sorted for
one genome at a time.

I don't understand. You said, you wanted to match records based on identical value pairs (start, end). You can (in a relational database and also in Java code) easily sort by two criteria.

2. Database products:
Do you think there is a way to link my Oracle DB and the file input
directly to match the records?

Oracle can work with files, this is called "external tables". Please have a look at the docs at http://tahiti.oracle.com e.g.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref777

That's probably what I'd do if the format of your data can be easily made available via Oracle. Note though that the file needs to be on the server IIRC.

Since I have a key and set of values for each entry in both DB's,
should I use a Map instead of a List? By using a TreeMap, I can keep
the data sorted on addition and so I'm just left with the mapping of
coordinates in the corresponding indexes.

Since you only want to compare according to equality of your two values a HashMap is more efficient.

Will this be faster or is there a better way to do it like using a
SortedArrayList?

Don't even think about using lists for this. That's too slow. You probably should get yourself a book about data structures and algorithms to get a solid foundation.

By "pair type", do you mean an object with two ints and a String value
as members?

.... and with properly implemented equals() and hashCode() methods, exactly.

Is HashSet faster than any other datastructure for this purpose? Do
you think this will be the best one for this?

Yes.

3. Volume of data:
It is not too large as I exaggerated and is only around 3,48000
records in my local DB. The external DB has much more data than this

Did you mean 348,000 records? That's easily handled in memory.

which I do not need to deal with since I am only looking for a match
on all records in my local DB.

Then it's easy: create your class (see above), read in your local DB, store it in a HashSet or HashMap, then query the DB and for each row create a temporary instance of the class and look it up in the map / set.

Alternative: query both databases ordered by both coordinates and
compare records individually. You can then decide which ResultSet you
need to step forward based on the current comparison.

Sorry, I could not understand your alternate solution. Is this
comparing done by first sorting by start positions and then comparing
corresponding index values?

I don't understand why you bring up these indexes. What indexes?

No, select from the DB with an ORDER BY with two columns. Same for the local store. Then iterate through the ResultSet and your local data and compare. If the first ordered column in the local store data is less than in the DB data, increase it until they are same or greater. If same do the same for the second column until a match is found or the end of one of the two sources is reached.

Yes, certainly. You could, for example, partition your data set into
smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
number of threads which each processes one chunk. You only need to make
sure that you only have one thread that will generate the output XML.

I am not used to Concurrent programming till now. Do you know of any
sample code that I can refer?

http://www.amazon.com/Concurrent-Programming-Java-Principles-Patterns/dp/0201695812
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601

But it's probably not worthwhile in this case.

Cheers

robert
.



Relevant Pages

  • Re: NULLs
    ... >will two columns compare equal if they are both NULL? ... recent version of Oracle, and any reasonably compliant SQL-based database. ... Oracle has a non-standard way of dealing with empty strings that you have to ...
    (comp.databases.oracle.misc)
  • Re: NULLs
    ... >I have an application whose database was recently upgraded from 7 to 9i. ... If you compare NULL to NULL the result is NULL, ... This has been the case in any version of Oracle. ... Sybrand Bakker, Senior Oracle DBA ...
    (comp.databases.oracle.server)
  • Re: File Auditing - Fast DB import and data manipulation
    ... the files into the database is not the best way forward. ... on either system (you'll need to download gunzip from gnuwin32 ... The databases I've dealt with (Oracle, SQL Server, Access, Postgres) ... So, I need some way to compare todays file, to yesterdays and see ...
    (comp.lang.ruby)
  • Re: rules engine ideas? Trying to prevent tons of conditional branches in a logic filter.
    ... One thing you could do is to place your rules in a database. ... consists of an XPath expression for the operand, and operator, the value you ... compare against, and the score. ... apply each of the Xpath queries to ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Divining the full pathname of a file, all logicals translated
    ... compare it to the "current" state when an audit report is run. ... drive name needs to be retained also, unmasked from any logicals. ... The auditors may never see the actual database. ...
    (comp.os.vms)