Re: Synchronization routine



Joris wrote:
) I'm looking into a writitng a efficient routine and am looking for different
) views of the problem:
)
) Basically I have two lists, one on my firewall and one in a database. The
) list on the firewall must become the same as the list in the database. The
) current sizes are over 1 million entries and I expect to see 2 million rows
) in the future.
) Unfortuanlly the firewall vendor (Microsoft) made loading that list very
) very slow when it gets large enough. Loading this list will take several
) days on a very decent machine. Unfortuanlly I have only a slow system for
) todays standards (would be top notch 6-7 years ago) but for the purpose.
)
)
) So I have:
)
) from the firewall:
) Set A, which I can randomly access (though slowly)

How do you randomly access it ? Do you access the Nth entry, or can you
access matching entries or something ?

) from the database:
) Set B, which is ordered, but can only be traversed sequentially once(!).
) I expect the sets to scale up to 2 million elements.
)
) The results should be:
) C = A / B = all elements from A that are not in B
) D = B / A = all elements from B that are not in A
)
) I do not update set A until after retreiving the results. I have absolutely
) no clue how they remove or add an enty to the list. So I don't want to jump
) to make any assumptions in this regard.
)
) Of course this must be done with only very limited computing power and I
) mind the memory use.
)
)
) My current solutions to the issue were:
)
) Under the assumption that set A was ordered I used a 'merge join' which was
) very fast. Unfortunally the assumption was flawed.
)
) Using a sort routine has not yet worked out. I used quicksort, but I don't
) believe its the correct solution. The entries in A are largely ordered so
) quicksort wasn't too efficient and finally caused a stack overflow.

You should tweak your quicksort so that it is efficient on largely ordered
sets(1). If it causes a stack overflow, you implemented it badly enyhow(2).

Or did you use some standard function ?

1) To make quicksort behave on almost-ordered sets, pick several items
from your sublist and take the median one as your pivot. Three should
suffice.

2) If you want to make sure that quicksort doesn't use more than O(log N)
stackspace, you have to implement it as tail recursion, and recurse into
the smaller partition first. Alternatively, you could implement your
own stack, allocating fixed space beforehand in the order log N entries.

) Next is using a 'hash join'. In my previous (probably inefficient)
) implementation I saw memory usage reach over 100 MB for only 40000 entries.

How much space would one entry take normally ? And how similar are your
entries ? If they often share the same prefix, you should look into tries.

) Maybe anyone has a good suggestion which direction I should look for?
) I thought it is a very nice problem, but hard to get it right.

It depends a lot on what your data looks like exactly.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
.



Relevant Pages

  • Re: The Anime Primer "Pre-Approved List"
    ... but aren't yet in the Primer. ... only write entries for shows you like enough to recommend to ... quarterly anime review lists, three stars on Manbow Papa's quarterly ... Ouran High School Host Club ...
    (rec.arts.anime.fandom)
  • Re: Best writers I havent yet read
    ... Science Fiction> and. ... considered was to look for authors represented by multiple books in my ... So I've spent much of this past week compiling a list of author entries ... If anyone wants the full lists, or whatever, well, e-mail or post, ...
    (rec.arts.sf.written)
  • [info] The Anime Primer "Pre-Approved List"
    ... but aren't yet in the Primer. ... only write entries for shows you like enough to recommend to ... quarterly anime review lists, three stars on Manbow Papa's quarterly ... Tokyo Mew Mew / Mew Mew Power ...
    (rec.arts.anime.misc)
  • Re: OL2002: Adding list of contacts to a Distribution List after search
    ... My Contacts entries have multiple Categories, so let's say I want to ... or simply Friends New York. ... Categories - once in Friends, again in Business, and a 3rd time in New ... |> I need to create email distribution lists of people that meet ...
    (microsoft.public.outlook.contacts)
  • Re: Sort Map on Value
    ... they get sorted via an array). ... work with linked lists, unless one implements his own quicksort. ... choose a suitable partitioning key for an N-node sub-list. ...
    (comp.lang.java.programmer)

Loading