Re: Synchronization routine
- From: Willem <willem@xxxxxxxx>
- Date: Mon, 31 Oct 2005 00:27:49 +0000 (UTC)
Joris wrote:
) I'm looking into a writitng a efficient routine and am looking for different
) views of the problem:
)
) Basically I have two lists, one on my firewall and one in a database. The
) list on the firewall must become the same as the list in the database. The
) current sizes are over 1 million entries and I expect to see 2 million rows
) in the future.
) Unfortuanlly the firewall vendor (Microsoft) made loading that list very
) very slow when it gets large enough. Loading this list will take several
) days on a very decent machine. Unfortuanlly I have only a slow system for
) todays standards (would be top notch 6-7 years ago) but for the purpose.
)
)
) So I have:
)
) from the firewall:
) Set A, which I can randomly access (though slowly)
How do you randomly access it ? Do you access the Nth entry, or can you
access matching entries or something ?
) from the database:
) Set B, which is ordered, but can only be traversed sequentially once(!).
) I expect the sets to scale up to 2 million elements.
)
) The results should be:
) C = A / B = all elements from A that are not in B
) D = B / A = all elements from B that are not in A
)
) I do not update set A until after retreiving the results. I have absolutely
) no clue how they remove or add an enty to the list. So I don't want to jump
) to make any assumptions in this regard.
)
) Of course this must be done with only very limited computing power and I
) mind the memory use.
)
)
) My current solutions to the issue were:
)
) Under the assumption that set A was ordered I used a 'merge join' which was
) very fast. Unfortunally the assumption was flawed.
)
) Using a sort routine has not yet worked out. I used quicksort, but I don't
) believe its the correct solution. The entries in A are largely ordered so
) quicksort wasn't too efficient and finally caused a stack overflow.
You should tweak your quicksort so that it is efficient on largely ordered
sets(1). If it causes a stack overflow, you implemented it badly enyhow(2).
Or did you use some standard function ?
1) To make quicksort behave on almost-ordered sets, pick several items
from your sublist and take the median one as your pivot. Three should
suffice.
2) If you want to make sure that quicksort doesn't use more than O(log N)
stackspace, you have to implement it as tail recursion, and recurse into
the smaller partition first. Alternatively, you could implement your
own stack, allocating fixed space beforehand in the order log N entries.
) Next is using a 'hash join'. In my previous (probably inefficient)
) implementation I saw memory usage reach over 100 MB for only 40000 entries.
How much space would one entry take normally ? And how similar are your
entries ? If they often share the same prefix, you should look into tries.
) Maybe anyone has a good suggestion which direction I should look for?
) I thought it is a very nice problem, but hard to get it right.
It depends a lot on what your data looks like exactly.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
.
- References:
- Synchronization routine
- From: Joris Dobbelsteen
- Synchronization routine
- Prev by Date: Re: Red Black Trees
- Next by Date: Re: What is a "null loop"?
- Previous by thread: Synchronization routine
- Next by thread: Re: Synchronization routine
- Index(es):
Relevant Pages
|
Loading