Re: Can Binary Search Be Improved with Threading



jehugaleahsa@xxxxxxxxx wrote:
My boss suggested that he could write a faster binary search than I
could by creating a new thread and searching the data on each half on
the range. In other words, he would create a new thread to check the
other half of the range. I am assuming this is a binary search that
operates on contigious memory like an array, C++ vector, etc or in a
binary search tree.

Now, I thought this was rediculous but I kept it to myself. My
coworkers aren't trained programmers or Computer Scientists. As I saw
it, every time you split the list, you already know which way to go,
so why would you create a thread to check the other side? The overhead
of the thread is overkill, and one of the threads will always lead to
a dead end (it also destroys the invariant).

I think, basically, you're right, and the idea makes little to no sense.

However... I could actually see where it *might* be possible to squeeze
a little better performance (by which I mean shorter total turnaround time)
by using threads on a multi-CPU system. The reason for this is that in some
cases, a binary search is a whole series of cache misses. You're not
accessing memory linearly, and you're not really accessing in any sequence
that the processor/cache expects.

So you could conceivably have a couple of threads pre-fetching the next
two possible locations, incurring the cache misses just a little bit
earlier, and thus causing the cache to start working to hit main memory
just a little bit earlier. While the main thread is looking at position
1/2*N, the two other threads are trying to read positions 1/4*N and 3/4*N,
solely to cause the cache miss earlier and thus cause the memory to be
fetched a couple of cycles earlier.

I have no idea whether this could really be done successfully on a real
processor architecture that already exists. Most likely, the
synchronization overhead would kill all the gains, but you never know
for sure until you try it. There might be some clever trick that
could reduce the overhead and make it worthwhile.

Incidentally, one could also imagine a sort of trinary search: given an
interval of N items, instead of looking at item 1/2*N and dividing the
interval in half, use 2 CPUs to simultaneously look at items 1/3*N and
2/3*N, and the divide the interval in thirds. The problem is, once
again, the synchronization overhead is probably overwhelming. But
maybe not if it takes a long time to compare two elements, and this
could potentially give you memory access patterns benefits as well, if
the memory controller can simultaneously do two streams of arbitrary
reads from main memory.

So basically, if it can help to add another processor, it's probably
really tough and probably won't help much if it helps at all.

- Logan
.



Relevant Pages

  • Re: Probability distribution from an image
    ... Memory references and cache misses almost certainly dominate. ... I'd move from a binary search, ... At the leaves of the tree, ...
    (comp.graphics.algorithms)
  • Re: Cached memory never gets released
    ... Stock linux 2.4.26 kernel. ... Due to flash bug 3M of memory gets lost due to font memory getting lost ... The output of "free" cache number steadily grows. ... longer to exhaust all of system memory with the cache. ...
    (Linux-Kernel)
  • Re: Problem: Creating a raw binary string
    ... > While its true that a 64-bit cpu will move twice the data per instruction it ... > Memory bus width plays an important role here and unless it too is widened / ... You are forgetting the two levels of cache in the processor. ... The memory chips are addressed in Row col fashion. ...
    (alt.comp.lang.borland-delphi)
  • Re: Is Greenspun enough?
    ... Most OSes memory map executables directly from the file system so code doesn't pollute the file cache or swap space. ...
    (comp.lang.lisp)
  • Re: Superstitious learning in Computer Architecture
    ... Without a LOT of logic or some other better approach, re-executing the instructions requires re-decoding and it ties up the cache memory bus transferring more data as instructions than the instructions are working on. ... The concept of cache is fundamentally flawed in that it STILL restricts access to one word per clock cycle, when a single modern ALU can easily use 5 plus whatever is eaten up with instruction accesses. ... The size of an optimizing compiler is proportional to the SQUARE of the size of the language times the SQUARE of the complexity of the machine - because all interactions must be considered. ...
    (comp.arch.arithmetic)