Re: About Windows address space

From: C (blackmarlin_at_asean-mail.com)
Date: 04/25/04


Date: 25 Apr 2004 06:09:12 -0700


"fa2k" <marius_b@chello.no> wrote in message news:<fLfic.9895$EV2.96808@amstwist00>...
> "C" <blackmarlin@asean-mail.com> wrote in message
> news:33d97ee5.0404230244.63d01053@posting.google.com...
> > "Beth" <BethStone21@hotmail.NOSPICEDHAM.com> wrote in message
> news:o31ic.3$9N.1@newsfe1-win...
>
> Why didn't I see that message?:( I don't like news.chello.no...

I see it fine on http://groups.google.com

> > string_lookup:
> > generate_hash
> > lookup_in_hash_table
>
> Would adding the (e.g. six first) characters together suffice as a hash
> function when the table size is ~256 entries?

If you use a fairly good hashing function, four would be enough
for such a small table. A little noddling produces...

        ; eax = name (padded as needed)
        mov ebx, eax
        and eax, 0xffff
        shr ebx, 16
        xor al, ah
        xor bl, bh
        lea eax, [ eax + ebx * 4 ]
        and eax, 0xff

Which would achieve a fairly even distribution, while being
reasonably fast. BUT this only works if the names are likely
to have unique starting sequences, ie. the first four letters
are unlikely to match -- this may be the case in your application;
it works for older versions FORTH, where only the first 3 chars
and the length are stored.

> Oh, it wil probably not be good enough, as the alphanumerics
> only span a limited range of numbers. Any good ways to avoid
> the slow mul's? Multiplying by a constant?

Use a table of pointers to the memory locations, then simply
        mov eax, [ ebx * 4 + table_base ]
(assuming ebx is the index into the table).

> Most names will be at least three characters. I would need to
> have a table with character values for case insensitivity, but
> thats easy to do.

Yes, you could just do a 'to_lower_case' style function before
generating the hash (this could be done with a
"AND EAX, 0xdfdfdfdf" instruction before the above hash function.)

> And I'm not into binary trees, so I feel like creating a linked
> list on collisions:D

Nothing wrong with that.

> Its much worse performing,

That depends on if there are collisions and the probability
distribution of requests for strings.

> but if I create the hash table really big, there will not be
> (m)any collisions:)

Yes, though that will probably be wasteful on memory. It really
depends on what strings are likely to be looked up.

[You can have a look at the current Luxasm source code from
http://luxasm.sf.net -- though the project is far from finished
the labels.asm module has a complete fast string lookup routine
using a hash table and binary tree which you may be interested
in adapting (GPL license).]

On the other hand, as you said in another post, this is being used
to lookup a series of functions in a loaded module. If we assume
that each module only has a small number of functions and
contains a dictionary of these functions we could use a funciton
to search a simple list and get the pointer to the function. As
the list is likely to be small (eg <16 entries) the extra work
needed to generate the hash not have the payoff desired. Something
like...

        mov edx, [ function_name_to_lookup ]
        xor eax, eax
        mov edi, [ loaded_module_dictionary_pointer ]
        and edx, ~0x20202000 ; to_lower_case
.lp: or eax, [ edi * 8 ]
        jz .er
        add edi, 8
        cmp eax, edx
        mov eax, 0
        jne .lp
        mov eax, [ edi * 8 - 4 ]
.er: ret
        ; returns eax = function pointer, or 0 = not found

(The function does not resolve collisions -- so you must watch
this does not happen by selecting suitable function names.)

This requires a dictionary containing the first 3 letters
and the function name length. The following NASM macro could
be used to generate each dictionary entry.

        %macro dictionary_entry 2.nolist
                %ifnstr %1
                        %error "bad %1"
                %endif
                %strlen %%l %1
                %substr %%1 %1 1
                %substr %%2 %1 2
                %substr %%3 %1 3
                db %%l, %%1, %%2, %%3
                dd %2
        %endmacro

        module_dictionary: ; usage example
                dictionary_entry "join", join_function
                dictionary_entry "ping", ping_function
        dd 0 ; end marker

This algorithm is, of course, O(n) instead of O(1) for a
perfect hash function. However, we must remember that the
complexity function is only part of the total cost of the
algorithm -- a better representation would be ...

  total_cost = cost_per_repetition * Oa(f(n)) + startup_cost

  where Oa(f(n)) is the complexity function, stating the
  average number of repetitions. Oa(1) = 1, Oa(n) = n/2.

Therefore, even though the above algorithm has a worse
complexity funciton [ O(n) vs. O(1) ] both the startup_cost
and cost_per_repetition are _much_ lower -- hence, for
short lists of data, it will be faster! Especially if you
put the most probable functions (ie. those likely to be
requested most often) towards the start of the list.

C
2004-04-25