Re: Maximum String size in Java?
websnarf_at_gmail.com
Date: 03/04/05
- Next message: websnarf_at_gmail.com: "Re: Maximum String size in Java?"
- Previous message: Tim Rentsch: "Re: The code of co-workers"
- In reply to: Randy Howard: "Re: Maximum String size in Java?"
- Next in thread: Randy Howard: "Re: Maximum String size in Java?"
- Reply: Randy Howard: "Re: Maximum String size in Java?"
- Reply: CBFalconer: "Re: Maximum String size in Java?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 3 Mar 2005 22:51:10 -0800
Wow a flame war dedicated to me. I love it!
Randy Howard wrote:
> cbfalconer@yahoo.com says...
> > IIRC I pointed out that the lack of portability was unnecessary.
> > For example, for a 32 bit quantity he uses int. He also does
> > shift operations on that. It takes very little forethought to
> > use an unsigned long here and avoid all the difficulties, yet
> > generate the same machine code.
>
> It takes about 2 seconds to change it as you describe. As such,
> it's not a big deal. It's not as if he has hundreds of variables,
> all of types you don't like.
It also does you no good. The hash function I created needs *precise*
sizing, not "at least" sizing. I.e., if int and/or long are 36 bits,
you are SOL. The hash function will *NOT* have the minimal collision
properties I claim it does. The same thing is true for One-at-a-time
hash, and Bob Jenkin's hash function.
Its the same problem for systems that decide long is 64 bits, but int
is 32 bits, or systems that decide long is 32 bits but int is 16 bits.
Do you see how I can't win no matter what I choose? I need 32 bits.
Not more, and not less. *BOTH* deviations break the code.
The problem with the CLC group's neurotic notions about the portability
of base integer types is that they don't realize that for some
applications (hashing and crypto) you can't have "at least" some number
of bits, you need *exactly* some number of bits. MAX_INT and CHAR_BIT
doesn't help unless you have a deterministic way of generating a type
with exactly x-bits which support 2s complement. You just *need* that
for crypto. If CBFalconer thinks he can bring his nonsense *into* the
crypto world (sci.crypt), then I think he ought to get a taste of his
own medicine.
> > I also suspect an alignment problem because the data passed in is
> > a char array, yet it is treated as if aligned for an int. This is
> > not good.
>
> That's true. I'd like to see a test with random strings, with
> them intentionally "misaligned", on a sparc or something and see
> what sort of sparks fly out. Unfortunately, I don't have one
> handy.
Well, I'll save you the trouble. Look more *closely* at the code. Oh
yes, the accumulator needs to deal in 32-bit blocks because of the way
it is computed. But the source data is specifically read in
little-endian chunks one-byte-at-time (except for on the x86 where
there is high performance hardware support for little endian unaligned
reading.) So in fact, by its very design, the program is actually
fairly insensitive to data alignment.
You guys seem to be able to read code for "non-ANSI compliance" with
relative ease -- now tell me what is the value of that in the face of
the fact that this "alignment scenario" is also a complete non-issue
that you can quite easily see from the code as well. Of course if you
really want to write a test for this, be my guest. (Your first
instinct about the "end cases" was a better bet to find a potential
issue, since you cannot easily determine whether or not its going to be
an isse from just reading the code.)
> > When you get down to the string lengths that usually appear as
> > keys, the end effects and overhead have a serious effect. I found
> > that the times31 (or 33) system was faster at a string length of
> > 5.
>
> And I found that it was slower at 12. [...]
On a Pentium. And remember this really isn't apples to apples. If you
want to remove all the good collision properties of my hash (and thus
make it more comparable to the 31/33 hash) you can remove all the
avalanching code at the end, and perform the following simplification:
newHash(data,n) = SfhNoAvalanche(data,n-1)^data[n-1];
If you do this, you will *still* have a hash function that is at least
as robust as the 31/33 hash, and I suspect the performance cross over
will be at an even lower number (if there is any at all.)
> [...] That's pretty contrived, for
> most applications anyway. Neither is slow. In fact, 31/33 is
> *slightly* faster for very short strings, and considerably slower
> for long strings, so on average, SFH bakes it in the performance
> category.
Yeah, admittedly I only considered hashing of strings where someone
might actually be worried about the performance impact. So defending
the 33/31 hash versus SFH is kind of like defending Bubblesort versus
Heapsort -- yeah the nonserious candidate *is* faster for really small
n.
> > Nobody has bothered with my statement that the hash distribution
> > over 32 bit values is of no interest, what is of interest is its
> > distribution over the hash table size.
Besides being an outright lie (I addressed this in the sci.crypt
thread), this is nonsense. Most hash designs use *PART* of the hash
function output for the table index, but then store the whole original
output along with the entry in the hash table in order to create a
"fast check" mechanism to see if a collision has occurred (versus a
duplication.)
> Well, when you have collisions in the hash value itself, as seen with
> 31/33, that's not going to change when you % that into a table (or
> whatever method you use). I forget the exact detail, but I think it
> was something really ugly, like "Ab" and "aB" hashing to the same
> value.
Its not hard dude -- just solve the following:
33*x + y == 33*z + u
-> 33*(x-z) == u - y
So for x and z differing by say, 1, you need to find values of u and y
that differ by 33. In the ASCII range, that's easily done by a shift
in case ('a'=97, 'A'=65 for a difference of 32 for example.) You also
get the ASCII numbers shifted to the letters P through Y. Also the
common punctuation characters ! ( ) , . all shift by 33 to the letters
B I J M O and thus also shift by 66 to the letters c j k n p. So its
really easy to come up with many many hash collisions for any pair of
strings which only differ in the last two characters.
In fact a quick back of the envelope computation suggests that in two
characters I can find 2345 collisions in the 33/31 hash function. The
number of collision in 3 characters has got to be enormous, but seeing
this 2 character flaw was enough for me to stop any further analysis.
Compare this with SFH, which (by construction) has 0 collisions for up
to 4 bytes.
> > real data, and I mentioned some ideas about designing that
> > experiment. I also offered some code useful in buiding and
> > measuring such experiments.
As opposed to me who just provided a test hoist, 5 competing hash
functions, a timer, and results from 5 different compilers.
> I definitely think it's worth pursuing further. If for no other
> reason that its interesting to play with. Or perhaps try and
> cross-pollenate the two. :-)
That would be like breeding Carl Lewis with Dr. Ruth.
> > I just don't know whether SFH is a useful idea or not. I do know
> > that there are obvious weaknesses in the published code. These
> > need to be handled before serious tests are done.
>
> Well, it's a short enough function that it shouldn't be very hard
> to get it like you want it.
Which, as I said, won't help. If he were serious about it, and
believed the CLC dillusion about C99, he could just include <stdint.h>
and use int32_t or whatever.
> > There are no such obvious weaknessis in the times 31 or 33 system.
This is clearly a mispelling of "strength". Its slow and it collides
(in *ALL* bits of output) at the least provocation.
> Apart from the major weakness of being easy to find collisions using
> it perhaps.
>
> > To handle nul terminated strings SFH as presently coded will have
> > to be combined with a call to strlen, which certainly increases
> > its running time.
A barb of the most desperate kind. A few parallel compare-branches, in
place of the address updates and length checking, would not seriously
harm the performance of SFH. Of course all this ignores the details of
the scenario you proposed, of reading a big text file and spitting the
string into smaller tokens -- meaning that the length of the
sub-strings would be incrementally available as you built them.
> By not using nul termination, his function will work with unicode
> strings, or any other arbitrary data that you wish to hash, and
> 31/33 will not. So if you are really worried about portability, then
> you need to be concerned about Unicode strings. Also, I suspect
> his bstring library carries the length around with it (I can't
> remember for sure, been a while since I looked)
Well then you should look at it again! Because yes it does! And its
got good sparkle goodness to it as well!
Buy one get one free! Limited time offer!
Seriously though, yes, in fact the only way I even think about strings
these days is in the bstring implementation, so I never give a first
thought to "calling strlen()". I didn't make explicit reference to
bstrings because I believe SFH and bstrlib stand well on their own
independently from each other. They they work extremely well together
should not surprise anyone.
> [...] so it comes along with the deal. It wouldn't be hard
> to take the SFH approach and make it work on nul terminated
> strings only if you feel that is an absolute requirement.
>
> > As far as the efficacy of those routines, I have only the word of
> > Kernighan and Pike in "The Practice of Programming". Their
> > opinions have been fairly sound in the past.
>
> If that was the case, then nothing invented since it was published
> would be worth investigating.
Well -- I take a slightly different view. Basically one of Kernighan,
Ritchie or Thompson is responsible for the abberation known as "the C
standard library". And the general syntax/design of the C core
language really borrows heavily from BCPL, so anything good about that
is unlikely to be due solely to them. So I'm pretty skeptical about
anything those people say. (Remember one of these guys wrote gets() --
*REMEMBER* this.)
> > Such applications as my version of Markov have not been obviously
> > harmed by their use.
Oh, well there's a good proof. Have you compared it to just summing
the input characters together? I suspect there hardly exists a test
that could tell the difference between the quality of that and the
31/33 hash.
> Not being harmed is not the same as "can't be improved upon". It
> may turn out that in real world testing, there is no difference
> between them, or one of them outright slays the other.
Well, you already have the performance numbers. All that remains is
your independent verification of what I already know about it from a
collision point of view. I still have access to that PPC64 running
linux, so if you write a more comprehensive test and hand it back to
me, I can get you 5 compilers worth of data.
-- Paul Hsieh http://www.pobox.com/~qed/ http://bstring.sf.net/
- Next message: websnarf_at_gmail.com: "Re: Maximum String size in Java?"
- Previous message: Tim Rentsch: "Re: The code of co-workers"
- In reply to: Randy Howard: "Re: Maximum String size in Java?"
- Next in thread: Randy Howard: "Re: Maximum String size in Java?"
- Reply: Randy Howard: "Re: Maximum String size in Java?"
- Reply: CBFalconer: "Re: Maximum String size in Java?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|