Re: Maximum String size in Java?

From: Randy Howard (randyhoward_at_FOOverizonBAR.net)
Date: 03/04/05


Date: Fri, 04 Mar 2005 16:12:39 GMT

In article <422876A1.66D5BA5@yahoo.com>, cbfalconer@yahoo.com says...
> Randy Howard wrote:
> > And I found that it was slower at 12. That's pretty contrived, for
> > most applications anyway. Neither is slow. In fact, 31/33 is
> > *slightly* faster for very short strings, and considerably slower
> > for long strings, so on average, SFH bakes it in the performance
> > category.
>
> Will it if the alignment problem is handled? I don't know.

So comment out the shortcut and find out.

> When you are pushing performance, the only thing that counts is the
> statistics on the data you are actually using. I said this several
> times in several forms. My opinion is that the use of long strings
> as keys is relatively rare.

My opinion is that agreeing that they're never long than 5 characters
is pretty rare.

> > Well, when you have collisions in the hash value itself, as seen with
> > 31/33, that's not going to change when you % that into a table (or
> > whatever method you use). I forget the exact detail, but I think it
> > was something really ugly, like "Ab" and "aB" hashing to the same
> > value. Also, I noticed some long sentence length strings that had
> > only one or two characters changed hashed to very nearly the same
> > hash value, which wasn't expected.
>
> Hashing to the 'nearly same' value is no problem for hash table
> use.

I think you ignored Paul's claim that something like 3000 collisions,
not near collisions, occur with trivially short strings using 31/33.
That's in the same league with hash = 1.

> > Apart from the major weakness of being easy to find collisions
> > using it perhaps.
>
> I don't concede that fact. That awaits tests on real data.

I note that you didn't choose to pay attention to Paul's algebraic
argument about this, I wonder why.

> > Not being harmed is not the same as "can't be improved upon". It
> > may turn out that in real world testing, there is no difference
> > between them, or one of them outright slays the other.
>
> Where did I ever make such a claim? I will not concede that an
> improvement that is totally masked by i/o is worthwhile.

Who said it was masked by I/O? Furthermore, if the improvement
is on the collision side, with performance coming along in the
bargain, that's farcical at best.

-- 
Randy Howard (2reply remove FOOBAR)
"Making it hard to do stupid things often makes it hard
 to do smart ones too." -- Andrew Koenig


Relevant Pages

  • The math of CRC functions
    ... text such that the chance of hash collisions is a small as possible ... I'm thinking CRC functions are a good answer. ... the inputs are all strings of ASCII text and thus the ... But what if all I care about is minimizing the chance of hash ...
    (sci.math)
  • Re: Basic Question
    ... example, which produces a 128-bit hash, is guaranteed not to produce any ... to produce collisions for some strings no longer than 129 bits. ... "substantially less" can be calculated given the exact meaning of ...
    (talk.politics.crypto)
  • Re: Basic Question
    ... example, which produces a 128-bit hash, is guaranteed not to produce any ... to produce collisions for some strings no longer than 129 bits. ... "substantially less" can be calculated given the exact meaning of ...
    (talk.politics.crypto)
  • Re: One-to-one Hash functions
    ... >>took the hash of all of the possible strings of that size, ... There will be a few collisions among the 2^128 ... input strings of length 128 bits. ... probability of hitting one by accident in a billion ...
    (sci.crypt)
  • Re: Thou shalt have no other gods before the ANSI C standard
    ... "Randy Howard" wrote in message ... the addition may yield a signed integer overflow. ... > typical strings ...
    (sci.crypt)