Re: Maximum String size in Java?

From: CBFalconer
Date: 03/04/05

Date: Fri, 04 Mar 2005 15:42:33 GMT

Randy Howard wrote:
> says...
>> IIRC I pointed out that the lack of portability was unnecessary.
>> For example, for a 32 bit quantity he uses int. He also does
>> shift operations on that. It takes very little forethought to
>> use an unsigned long here and avoid all the difficulties, yet
>> generate the same machine code.
> It takes about 2 seconds to change it as you describe. As such,
> it's not a big deal. It's not as if he has hundreds of variables,
> all of types you don't like.

No, but to me at least, it is an indication of the code quality.
It means that everything in it needs to be closely examined. It
takes so little effort to avoid this particular trap in the first
place, provided one is aware of it. This returns to the possible
explanations of laziness or incompetence. It doesn't mean that the
idea is bad or unsound.

>> I also suspect an alignment problem because the data passed in is
>> a char array, yet it is treated as if aligned for an int. This is
>> not good.
> That's true. I'd like to see a test with random strings, with
> them intentionally "misaligned", on a sparc or something and see
> what sort of sparks fly out. Unfortunately, I don't have one
> handy.
>> When you get down to the string lengths that usually appear as
>> keys, the end effects and overhead have a serious effect. I found
>> that the times31 (or 33) system was faster at a string length of
>> 5.
> And I found that it was slower at 12. That's pretty contrived, for
> most applications anyway. Neither is slow. In fact, 31/33 is
> *slightly* faster for very short strings, and considerably slower
> for long strings, so on average, SFH bakes it in the performance
> category.

Will it if the alignment problem is handled? I don't know.

When you are pushing performance, the only thing that counts is the
statistics on the data you are actually using. I said this several
times in several forms. My opinion is that the use of long strings
as keys is relatively rare.

>> Nobody has bothered with my statement that the hash distribution
>> over 32 bit values is of no interest, what is of interest is its
>> distribution over the hash table size.
> Well, when you have collisions in the hash value itself, as seen with
> 31/33, that's not going to change when you % that into a table (or
> whatever method you use). I forget the exact detail, but I think it
> was something really ugly, like "Ab" and "aB" hashing to the same
> value. Also, I noticed that some long, sentence-length strings
> differing in only one or two characters hashed to very nearly
> the same value, which wasn't expected.

Hashing to 'nearly the same' value is no problem for hash table
use. I can conceive of cases where it is especially desirable. At
the same time hash tables expect hash collisions, and deal with
them. The worst hash function I can conceive of is "h = 1;", and
that is one of the tests applied to hashlib. It survives, but
performance is somewhat degraded for more than two entries :-)

>> real data, and I mentioned some ideas about designing that
>> experiment. I also offered some code useful in building and
>> measuring such experiments.
> I definitely think it's worth pursuing further. If for no other
> reason than that it's interesting to play with. Or perhaps try to
> cross-pollinate the two. :-)
>> I just don't know whether SFH is a useful idea or not. I do know
>> that there are obvious weaknesses in the published code. These
>> need to be handled before serious tests are done.
> Well, it's a short enough function that it shouldn't be very hard
> to get it like you want it.
>> There are no such obvious weaknesses in the times 31 or 33 system.
> Apart from the major weakness of being easy to find collisions
> using it perhaps.

I don't concede that fact. That awaits tests on real data.

>> To handle nul terminated strings SFH as presently coded will have to
>> be combined with a call to strlen, which certainly increases its
>> running time.
> By not using nul termination, his function will work with unicode
> strings, or any other arbitrary data that you wish to hash, and
> 31/33 will not. So if you are really worried about portability, then
> you need to be concerned about Unicode strings. Also, I suspect
> his bstring library carries the length around with it (I can't
> remember for sure, been a while since I looked) so it comes along
> with the deal. It wouldn't be hard to take the SFH approach and
> make it work on nul terminated strings only if you feel that is
> an absolute requirement.

I readily concede that my outlook is insular. I have never built
anything that used Unicode. However, I thought that normal Unicode
representations were arranged such that no zero bytes ever appear
in the strings.

>> As far as the efficacy of those routines, I have only the word of
>> Kernighan and Pike in "The Practice of Programming". Their opinions
>> have been fairly sound in the past.
> If that was the case, then nothing invented since it was published
> would be worth investigating.

I thought I recommended investigating. I consider putting forth
thoughts about how to make such an investigation a fairly clear
invitation to others to jump in. A quote from K & P is:
"Experiments show that for a wide variety of strings it's hard to
construct a hash function that does appreciably better than the one
above, but it's easy to make one that does worse." This is a bit
stronger than just an opinion.

>> Such applications as my version of Markov have not been obviously
>> harmed by their use.
> Not being harmed is not the same as "can't be improved upon". It
> may turn out that in real world testing, there is no difference
> between them, or one of them outright slays the other.

Where did I ever make such a claim? I will not concede that an
improvement that is totally masked by I/O is worthwhile.

"If you want to post a followup via, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
 "show options" at the top of the article, then click on the 
 "Reply" at the bottom of the article headers." - Keith Thompson