new String ( byte[] , encoding ) under the hood



I was curious how new String ( byte[], encoding ) could guess the
correct size of the char buffer it needs for the conversion.

It makes an estimate based on the number of bytes times the max number
of chars per byte, an attribute of the encoding. This will be slightly
on the high side if there are any multibyte chars, but accurate for
Latin-1. It then decodes, and calls trim, which uses System.arraycopy
to get a char[] the right size. The new String then does another
System.arraycopy.

You leave in your wake the original byte[], two char[] and the string.
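
Here is a minimal sketch of that estimate-then-trim pattern, written against the public java.nio.charset API rather than the JDK's internal code. The class DecodeEstimate and its methods estimate and decode are names I made up for illustration; the real constructor does this work internally and may differ between JDK versions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not the JDK's internal implementation.
class DecodeEstimate {

    // Worst-case number of chars the decoder could produce for byteCount bytes.
    static int estimate(Charset cs, int byteCount) {
        return (int) Math.ceil(byteCount * (double) cs.newDecoder().maxCharsPerByte());
    }

    // Decode into an oversized char[], then copy to an array of the exact size.
    static char[] decode(Charset cs, byte[] bytes) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        char[] oversized = new char[estimate(cs, bytes.length)];
        CharBuffer out = CharBuffer.wrap(oversized);
        decoder.decode(ByteBuffer.wrap(bytes), out, true);
        decoder.flush(out);
        // The "trim" step: this is the first of the two copies.
        return Arrays.copyOf(oversized, out.position());
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is 5 chars but 6 bytes in UTF-8 (the e-acute takes 2 bytes).
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println("estimated chars: " + estimate(StandardCharsets.UTF_8, bytes.length));
        System.out.println("actual chars:    " + decode(StandardCharsets.UTF_8, bytes).length);
    }
}
```

Passing the trimmed char[] to new String ( char[] ) would then make the second copy mentioned above.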

Going the other way, String -> byte[], uses similar logic, but the
buffer size estimate is not so fortunate. For UTF-8 it makes the
conservative assumption each char might need 3 bytes, making the buffer
3 times bigger than it needs to be in the ordinary case of mostly
ASCII text.
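
The encoding direction can be sketched the same way. Again this uses the public CharsetEncoder API, not the JDK internals, and EncodeEstimate, estimate and encode are hypothetical names of my own.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not the JDK's internal implementation.
class EncodeEstimate {

    // Worst-case number of bytes the encoder could produce for charCount chars.
    static int estimate(Charset cs, int charCount) {
        return (int) Math.ceil(charCount * (double) cs.newEncoder().maxBytesPerChar());
    }

    // Encode into an oversized byte[], then copy to an array of the exact size.
    static byte[] encode(Charset cs, String s) {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        byte[] oversized = new byte[estimate(cs, s.length())];
        ByteBuffer out = ByteBuffer.wrap(oversized);
        encoder.encode(CharBuffer.wrap(s), out, true);
        encoder.flush(out);
        return Arrays.copyOf(oversized, out.position());
    }

    public static void main(String[] args) {
        String s = "hello"; // pure ASCII: one byte per char in UTF-8
        System.out.println("estimated bytes: " + estimate(StandardCharsets.UTF_8, s.length()));
        System.out.println("actual bytes:    " + encode(StandardCharsets.UTF_8, s).length);
    }
}
```

For plain ASCII input the estimate comes out at three times the bytes actually written, which is the overallocation described above.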

Sun could streamline these operations to cut out the intermediate
objects.

Here's an idea. Why not allow strings, char arrays etc. to temporarily
be too big? They would be logically sized. Only on the next GC would
the objects get pruned to size, if need be. You would save a lot of
copying and object creation done just to get arrays of precisely the
correct size. There would be a method to prune an array to size that
just logically chopped it and marked it for later true pruning. Most of
the time, though, such objects will soon be discarded, and you then get
away without ever doing the copy.
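
The proposal is for the JVM and GC, so it cannot be written in user code. The nearest user-level analogy I can think of is a view with a logical length over an oversized array, which java.nio.CharBuffer already provides. The sketch below only illustrates "logically chopped, no copy"; the class LogicalSize and the method logicalView are hypothetical names, and nothing here gets pruned by the GC.

```java
import java.nio.CharBuffer;

// Hypothetical illustration class: a user-level analogy, not the GC mechanism proposed.
class LogicalSize {

    // Returns a view over the first `used` chars of `oversized`; no copy is made.
    static CharSequence logicalView(char[] oversized, int used) {
        return CharBuffer.wrap(oversized, 0, used);
    }

    public static void main(String[] args) {
        char[] oversized = new char[16];          // too big on purpose
        "hello".getChars(0, 5, oversized, 0);     // only 5 chars are in use
        CharSequence view = logicalView(oversized, 5);
        System.out.println(view.length());        // logical size, not 16
        System.out.println(view);                 // only the used chars
    }
}
```

The copy to the exact size is deferred until someone calls toString() on the view, and is skipped entirely if the view is discarded first.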


--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.


