Re: length in (utf8) characters ?



Peter Billam wrote:
I'm confused... in "perldoc length" it says

if the EXPR is in Unicode, you will get the
number of characters, not the number of bytes.

which is what I would want. But (in a one-line demo
of a problem I have in a much larger module):

$> perl -e '$l=length "ö"; print "length=$l\n";'
length=2

But I want to see length=1 here... (in case your news-client
doesn't do utf8, that string was an o-umlaut) I'm using v5.10.1
on debian squeeze and everything else works fine in utf8.



My newsreader uses ISO-8859-1, not UTF-8, so when I paste the character
it arrives as a single byte. With that encoding it works fine:

5.8.8 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1

5.10.0 gives:
perl -e '$s="ö"; $l=length $s; print "length $s =$l\n";'
length ö =1
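
The difference between the two results is probably the encoding of the
input rather than the Perl version. In ISO-8859-1 the o-umlaut is one byte,
while in UTF-8 it is two, and without further instructions Perl counts the
bytes of the string literal. Below is a minimal sketch of that, assuming the
original poster's terminal sends UTF-8. The bytes are written as hex escapes
so the script does not depend on the encoding of the file it is saved in, and
the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two UTF-8 bytes for o-umlaut (U+00F6), written as escapes.
# This stands in for what a UTF-8 terminal passes to perl -e.
my $bytes = "\xC3\xB6";

# Undecoded, Perl treats the string as two separate bytes.
my $byte_len = length $bytes;

# Decoded from UTF-8, the same data is one character.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "bytes=$byte_len chars=$char_len\n";   # prints "bytes=2 chars=1"
```

For the one-liner itself, loading the utf8 pragma should have the same
effect on the literal, since it tells Perl the source text is UTF-8:
perl -Mutf8 -e '$l=length "ö"; print "length=$l\n";'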


