Reading a UTF-8 string from a file with the read() function

From: Sergei (sergeisn-tma_at_yahoo.com)
Date: 31 Aug 2004 09:38:54 -0700

Hi,
I need to read a string from a UTF-8 encoded text file.
I know the byte position at which the string starts and its length
(also in bytes).
The problem is that read(FILEHANDLE, SCALAR, LENGTH) takes LENGTH in
characters, not bytes, when the file is opened with a UTF-8 layer.
I've tried opening the file in binary mode instead, so that I can
read the correct number of bytes, but then I can't process the string
correctly with regular expressions, because Perl treats it as raw
bytes rather than UTF-8 text.
I've also tried reading the string one character at a time with
getc(), but that is unacceptably slow.
Is there any solution?
Thanks a lot,
--Sergei
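
One possible approach, sketched here and not taken from the original thread, is to keep the binary-mode read and then decode the bytes explicitly with the core Encode module, so that regular expressions see characters. The helper name read_utf8_span, the sample text, and the offsets are hypothetical and only there to make the sketch runnable.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical helper: read $length bytes starting at byte $offset
# and return them decoded into a Perl character string.
sub read_utf8_span {
    my ($path, $offset, $length) = @_;

    # ':raw' applies no encoding layer, so seek() and read() count bytes.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";

    my $got = read($fh, my $bytes, $length);
    die "Cannot read $path: $!" unless defined $got;
    die "Short read: wanted $length bytes, got $got" unless $got == $length;
    close($fh);

    # Convert the raw bytes to characters; FB_CROAK dies on malformed UTF-8.
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Demo on a throwaway file (assumed sample data, written as UTF-8).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "caf\x{e9} na\x{ef}ve r\x{e9}sum\x{e9}\n";
close($tmp_fh);

# The second word starts at byte 6 and is 6 bytes long (5 characters).
my $word = read_utf8_span($tmp_path, 6, 6);
printf "%d characters\n", length $word;    # prints "5 characters"
```

This assumes the byte offset and length fall on character boundaries. If they do not, the decode step dies instead of silently returning damaged text.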


