Re: How to NOT use utf8.

From: pkaluski (pkaluski_at_piotrkaluski.com)
Date: 02/25/05


Date: Fri, 25 Feb 2005 21:21:54 +0100

Alan J. Flavell wrote:
>
> (...) I'd hoped for more detail so that the
> real problem could be understood...
>
> (...) you asked what is really a very complicated question -
> especially considering that it was almost entirely lacking any context
> in terms of problem domain, circumstances, external modules called,
> etcetera etcetera etcetera.
>
> If you're processing text, then you *need* to know what encoding has
> been used. If you're processing binary data, then you shouldn't be
> treating it as text. That's been my attitude since, well, around 1965
> I suppose it was, when I first grasped the difference, although I'd
> been doing it - in a sense - without realising the point, since I met
> my first computer in 1958.
>
>
> (...) I asked you several supplementary
> questions, to help in understanding the problem in its context - but
> which you have chosen - it seems - to ignore.
>
> good luck

OK. I can now provide you with some details.
I did not include details in my first post because the problem initially
appeared in a big script that I couldn't post: it was too big and used too many
modules. I had some indications that my problems were due to Unicode. So my
thought was: "OK, the easiest way would be to make perl work as if there were no
such thing as Unicode". Hence my question: is it possible to make perl totally
Unicode-unaware? Since my script is supposed to run under Windows, I added the
Windows part to my question in case there is something system-specific.

Now I can provide you with some details, since I managed to isolate the problem
and reproduce it in a smaller script.

The problem was that Carp::cluck was crashing my script. It crashed in a nasty,
uncontrolled way, so Windows was killing the process. More interestingly, this
happened only when running the script under the debugger (which is also scary:
if something fails under the debugger but works without it, that could be an
indication that something is terribly screwed).

When I tried to track down the problem, I found that one of the regular
expressions in the Carp::format_arg function, called by cluck, jumps into
another chunk of code. See below (I've included a call stack):

   DB<2>Carp::caller_info(C:/Perl/lib/Carp/Heavy.pm:62):
62: $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
   DB<2> s
utf8::SWASHNEW(C:/Perl/lib/utf8_heavy.pl:21):
21: my ($class, $type, $list, $minbits, $none) = @_;
   DB<3> T
$ = utf8::SWASHNEW('utf8', '', '# comment^J+utf8::IsCntrl^J', 1, 0) called from file `C:/Perl/lib/Carp/Heavy.pm' line 62
@ = Carp::format_arg('After value1') called from file `C:/Perl/lib/Carp/Heavy.pm' line 31
@ = Carp::caller_info(3) called from file `C:/Perl/lib/Carp/Heavy.pm' line 142
@ = Carp::ret_backtrace(2, 'After value1') called from file `C:/Perl/lib/Carp/Heavy.pm' line 125
@ = Carp::longmess_heavy('After value1') called from file `C:/Perl/lib/Carp.pm' line 235
@ = Carp::longmess('After value1') called from file `C:/Perl/lib/Carp.pm' line 272
. = Carp::cluck('After value1') called from file `test2.pl' line 11
   DB<12>

See? Stepping into the substitution operator takes me into the utf8 module. And
when stepping further I was getting messages about malformed UTF-8.
BTW, a comment in the Carp::format_arg function says:

(Carp/Heavy.pm)
59 # The following handling of "control chars" is direct from
60 # the original code - I think it is broken on Unicode though.
61 # Suggestions?
62 $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;

So the author suggests that there may be problems with Unicode, and he seems
to be right.
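
For readers who want to see what that substitution does in isolation, here is a
minimal sketch. The helper names escape_posix and escape_ranges are
hypothetical and are not part of Carp. The first uses the expression from
Heavy.pm line 62. The second uses an explicit range that matches the same set
of characters (everything outside printable ASCII, 0x20..0x7e) without naming a
POSIX class. Whether the second form avoids the utf8::SWASHNEW call on the
affected Perl build is an assumption that has not been verified here.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution exactly as written in Carp/Heavy.pm.
sub escape_posix {
    my ($arg) = @_;
    $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
    return $arg;
}

# Hypothetical helper: the same character set expressed as a plain range.
# Control characters are 0x00..0x1f and 0x7f; non-ASCII is everything
# above 0x7f. Together that is "anything not in 0x20..0x7e".
sub escape_ranges {
    my ($arg) = @_;
    $arg =~ s/([^\x20-\x7e])/sprintf("\\x{%x}", ord($1))/eg;
    return $arg;
}

print escape_ranges("After value1\n"), "\n";        # After value1\x{a}
print escape_ranges("caf\x{e9} \x{263a}"), "\n";    # caf\x{e9} \x{263a}

# Both forms should agree on every sample.
for my $sample ("After value1", "tab\there", "caf\x{e9}", "\x{263a}\x{7f}") {
    die "mismatch" unless escape_posix($sample) eq escape_ranges($sample);
}
```

The output is pure ASCII either way, which is the point of the original code:
the stack trace stays printable whatever the argument contains.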

The code snippet below makes perl crash (at least for me):

--- CODE STARTS ---
use strict;
use XML::Simple;
use Carp qw( cluck );

cluck "Before";

my $str = XMLin( "input.xml" );
my $msg = "After " . $str->{ 'tag1' }->{ 'attr1' };
cluck $msg;
--- CODE ENDS ---

The input.xml file is simple:

--- INPUT.XML STARTS ----
<opt>
     <tag1 attr1="value1"/>
</opt>
--- INPUT.XML ENDS ----

To reproduce the crash, you have to run perl under the debugger, like this:

##########################

M:\temp\unicode>perl -d test2.pl

Loading DB routines from perl5db.pl version 1.28
Editor support available.

Enter h or `h h' for help, or `perldoc perldebug' for more help.

main::(test2.pl:6): cluck "Before";
   DB<1> c
Before at test2.pl line 6
  at test2.pl line 6

M:\temp\unicode>

###############################

It didn't make it to the end. It crashed.

If I clear the UTF-8 flag on $msg, it works:

--- CODE STARTS ---
use strict;
use XML::Simple;
use Carp qw( cluck );

cluck "Before";

my $str = XMLin( "input.xml" );
my $msg = "After " . $str->{ 'tag1' }->{ 'attr1' };
require Encode;
Encode::_utf8_off( $msg );
cluck $msg;
--- CODE ENDS ---

Of course I have tried all this stuff with PERLIO=:bytes.

After these experiments I think I can state my first question more clearly (I
hope): can you make perl totally unaware of such a thing as Unicode?

And I believe that the answer is: you can't. Perl has Unicode support in its
guts. The only things you can manipulate are:

* You can make perl treat Unicode as bytes during reading and writing (via
PERLIO and some pragmas)
* You can reset the UTF-8 flag in a string.
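
The two knobs in the list above can be sketched as follows. This is an
illustration under assumptions, not code from the failing script: the sample
string and variable names are made up, an in-memory filehandle stands in for a
real file, and utf8::encode is used as a documented core alternative to
Encode::_utf8_off.

```perl
use strict;
use warnings;

# The UTF-8 encoding of "cafe" with an acute accent, plus a newline,
# written out as raw octets (assumed sample data).
my $octets = "caf\xc3\xa9\n";

# Knob 1a: a byte-oriented I/O layer. Nothing is decoded, so each
# octet is one character and the UTF-8 flag stays off.
open(my $raw, '<:raw', \$octets) or die "open failed: $!";
my $as_bytes = <$raw>;
close $raw;

# Knob 1b: a decoding layer. The two octets c3 a9 become one
# character and the resulting string carries the UTF-8 flag.
open(my $dec, '<:encoding(UTF-8)', \$octets) or die "open failed: $!";
my $as_chars = <$dec>;
close $dec;

printf "raw:     length=%d flag=%d\n",
    length($as_bytes), utf8::is_utf8($as_bytes) ? 1 : 0;   # length=6 flag=0
printf "decoded: length=%d flag=%d\n",
    length($as_chars), utf8::is_utf8($as_chars) ? 1 : 0;   # length=5 flag=1

# Knob 2: turn the flagged string back into octets. utf8::encode
# replaces the characters with their UTF-8 octets and clears the flag.
utf8::encode($as_chars);
print "same octets again\n" if $as_chars eq $as_bytes;
```

Note that neither knob removes Unicode from perl. They only decide whether a
given string is seen as characters or as octets at that point in the program.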

But if you are about to write something bigger, using many modules, then Alan is
right: it is more efficient to adapt your code to Unicode than to avoid it.

To avoid it you would have to check every string produced by any module and
downgrade it to bytes. This approach is infeasible even for medium-sized
projects.
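
To give an idea of what "control each string" means in practice, here is a
rough sketch for a single nested structure such as the one XMLin returns. The
function name downgrade_all is hypothetical, and the sample data only imitates
the shape of the XML::Simple result. It ignores hash keys, objects, and code
references, and it would have to be called on the return value of every module
that might hand back flagged strings.

```perl
use strict;
use warnings;

# Hypothetical helper: walk nested hashes and arrays and downgrade every
# plain scalar in place. Working on $_[0] (an alias to the caller's
# value) is what makes the change visible in the original structure.
sub downgrade_all {
    my $type = ref $_[0];
    if    ($type eq 'HASH')  { downgrade_all($_) for values %{ $_[0] } }
    elsif ($type eq 'ARRAY') { downgrade_all($_) for @{ $_[0] } }
    elsif (!$type && defined $_[0]) {
        # Second argument true: return false instead of dying when the
        # string holds characters above 0xff and cannot be downgraded.
        utf8::downgrade($_[0], 1);
    }
    return;
}

# Assumed stand-in for the XMLin result, with the flag forced on.
my $data = { tag1 => { attr1 => 'value1' }, list => [ 'a', "\x{263a}" ] };
utf8::upgrade($data->{tag1}{attr1});

downgrade_all($data);

print utf8::is_utf8($data->{tag1}{attr1}) ? "still flagged\n" : "downgraded\n";
print utf8::is_utf8($data->{list}[1])     ? "wide char kept\n" : "unexpected\n";
```

Even this small walker has gaps, which supports the point above: chasing
every string by hand does not scale.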

In the scripts above, XML::Simple returns Unicode strings (even if Unicode is
not needed and PERLIO=:bytes is set).

Is my reasoning correct?
And what is wrong with this regular expression used indirectly by cluck, that it
makes perl crash?

-- 
Piotr Kaluski
"It is the commitment of the individuals to excellence,
their mastery of the tools of their crafts, and their
ability to work together that makes the product, not rules."
("Testing Computer Software" by Cem Kaner, Jack Falk, Hung Quoc Nguyen)

