Re: F<utf8.pm> is evil (was: XML::LibXML UTF-8 toString() -vs- nodeValue())



On 2009-04-22, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-22 00:32, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
On 2009-04-20, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-17 00:23, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
On 2009-04-15, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-14 23:45, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
*SKIP*
* utf8::upgrade and utf8::downgrade aren't symmetric.

I've noted, but... F<encoding.pm> is wrong exactly how?

It isn't "wrong". It is documented. But it is surprising and illogical
behaviour. A source of subtle bugs because the programmer most likely
won't think of that. And I think encoding.pm is full of such crannies. I
learned to avoid it pretty quickly.

OK, let's leave it as a point to unnamed F<encoding.pm>'s dragons.

*SKIP*
So F<utf8.pm> utfizes symbols by accident. At least that wasn't an
intention.

I'm not sure what you mean by "utfize", but if you mean: "Symbols can
contain all Unicode letters and digits, not just A-Z, a-z, 0-9", then
that's quite intentional, not an accident. But it is a logical
consequence of interpreting the source code as a sequence of Unicode
characters instead of a sequence of ASCII characters. So, as a
programmer you don't have to remember that "use utf8" decodes string
constants from UTF-8 *and* that it allows all Unicode letters and digits
in symbols *and* that the DATA stream has an ':encoding(utf8)' layer

Which seems to be undocumented, BTW. However, after your explanation, I
think that it can't be any other way.

*and* whatever else may be affected. You have to remember one thing
only: Your source code consists of Unicode characters encoded in UTF-8
(or UTF-EBCDIC). Period. Nothing else. Clean and simple.

My point wasn't about what to remember. It's about "doing one thing". I
think that neither F<utf8.pm> nor F<encoding.pm> does one thing.

*SKIP*
So you suggest that localizing Perl (or actually any other language) is
some kind of conspiracy by online dictionary providers?

No, not at all. What I am saying is that people will use localized
variable and subroutine names, and write comments in their native
language, no matter if the programming language makes it easy or not.
Sometimes this is because they don't speak English too well, sometimes
it's because the problem domain is language-specific (for example, when

Then they must. I don't say "should". I'm unaware of any other
language that fits nicely in 7 bits. Calling it "US-ASCII" is a pure accident.

You are contradicting yourself. First you say that English is the only
language that fits nicely into US-ASCII, then you say that US-ASCII is
called US-ASCII by accident. It isn't. US-ASCII was developed by an
American institute to write English texts. It is no accident at all that
it only contains characters which are frequently used in English
(technical) texts. And it is no accident that it is called ASCII -
"American Standard Code for Information Interchange". The US- in front
is somewhat redundant, but there were a lot of variants of ASCII (e.g., the
ISO-646 encodings), so that serves as a reminder that this is indeed the
original American version of the American code.

(Maybe I wasn't verbose enough this time.) English fits in a 7-bit
encoding, whatever encoding it would have been. It could have been any other
encoding (I did some reading about ASCII history (yes, I know wikipedia
is a vague source)). It could not have been any other language.

That seemingly contradicts my point of having an option. Yes, but there
must be something common to all. By accident, it's English.

Yes. English. Not ASCII. If you write Russian in ASCII I understand it
just as little as if you write it in Cyrillic.

If you can write your programs in English, please do. Especially if you

That "if" (the latter one) is somewhat offensive.

plan to make it open source. Almost every programmer in the world has at

That "open" is somewhat offensive.

least a basic grasp of English. But if for some reason you have to write

"Quotation needed (tm)". Or define "programmer".

in Russian, then I think you should use the Cyrillic alphabet, not the
Latin alphabet. That will make it easier for those who understand
Russian and even for those who don't (because then at least they can
paste the stuff into an online dictionary and get a translation).

I beg to differ. I have no problem understanding what code does
(compared to what it was supposed to do as described in comments and
symbol names) as long as any reasonable block of code fits on the screen.
When it doesn't, I become way slower.

*SKIP*
My point isn't language mix; I have no problem with this.

I have. A program where all the identifiers, comments etc. are written
in Portuguese or Polish is hard to figure out if you don't speak the
language. That they use the Latin alphabet doesn't help much (except
that I have an intuitive (though very probably wrong) idea how to
pronounce them).

And here we have another difference between us. I look inside others'
code mostly when I have problems with it, and sometimes when
documentation is incomplete, or seemingly wrong, or there's no
documentation at all. I don't look inside out of pure curiosity. And
you know what? I bet you know. There are no comments.

So if someday I stumble over comments written in Chinese, or Turkish, or
Portuguese, or whatever else, I'll just pretend there are no comments (as
ever). But I'm trying to imagine what I would do if the code were
written in Korean. Maybe someday, when F<utf8.pm> makes its way
to the masses.

OK, read this (that depends on your context of course, it's possible
you would get it even without I<-Mstrict> or I<-Mwarnings>):

perl -Mutf8 -le '
print "vvv";
@OEM = qw/ 1 2 3 /;
print "@ОЕМ";
print "^^^";'
vvv

^^^


*SKIP*
I bet you've seen this before,

I've seen German versions of BASIC in the 1980's. They weren't a huge
success, to put it mildly. About the only successful localization of a
programming language I can think of is MS-Excel (and I guess it's
debatable if this is even a programming language (without VBA) - is it
Turing-complete?).

That's in case you have an option. There are places where you have no option.

I don't understand your objection. I was relating the historic fact that
localized programming languages (i.e., programming languages where the
keywords (and to a lesser extent, the grammar) were localized, so that
you would write "wenn ... dann ... sonst ..." instead of "if ... then
... else") were a failure. People had the option to use them, but they
didn't want to.

Then read it again (maybe my English failed this time, again). Those
provided with germanized (is that the right word?) languages had an option.
The option to reject them. Sometimes there's no option. You just don't know
what it means to have no option.

I didn't mean that every single programmer had this option. If you work
in a shop which writes software in FORTRAN-77 (I was talking about the
1980's, remember) you don't have the option to choose your programming
language. C or COBOL is just as unavailable as a germanized version of
FORTRAN.

But the industry as a whole had the option, and it rejected it (with the
single exception of spreadsheet formula languages). The industry still

Watch what you're saying. Industry, community, society, whatever isn't
just a mix of protoplasm.

has the option. There are new scripting languages all the time, and
every few years one of them becomes really popular. So introducing
new languages in general is still possible. But all the popular
programming languages are based on English. Obviously there is no need
to localize the few dozen keywords - even if you don't speak English at
all, learning what "if" and "sub" do is not a problem (and the latter
isn't a proper English word anyway, so the English speaker has to learn
it as well).

(I'm still unclear.) I know what it means to have no option. At
all.

p.s. Are we still talking Perl?


--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom