Re: UTF-8 without external modules on Perl 5.0



In article <1f3p4e.vp7.ln@xxxxxxxxxxx>, hjp-usenet2@xxxxxx says...
> perl 5.005 also doesn't know about wide characters. A character is a
> byte, so there is no way to have a character outside of the range
> 0..255. So you don't need any decoding routines because you couldn't
> decode a euro sign anyway :-).

Hmm, indeed, I hadn't realized all the aspects of this charset
problem. My first idea was to do the following:

1/ State (simply in a comment) that the strings in the code (the
configurable ones I mentioned and the others I wrote) must use only
characters from an iso-8859-* table.

2/ Declare a charset of utf-8 for the generated HTML pages and convert
everything to utf-8 before printing it to the browser.

3/ Treat anything that comes from HTML forms as utf-8 and convert it
back to iso-8859-* immediately on receipt, to stay Perl 5.00503
compliant.

For this I found a pure Perl module called Unicode::UTF8simple
containing to/from conversion subs that I could copy/paste into my own
script (crediting the original author in the header, of course)... But,
as you point out: how do I convert from UTF-8 to an iso-8859-* charset
when the given UTF-8 character (like the euro sign) is not representable
in the target charset?

What do you think? Is this approach definitely out, or is there a
workaround?
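
To make steps 2/ and 3/ concrete, here is a minimal sketch of the two conversions, written with byte operations only. It is not the code from Unicode::UTF8simple; the sub names and the '?' substitute are made up for illustration. It assumes plain iso-8859-1, where the byte value equals the Unicode code point; any other iso-8859-* table would need a lookup table for the positions that differ. One possible workaround for unrepresentable characters is shown: anything that does not fit in the target charset is replaced by a substitute character.

```perl
use strict;

# iso-8859-1 bytes -> UTF-8 bytes.
# Every byte in 0x80..0xFF becomes a two-byte UTF-8 sequence.
sub latin1_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F))/ge;
    return $s;
}

# UTF-8 bytes -> iso-8859-1 bytes.
# Only the two-byte sequences starting with C2 or C3 (U+0080..U+00FF)
# are representable. Longer sequences (the euro sign, for example) and
# malformed bytes are replaced by $subst (hypothetical default: '?').
# The three cases are handled in one pass so that a byte produced by
# one replacement is never re-read as the start of another sequence.
sub utf8_to_latin1 {
    my ($s, $subst) = @_;
    $subst = '?' unless defined $subst;
    $s =~ s{([\xC2\xC3])([\x80-\xBF])|[\xC4-\xF7][\x80-\xBF]*|[\x80-\xFF]}
           {defined $1 ? chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F)) : $subst}ge;
    return $s;
}
```

With these, utf8_to_latin1("caf\xC3\xA9") gives "caf\xE9", and a string containing the euro sign ("\xE2\x82\xAC") comes back with a single '?' in its place. Whether a silent '?' is acceptable, or whether the form input should be rejected instead, is a design decision this sketch leaves open.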

> So if you need to work with unicode strings in perl 5.005, the best way
> is probably to work with raw UTF-8-encoded strings. That means that a...

Reading your list of needed changes, I'm not very keen to head into
this nightmare. Well, maybe I could develop two versions:

- One for Perl 5.00503, with a solution I haven't found yet: it depends
largely on your reply about whether (and how) I can use the UTF8simple
converter above, or your iso-8859-15 solution below.

- A more advanced one for more recent interpreters. So, just a question:
how simple is it in these Perl releases? Do I just have to write
"use utf8;" and that's all? It's not clear in my mind.
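
For the more recent interpreters, here is a small sketch, assuming Perl 5.8 or later, where the Encode module ships with the core. As far as I understand it, "use utf8;" only tells perl that the script's own source text is written in UTF-8; data coming in from a form or going out to the browser still has to be decoded and encoded explicitly. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would arrive from an HTML form submitted as UTF-8:
# "5 " followed by the euro sign (bytes E2 82 AC).
my $bytes = "5 \xE2\x82\xAC";

# decode() turns the raw bytes into a character string.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes before decoding and characters after it.
my $n_bytes = length $bytes;    # 5
my $n_chars = length $chars;    # 3

# encode() goes back to bytes in whatever charset the output needs.
my $latin9 = encode('iso-8859-15', $chars);    # euro is the single byte 0xA4
my $utf8   = encode('UTF-8', $chars);          # same bytes as $bytes
```

So it is a little more than a single pragma: decode on input, work with character strings, encode on output.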

> This is a quite silly policy: If you can do something stupid or harmful
> with a module, you can do the same thing with a script.
> But I know that some sites have such a policy, and they probably won't
> change it, so you're probably stuck with it.

The reason is very simple: most of the developers who work on these
servers are PHP people, and they have convinced the management of this
company to favor PHP *against* Perl. So, not silly, but wicked toward
the other developers who have to use Perl one day or another!

> If you only need English and French (and won't be needing Czech next
> year because your company opens a branch office in Prague) you are
> probably better off using an 8-bit character set which covers those two
> languages. ISO-8859-15 and Windows-1252 come immediately to mind.

Yes, we will only target English and French. Even if the pages are
accessed by people from countries where these are not native languages,
they will type their input in these two languages (and will read in
these two languages too, of course). So indeed, a single-byte charset
could be something that makes me happy... if it really is the right
choice! So, two questions:

1/ I found some people (in newsgroups and on the web) who said
iso-8859-15 was not a good choice, but I don't know exactly why. What
could be wrong with this charset?

2/ Windows-1252 seems not to be chosen often: why? Because of the
"Windows" in its name?

> Where do people edit these strings? Directly on the server? Or do they
> edit the file on their Windows machine and then upload it to the server
> via FTP (or whatever)?

Both :-( These scripts will be edited under Win32 and Unix flavors, and
will run under Win32 and Unix flavors.

Thank you for your help, Peter; things are a little less confusing
after your post.