Re: Unicode in regexp



On May 21, 8:09 pm, patari <lassi.paavolai...@xxxxxxxxxx> wrote:
Hi,

I have some text which contains the Unicode character U+2013 (en dash), for example:
PERFORMANCE – A COMPARATIVE STUDY

How can I find this character and change it to two - characters for
LaTeX?

Somehow the following code doesn't work, assuming that $str contains the
string mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to work
for text that is not read from a file opened as UTF-8?


Hello,

Save your script in UTF-8 encoding and use the Unicode character
itself, rather than the \x{....} form, in the regexp:

$str =~ s/–/--/g; # The first "–" is U+2013 (en dash), not a hyphen.

Or decode it first, perform the substitution, and encode it back:

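use Encode;
$chars = decode("UTF-8", $str);   # UTF-8 octets -> Perl characters
$chars =~ s/\x{2013}/--/g;        # U+2013 (en dash) -> "--"
$str = encode("UTF-8", $chars);   # characters -> UTF-8 octets (plain "=", not "=~")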
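
For what it's worth, here is a minimal self-contained sketch of the decode / substitute / encode round trip. It assumes $str holds raw UTF-8 octets (for example, text read without an I/O layer); the helper name en_dash_to_latex is made up for illustration and is not from any module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (name is an assumption, not part of Encode):
# takes UTF-8 encoded octets, returns UTF-8 encoded octets with every
# U+2013 (en dash) replaced by two ASCII hyphens for LaTeX.
sub en_dash_to_latex {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # octets -> character string
    $chars =~ s/\x{2013}/--/g;             # now \x{2013} can match
    return encode('UTF-8', $chars);        # characters -> octets
}

# "\xE2\x80\x93" is the UTF-8 byte sequence for U+2013.
my $str = "PERFORMANCE \xE2\x80\x93 A COMPARATIVE STUDY";
print en_dash_to_latex($str), "\n";
```

The original s/\x{2013}/--/g fails on undecoded input because the string then contains the three separate bytes E2 80 93 rather than the single character U+2013, so there is nothing for \x{2013} to match until the text has been decoded.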
