Re: Unicode in regexp
- From: gypark2@xxxxxxxxx
- Date: 21 May 2007 05:37:37 -0700
On May 21, 8:09 PM, patari <lassi.paavolai...@xxxxxxxxxx> wrote:
Hi,
I have some text that contains the Unicode character U+2013 (en dash),
for example:
PERFORMANCE – A COMPARATIVE STUDY
How can I find this character and change it to two "-" characters for
LaTeX?
Somehow the following code doesn't work, assuming that $str contains
the string mentioned above:
$str =~ s/\x{2013}/--/g;
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file opened with the UTF-8
layer?
Hello,
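Save your script in UTF-8 encoding and use the Unicode character
itself in the regexp, rather than the \x{....} form. This works when
$str holds undecoded UTF-8 bytes and the script does not "use utf8",
because the pattern and the string are then compared byte for byte:
$str =~ s/–/--/g; # The first character is U+2013 (en dash), not a hyphen.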
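If the editor's encoding is hard to control, roughly the same effect can be had by spelling out the UTF-8 bytes of U+2013 (E2 80 93) in the pattern. This is only a sketch: the helper name endash_bytes_to_latex is made up for illustration, and it assumes the input holds undecoded UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes $bytes holds undecoded UTF-8 octets,
# not a decoded Perl character string.
sub endash_bytes_to_latex {
    my ($bytes) = @_;
    # U+2013 is encoded in UTF-8 as the three octets E2 80 93.
    $bytes =~ s/\xE2\x80\x93/--/g;
    return $bytes;
}

# The en dash is written as raw bytes so this file's own encoding does not matter.
print endash_bytes_to_latex("PERFORMANCE \xE2\x80\x93 A COMPARATIVE STUDY"), "\n";
# prints: PERFORMANCE -- A COMPARATIVE STUDY
```

This would not match a string that has already been decoded, since there the en dash is a single character rather than three bytes.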
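Or,
decode it first, perform the substitution, and encode it back:
use Encode;
my $chars = decode("UTF-8", $str);  # UTF-8 bytes -> Perl character string
$chars =~ s/\x{2013}/--/g;          # \x{2013} now matches the en dash
$str = encode("UTF-8", $chars);     # character string -> UTF-8 bytes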
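The decode, substitute, encode round trip can be wrapped up as a small self-contained script. This is a minimal sketch: the helper name endash_to_latex is hypothetical, and it assumes the text arrives as UTF-8 encoded bytes (for example, read from a handle with no encoding layer).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded bytes, returns UTF-8 encoded bytes.
sub endash_to_latex {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> character string
    $chars =~ s/\x{2013}/--/g;             # en dash -> LaTeX "--"
    return encode('UTF-8', $chars);        # character string -> bytes
}

# U+2013 written as its raw UTF-8 bytes (E2 80 93), as undecoded input would hold it.
my $input = "PERFORMANCE \xE2\x80\x93 A COMPARATIVE STUDY";
print endash_to_latex($input), "\n";
# prints: PERFORMANCE -- A COMPARATIVE STUDY
```

Decoding once at the boundary also lets other \x{....} escapes and character classes behave as expected in later substitutions.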