Re: Converting "&#x2019;" to an Apostrophe?



RedGrittyBrick wrote:
maria wrote:
On Wed, 27 Feb 2008 22:45:02 -0500, "John W. Kennedy"
<jwkenne@xxxxxxxxxxxxx> wrote:

maria wrote:
I am using a CGI program to read XML files and extract their various
items. Somehow, my program converts the apostrophe "&#x2019;" to ...
"\â\€\™". How do I program my CGI program to convert "&#x2019;" to
an apostrophe, "'"? Is there a little CGI code that will convert
all these different strings (including dagger, ellipsis, euro symbol, double quote, etc.) to their ASCII equivalents?
Thank you very much.

maria
You have a serious misunderstanding that is much too complicated to explain here. Learn about Unicode.

The whole modern world is filled with people who feel compelled to
respond to other people's messages when they have absolutely nothing
to say.


Oh dear. Replying to perceived rudeness with more rudeness just puts off potential helpers.

John's reply *did* contain something useful to you.

AIUI John is pointing out that "\â\€\™" is your Unicode apostrophe encoded in UTF-8 but displayed using an incorrect encoding such as Latin-1.

Unicode code point U+2019 is represented in UTF-8 as the byte sequence e2 80 99 (shown here in hexadecimal). That same byte sequence, when interpreted as Windows-1252 (which is what browsers generally use when a page is labelled Latin-1), is the three characters â€™ (a circumflex, euro sign, trademark sign).
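
The byte-level claim above can be checked with a few lines of Perl. This is only a sketch using the core Encode module; the variable names are mine, and it assumes the wrong encoding in play is Windows-1252 (cp1252), which is the one that yields a euro sign and a trademark sign.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+2019 RIGHT SINGLE QUOTATION MARK as a one-character Perl string.
my $char = "\x{2019}";

# Encode the character as UTF-8; the result is a three-byte string.
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

# Misread the same bytes as Windows-1252, as a browser given the wrong
# charset would (assumption: cp1252 is the encoding being guessed).
my $wrong            = decode('cp1252', $bytes);
my @wrong_codepoints = map { sprintf 'U+%04X', ord($_) } split //, $wrong;

# Decode the bytes with the correct encoding to get the character back.
my $right = decode('UTF-8', $bytes);

print "UTF-8 bytes:    $hex\n";                 # e2 80 99
print "read as cp1252: @wrong_codepoints\n";    # U+00E2 U+20AC U+2122
printf "read as UTF-8:  U+%04X\n", ord($right); # U+2019
```

Nothing is wrong with the data in either case; only the label used to interpret the bytes differs.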

You can learn more about Perl's handling of Unicode by typing the command `perldoc perlunicode`.


It's a while since I've read the posting guidelines for this newsgroup but I'm pretty sure they suggest you include a short example program that demonstrates your problem. That would make it easier for people to help you identify what you are doing wrong.



#!perl
#
# Demonstrate handling of Unicode characters in a UTF-8 encoded file
#
# RGB 2008-02-28
#
use strict;
use warnings;
use CGI qw/:standard/;
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);

# Encode everything printed to STDOUT as UTF-8, to match the charset
# declared in the HTTP header below. Without this, Perl emits a
# "Wide character in print" warning.
binmode(STDOUT, ':encoding(UTF-8)');

#
# First we write some Unicode to a file using UTF-8 encoding.
#
my $tempfile = "unicode.txt";
open (my $out, '>:encoding(UTF-8)', $tempfile)
    or die "can't open $tempfile because $!\n";
print $out "Here is a Unicode RIGHT SINGLE QUOTATION MARK ->\x{2019}<-\n";
close $out;

#
# Now we read our UTF-8 encoded text file and use it in a web-page.
#
open (my $in, '<:encoding(UTF-8)', $tempfile)
    or die "can't open $tempfile because $!\n";
my $line = <$in>;
close $in;

print header(-charset=>'utf-8'), # NOTE - Default is NOT utf-8
    start_html(), h1("Unicode example"), p($line), hr(), end_html();
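
If plain ASCII output really is wanted even after the charset is fixed, a lookup table is one way to do it. This is a rough sketch, not a complete solution: the %ascii_for table and the to_ascii() function are names I made up for illustration, the choice of replacements is arbitrary, and it assumes the text has already been decoded into Perl character strings (for example by an XML parser or an :encoding(UTF-8) layer).

```perl
use strict;
use warnings;

# Hypothetical mapping table; extend it to cover the characters in your data.
my %ascii_for = (
    "\x{2018}" => "'",      # LEFT SINGLE QUOTATION MARK
    "\x{2019}" => "'",      # RIGHT SINGLE QUOTATION MARK
    "\x{201C}" => '"',      # LEFT DOUBLE QUOTATION MARK
    "\x{201D}" => '"',      # RIGHT DOUBLE QUOTATION MARK
    "\x{2026}" => '...',    # HORIZONTAL ELLIPSIS
    "\x{2020}" => '+',      # DAGGER
    "\x{20AC}" => 'EUR',    # EURO SIGN
);

# Build one character class from the table's keys.
my $class   = join '', keys %ascii_for;
my $pattern = qr/[$class]/;

# Replace every mapped character with its ASCII stand-in;
# anything not in the table is left untouched.
sub to_ascii {
    my ($text) = @_;
    $text =~ s/($pattern)/$ascii_for{$1}/g;
    return $text;
}

my $ascii = to_ascii("It\x{2019}s \x{201C}fine\x{201D}\x{2026}");
print "$ascii\n";    # It's "fine"...
```

Bear in mind that this throws information away. Serving the page as UTF-8, as in the program above, keeps the original characters intact.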


