Re: Converting "’" to an Apostrophe?



RedGrittyBrick wrote:

maria wrote:
I am using a CGI program to read XML files and extract their various
items. Somehow, my program converts the apostrophe "’" to ...
"\â\€\™".

It's more likely your browser is doing this than your CGI program, probably because your program lied about the character set/encoding.


How do I program my CGI program to convert "’" to
an apostrophe, "'"?

You shouldn't.


Is there a little CGI code that will convert
all these different strings (including dagger, ellipsis, euro symbol, double quote, etc.) to their ASCII equivalents?

No, because dagger, ellipsis and euro don't have ASCII equivalents!


Unicode code point U+2019 is represented in UTF-8 as the byte sequence e2 80 99 (shown here in hexadecimal). That same byte sequence, when interpreted as Windows-1252 (which is what browsers typically use when told a page is Latin-1), is the three characters â€™ (a circumflex, euro, trademark).
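If you want to see that byte-level mix-up for yourself, here is a minimal sketch using the core Encode module. The variable names are just mine for illustration, and it assumes the wrong decoding is Windows-1252 (called 'cp1252' by Encode), which is what browsers commonly apply; your situation may differ.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+2019 RIGHT SINGLE QUOTATION MARK as a one-character Perl string
my $quote = "\x{2019}";

# Encode it to UTF-8 bytes and list them in hexadecimal
my $bytes = encode('UTF-8', $quote);
my $hex   = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "UTF-8 bytes: $hex\n";

# Misread the same bytes as Windows-1252 (an assumption about what went wrong)
my $wrong      = decode('cp1252', $bytes);
my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $wrong;
print "Misread as cp1252: $codepoints\n";

# Read the same bytes with the correct encoding
my $right      = decode('UTF-8', $bytes);
my $right_code = sprintf 'U+%04X', ord $right;
print "Decoded as UTF-8: $right_code\n";
```

The misread code points are U+00E2, U+20AC and U+2122, i.e. a circumflex, euro and trademark. Nothing is wrong with the data itself; only the label on it is wrong.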

You can learn more about Perl's handling of Unicode by typing the command `perldoc perlunicode`.


Here's another example, but using XML instead of plain text. Perl has so many different modules for handling XML and CGI that it is unlikely my example matches your situation.

The following Perl file can be dropped into a CGI directory. The first line may need changing, depending on your OS, web server, etc.

--------------------------------- 8< ----------------------------------
#!perl
#
# Demonstrate handling of Unicode characters in a UTF8 encoded XML file
#
# RGB 2008-02-28
#
use strict;
use warnings;
use XML::Simple;
use CGI qw/:standard/;
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);

#
# First we write some Unicode to an XML file using UTF-8 encoding.
#
my $tempfile = "unicode.xml";
open (my $out, '>:encoding(UTF-8)', $tempfile) or die "can't open $tempfile because $!\n";
print $out <<ENDXML;
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>
<baz>Here is a Unicode RIGHT SINGLE QUOTATION MARK \x{2019}</baz>
</bar>
</foo>
ENDXML
close $out;

#
# Now we read our XML file and use it in a web-page
#
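my $foo = XMLin($tempfile);
my $line = $foo->{bar}->{baz};

# Encode output as UTF-8 so the wide character prints without a warning
binmode(STDOUT, ':encoding(UTF-8)');

print header(-charset=>'utf-8'), # NOTE - Default is NOT utf-8
start_html(), h1("Unicode example"), pre($line), hr(), end_html();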

--------------------------------- 8< ----------------------------------

In case it's not obvious, the only reason the example first writes a file is so that I don't have to include a separate example data file. The example is completely self-contained. I could have used a DATA section but felt that mishandling text file encodings might be part of your problem.