Re: UTF8 to Unicode conversion

From: Joe Smith (Joe.Smith_at_inwap.com)
Date: 07/31/04

    Date: Sat, 31 Jul 2004 04:59:27 GMT
    
    

    Spamtrap wrote:

    > Ok let me try to redefine the problem.
    >
    > I have a text file, [ in Windows 98], which by definition is in plain
    > 256 character ASCII. When I view it I see EspaÃ±ol - which I assumed
    > was originally UTF8 - but I want to see Español [which of course
    > could exist in ASCII, without even having to go to Unicode or anything
    > fancy] so the encoding is using the two characters Ã± for the single
    > character ñ

    ASCII is only 128 characters. Character codes 128 to 255 can be
       1) ISO-8859-1 (the Latin-1 alphabet), for western European languages.
       2) Some Microsoft CP (code page). There are many.
       3) Special bit patterns used in the UTF8 encoding scheme.
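To make the three readings concrete, here is a minimal sketch that decodes one byte string under each assumption. It relies only on the core Encode module; the sample bytes and the variable names are illustrative, not taken from the original file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes a UTF-8 file holds for "Español": ñ is stored as 0xC3 0xB1.
my $bytes = "Espa\xC3\xB1ol";

# Decode the same bytes under three different assumptions.
my %as = map { $_ => decode($_, $bytes) } qw(iso-8859-1 cp1252 UTF-8);

# Under the single-byte readings every byte becomes its own character;
# under UTF-8 the pair 0xC3 0xB1 collapses into the single character ñ.
printf "%-10s %d characters\n", $_, length $as{$_} for sort keys %as;
```

The single-byte readings are what produce the two-character "Ã±" on screen.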

    For Español, all you need is a UTF8-to-ISO8859 conversion utility.
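Such a utility can be sketched in a few lines of Perl with the core Encode module. The helper name utf8_to_latin1 is made up for illustration, and the strict FB_CROAK error handling is one possible choice, not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a UTF-8 byte string as ISO-8859-1.
# FB_CROAK makes both steps die on malformed input or on characters
# that Latin-1 cannot represent, instead of silently substituting "?".
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes, Encode::FB_CROAK);
    return encode('iso-8859-1', $chars, Encode::FB_CROAK);
}

my $latin1 = utf8_to_latin1("Espa\xC3\xB1ol");
printf "%vX\n", $latin1;   # dotted hex value of each output byte
```

Dying on unmappable characters is deliberate here: text such as Russian would otherwise be turned into question marks without any warning.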

    > The data from that text file is being imported into a database [this
    > part is not Perl programming]. When I display the data, it displays
    > EspaÃ±ol not Español

    That means that whatever program you are using to display the data
    does not understand UTF8. There are terminal emulators and command
    consoles that do understand UTF8.

    > Then a program will manipulate that database and create a Microsoft
    > Word document [or possibly an Adobe PDF document] and I assume the
    > text will continue to be incorrect. Therefore I want to use Perl to
    > fix that text data before I do the other processing.

    You could try playing around with
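            # The three-argument form of open needs '<' or '>' in the mode.
            open IN, '<:utf8', $input_file or die "Can't read $input_file: $!";
            open OUT, '>:crlf', $output_file or die "Can't write $output_file: $!";
            # No encoding layer on OUT: characters below 256 go out as Latin-1 bytes.
            print OUT <IN>;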

    > I also have things like СубъеР- which is supposed to be Russian
    > and judeţul which is Romanian.

    Russian characters simply cannot be displayed in ASCII or ISO-8859-1.
    ISO-8859-5 has Cyrillic, but not western European accented characters.
    Read http://czyborra.com/charsets/iso8859.html (or Google's cache).
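Before converting, it can help to find out which strings contain characters that ISO-8859-1 cannot hold at all. The following is only a sketch; the helper name not_in_latin1 and the sample strings are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list the code points in a UTF-8 byte string
# that lie above U+00FF and so have no ISO-8859-1 representation.
sub not_in_latin1 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    return map { sprintf('U+%04X', ord($_)) } $chars =~ /([^\x00-\xFF])/g;
}

print join(' ', not_in_latin1("jude\xC5\xA3ul")), "\n";   # Romanian sample
print join(' ', not_in_latin1("Espa\xC3\xB1ol")), "\n";   # Spanish sample
```

Strings for which the list is empty can be converted to Latin-1 safely; the rest need either a different 8-bit charset or Unicode end to end.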

    > It is possible I might have to maintain 2 copies of the strings in the
    > database tables, one as an ASCII close match for display purposes,
    > [since the database will not support UNICODE directly] and one as
    > actual UNICODE for passing into Word.

    The major databases do support Unicode directly. Often it is as simple
    as exporting the database to a flat file, defining a new database
    with UTF8 enabled, and importing the data. You will have to ask the
    DBA to perform this operation.
            -Joe

