MultiByte Character Sets and False Matches

From: Brian Sweeney (brian_at_ssl.fujitsu.com)
Date: 06/28/04


Date: Mon, 28 Jun 2004 16:32:31 +0900
To: dbi-users@perl.org

Is there any support in the DBD package (or any workarounds) for
handling searches in MultiByte Character Set data?

The problem described below occurs with DBD::CSV and DBD::InterBase.
I have not had the resources to test other packages (e.g. DBD::Oracle).

Using DBD, I run a SELECT statement to find matches for Japanese
EUC strings in record fields that hold Japanese text.

For the most part it works OK, but there are also false matches.

In Perl (5.6) itself, regular expressions match byte by byte
rather than character by character.
The same seems to be true of the DBD packages mentioned above.

A little explanation:
Japanese has three character types:
(1) Hiragana (around 80 characters)
(2) Katakana (around 80 characters, used for foreign-based words)
(3) Kanji (thousands of characters, like pictograms)

Say that, in an EUC-encoded table "SAMPLE_TABLE", a record field
(for example field "TEXT_FIELD_1") holds a string that contains two
consecutive 2-byte Katakana characters
(Katakana character 1 = \xA5\xB9, Katakana character 2 = \xA5\xC8).

If I use a SELECT statement to find matches for the 2-byte Kanji
character "\xB9\xA5", it will match the above record,
i.e.

$search_str = "\xB9\xA5";

$sql_str = "SELECT REC_ID FROM SAMPLE_TABLE WHERE TEXT_FIELD_1 LIKE '%$search_str%'";

It will find a Kanji match in the middle 2 bytes of the above Katakana
character string, i.e. \xA5(\xB9\xA5)\xC8: the second byte of the first
Katakana character followed by the first byte of the second.
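
The false match can be reproduced without a database. The following is a minimal sketch using only the byte values given above; the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# Two consecutive EUC-JP Katakana characters, stored as raw bytes.
my $field = "\xA5\xB9\xA5\xC8";

# The 2-byte Kanji being searched for.
my $search = "\xB9\xA5";

# index() compares bytes, so it reports a hit at byte offset 1,
# which is the middle of the first Katakana character.
my $pos = index($field, $search);
print "byte offset of match: $pos\n";    # prints "byte offset of match: 1"

# A plain regular expression on the byte string behaves the same way.
print "regex matched\n" if $field =~ /\Q$search\E/;
```

An odd byte offset is the giveaway here: in a string made up only of 2-byte characters, a genuine match can only start at an even offset.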

The same problem occurs in Perl regular expressions when
using EUC strings.

The problems of false matching with MultiByte Character Sets
are explained further (probably much more clearly than in my explanation),
in English, at:
http://iis1.cps.unizar.es/Oreilly/perl/cookbook/ch06_19.htm

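Below is Perl code for handling regular expressions on EUC character set strings:

# Byte patterns that each match one complete EUC character
$ascii = '[\x00-\x7F]';
$twoBytes = '[\x8E\xA1-\xFE][\xA1-\xFE]';
$threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';

# Consume whole characters from the start of $str, so that $pattern
# can only match beginning on a character boundary
if ($str =~ /^(?:$ascii|$twoBytes|$threeBytes)*?(?:$pattern)/) {
  print "Found\n";
}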
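
One possible workaround, absent support in the driver, might be to let the byte-wise LIKE over-match and then re-check each returned row in Perl with the character-aligned pattern. The sketch below assumes this approach; euc_contains is a hypothetical helper name, and the hard-coded @rows stand in for what a call such as $dbh->selectall_arrayref($sql_str) would return.

```perl
use strict;
use warnings;

# Byte patterns that each match one complete EUC character.
my $ascii      = '[\x00-\x7F]';
my $twoBytes   = '[\x8E\xA1-\xFE][\xA1-\xFE]';
my $threeBytes = '\x8F[\xA1-\xFE][\xA1-\xFE]';

# Hypothetical helper: returns 1 only if $needle occurs in $haystack
# starting on a character boundary, otherwise 0.
sub euc_contains {
    my ($haystack, $needle) = @_;
    return $haystack =~ /^(?:$ascii|$twoBytes|$threeBytes)*?\Q$needle\E/ ? 1 : 0;
}

# Stand-in for rows returned by the byte-wise LIKE query,
# each row being [REC_ID, TEXT_FIELD_1].
my @rows = (
    [1, "\xA5\xB9\xA5\xC8"],            # two Katakana only (false match)
    [2, "\xA5\xB9\xB9\xA5\xA5\xC8"],    # Katakana, the Kanji, Katakana
);

my $search = "\xB9\xA5";

# Keep only the rows where the Kanji starts on a character boundary.
my @hits = map { $_->[0] } grep { euc_contains($_->[1], $search) } @rows;
print "matching REC_IDs: @hits\n";    # prints "matching REC_IDs: 2"
```

This keeps the query itself unchanged and costs one extra regular expression match per candidate row, so it is only a filter over the rows the database has already returned.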

--------------------------------------------
Brian Sweeney
mail:brian<AT>ssl<DOT>fujitsu<DOT>com


