Re: How to check variables for uniqueness ?



John Ersatznom wrote:

> > That is how equalsIgnoreCase() works:
> >
> > "beißen".equalsIgnoreCase("BEISSEN"): false
>
> Well, then, either Wong is completely nuts, or we're using different JDK
> versions (1.6 here),

You mean you've tried this and found that your version gives different
results? I find that hard to believe unless it's a side effect of attempting
to use non-ASCII characters in the input to javac. Try being explicit about
using the Unicode character (well, UTF-16 value).

public class Test
{
    public static void
    main(String[] args)
    {
        System.out.println("bei\u00DFen -> " +
                            "bei\u00DFen".toUpperCase());
        System.out.println("BEISSEN".equalsIgnoreCase("bei\u00DFen"));
        System.out.println("BEISSEN".equals("bei\u00DFen".toUpperCase()));

        // or equivalently, but using octal string escapes
        System.out.println("bei\337en -> " + "bei\337en".toUpperCase());
        System.out.println("BEISSEN".equalsIgnoreCase("bei\337en"));
        System.out.println("BEISSEN".equals("bei\337en".toUpperCase()));
    }
}


(Tested on 1.4.2, 1.5.0, and 1.6.0)


> or (seems least likely) toUpperCase actually alters
> the spelling of some words(!) rather than just changing a-z to A-Z
> (likewise accented equivalents) while leaving the rest alone.

That sounds as if you /haven't/ actually tried it. (Nor read the documentation
for String.toUpperCase() which expounds on this subject).

String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?). What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase. It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation. In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073). There is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS. (At least that's the theory the Unicode people
seem to be operating on -- they know more about it than I do, so I'm willing to
believe them).

It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works. It can't, it isn't
supposed to, and it doesn't...
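
A minimal sketch of the difference between the per-character and the
whole-string mapping. The class name SharpSMapping and the helper upper() are
made up for illustration, and Locale.ROOT is used only to keep the default
locale out of the picture:

```java
import java.util.Locale;

class SharpSMapping
{
    // Whole-string mapping: the result may have more chars than the input.
    // (Hypothetical helper, named only for this example.)
    static String upper(String s)
    {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args)
    {
        String word = "bei\u00DFen";
        String up = upper(word);

        // A char-to-char mapping has no room for a two-char result, so
        // Character.toUpperCase() hands U+00DF back unchanged.
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF');

        // The String version is free to grow the string.
        System.out.println(word.length() + " -> " + up.length());
        System.out.println(up);
    }
}
```

The point is only that the two APIs answer different questions: one maps a
single char to a single char, the other maps a string to a string.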

String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules. Or, since its behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.
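
If the comparison you actually want is the one that toUpperCase() implies, one
possible workaround is to uppercase both sides yourself. This is only a sketch
under assumptions: equalsFullUpper is a made-up name, Locale.ROOT is chosen
here to avoid locale-specific mappings, and uppercasing both sides is not the
same thing as proper Unicode case folding.

```java
import java.util.Locale;

class FullCaseCompare
{
    // Hypothetical helper: compares using the full String case mapping
    // rather than the char-by-char comparison equalsIgnoreCase() performs.
    static boolean equalsFullUpper(String a, String b)
    {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args)
    {
        String lower = "bei\u00DFen";
        String upper = "BEISSEN";

        // Char by char: the strings differ in length, so this cannot match.
        System.out.println(upper.equalsIgnoreCase(lower));

        // Full mapping on both sides: both become the same sequence.
        System.out.println(equalsFullUpper(upper, lower));
    }
}
```

Whether that is the "right" notion of equality depends on the application; for
linguistically aware comparison java.text.Collator may be a better fit.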

-- chris

[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.

