Re: How to check variables for uniqueness ?



Chris Uppal wrote:
String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?). What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase. It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation. In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073), there is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS. (At least that's the theory the Unicode people
seem to be operating on -- they know more about it than me so I'm willing to
believe them).
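
For reference, the behaviour described in the quoted text can be reproduced with a minimal sketch like the one below. The class name `SharpSUpperCaseDemo` and the helper `upper` are made up for illustration; `Locale.ROOT` is used only so that the result does not depend on the JVM's default locale.

```java
import java.util.Locale;

class SharpSUpperCaseDemo {
    // Hypothetical helper: upper-cases with the locale-neutral ROOT locale,
    // so the outcome does not vary with the machine's default locale.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // German "strasse" spelled with U+00DF (LATIN SMALL LETTER SHARP S).
        String word = "stra\u00DFe";
        String shouted = upper(word);

        System.out.println(shouted);                                   // STRASSE
        System.out.println(word.length() + " -> " + shouted.length()); // 6 -> 7
    }
}
```

The single sharp s becomes the two-character sequence SS, so the upper-cased string is one char longer than the input.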

This seems to be excessively technical when the matter under discussion is simply capitalizing strings. In any event, equalsIgnoreCase should collapse these "ligatures" of yours as well. Also, I don't notice "fi" and "FI" producing strange behavior myself -- even if the letters are often run together so the 'i' hasn't got a separate dot *when typeset*, this doesn't affect the representation of a string in a computer, only the visually displayed output (and then usually only when serious typesetting software is used). Likewise, it makes sense to represent any other logical sequence of characters in a sensible way under the hood, regardless of any rendering fanciness that is done when presenting them to the user.

It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works. It can't, it isn't
supposed to, and it doesn't...

No, it is not erroneous to expect a method to do exactly and only what its name implies. It is erroneous, of course, to give a method a name that is misleading. If toUpperCase needs a lengthy documentation block explaining why its behavior is surprising, then it's a sure bet that it should not have been named that, since it's apparently really toUpperCaseAndDoesSomeExtraStuffToo.

String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules.

So you at least agree with me that it should be consistent with toUpperCase (and toLowerCase) -- all strings should have a single canonical toUpperCase, a single canonical toLowerCase, both should define equivalence classes on the mixed-case input strings, these should be the SAME equivalence class, and equalsIgnoreCase should implement and embody the corresponding equivalence relation.
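
The inconsistency both of us are complaining about is easy to demonstrate. The following is only a sketch: `equalsByUpperCase` is a hypothetical helper, not anything in the JDK, and comparing upper-cased forms is just one possible way to build a comparison that agrees with toUpperCase.

```java
import java.util.Locale;

class IgnoreCaseConsistencyDemo {
    // Hypothetical helper: treats two strings as equal when their full
    // upper-case mappings (under the locale-neutral ROOT locale) are equal.
    static boolean equalsByUpperCase(String a, String b) {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String lower = "stra\u00DFe"; // spelled with U+00DF, 6 chars
        String upper = "STRASSE";     // 7 chars

        // equalsIgnoreCase compares char by char, so strings of different
        // lengths can never match.
        System.out.println(lower.equalsIgnoreCase(upper));    // false

        // The toUpperCase-based comparison puts them in the same class.
        System.out.println(equalsByUpperCase(lower, upper));  // true
    }
}
```

So the two strings land in the same equivalence class under toUpperCase but in different ones under equalsIgnoreCase.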

Or, since its behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.

It sounds like toUpperCase has a "badly misleading" name, since it (supposedly) performs transformations that go well beyond what everyday blokes normally mean by "to upper case", and the method name is supposed to be a reasonably meaningful capsule summary, for everyday blokes, of what the method does. If a method is supposed to behave in a way that surprises any English speaker but not a German speaker, maybe it should have a German rather than an English name? :) If it's supposed to do locale-dependent stuff, then it should have a version that accepts a Locale object. The version that doesn't shouldn't surprise English speakers; the version that does shouldn't surprise anyone familiar with its locale-specific behavior for the locale actually used. Having locale-dependent behavior invoked randomly without explicit use of Locale objects, and which furthermore doesn't use the system locale, is by itself a sign of a questionable design as well as a sure source of bugs and problems.
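
For what it's worth, the JDK does provide a `toUpperCase(Locale)` overload, and the Javadoc defines the no-argument form as `toUpperCase(Locale.getDefault())`. A small sketch of passing the locale explicitly follows; the class name `ExplicitLocaleDemo` and the helper `upperIn` are invented for the example, and Turkish is picked only because its dotted capital I makes the locale dependence visible.

```java
import java.util.Locale;

class ExplicitLocaleDemo {
    // Hypothetical helper: upper-cases under an explicitly chosen locale
    // instead of relying on whatever the JVM default happens to be.
    static String upperIn(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-neutral mapping: plain ASCII capitals.
        System.out.println(upperIn("title", Locale.ROOT));
        // Turkish mapping: 'i' becomes U+0130 (capital I with dot above).
        System.out.println(upperIn("title", turkish));
        // The sharp-s expansion is not locale-specific; it appears here too.
        System.out.println(upperIn("stra\u00DFe", Locale.ROOT));
    }
}
```

Code that must not surprise anyone can pin the locale this way; which locale is the right one to pin is a separate argument.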

I've even encountered somewhere a notion that aString.length() is not even accurate in current Java versions if a string contains obscure characters. It suggests aString.<something using the obscure term "code point", apparently just Unicode-geek for "character"> as its replacement, while of course there's a ton of legacy code using length(). I don't suppose it occurred to them that the new fancy-whosit should have been a replacement length() implementation instead of some new name that doesn't suggest anything to do with the length of a string to someone who doesn't care about all the Unicode bells and whistles and just wants to process strings while remaining agnostic about what they are ultimately used for or contain? Those users will gravitate to length() (plus all that legacy code), not caring about the actual storage length of the internal representation but the length in characters of their data as a general rule. So there should be a length() method that returns the true length of the string, and if necessary a getSize() method that returns the representation's size in bytes or whatever in case someone needs such low level data. (If they persist strings as UTF-8 in a text format file that is parsed, or use serialization, then they don't.)
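
To make the length() complaint concrete, here is a sketch of the three different "sizes" of one string. I am assuming the replacement alluded to above is `String.codePointCount`; the class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class StringLengthDemo {
    // Number of 16-bit char units, which is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'.
        String s = "a\uD83D\uDE00b";

        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
        System.out.println(utf8Bytes(s));  // 6
    }
}
```

For a string with no characters outside the Basic Multilingual Plane, the first two numbers are the same, which is why most legacy code never notices the difference.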

[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.

A concept like "same spelling" can't be flawed. It's generally accepted that "color" and "colour" are the same word, but have different spellings, right? While "two" and "too" are different words spelled differently that sound the same, "tomato" and "tomato" are the same word spelled the same but pronounced differently, and "ant" (the bug) and "ant" (the build tool) are different words both spelled and pronounced the same.