Re: Unicode question



Hans-Peter Diettrich wrote:

Adem wrote:

I am inclined to separate data from representation.

Obviously not for the representation in memory ;-)

I am not sure what you mean here --can you elaborate?

I thought using UCS-4 chars would be good enough for that. Am I missing
something?

I mean, for all practical purposes, 4-byte codepoints cover all my
needs as far as pure data aspect of text is concerned.

They fit your approach to string handling. Could you imagine that
your approach might be inappropriate at all, and only happened to
work fine with ASCII strings?

Of course..

The first thing that comes to my mind is the case of case folding. They
should have completely different code points for the 'I', 'İ' (dotted
capital I), 'ı' (dotless small i) and 'i' for Turkish to begin with
--otherwise you have no idea whether the case folding you've just
applied makes sense or not..
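A minimal Python sketch of the case-folding concern above. It relies only on the built-in `str.lower()` and `str.upper()`, which apply the locale-independent Unicode default mappings. The helpers `turkish_lower` and `turkish_upper` are hypothetical names introduced here for illustration; they are not part of any library.

```python
# Unicode gives the Turkish-specific letters their own code points
# (U+0130 and U+0131), but plain 'I' and 'i' are shared with other
# Latin-script languages, so default case mapping cannot tell them apart.
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with the Turkish I/ı and İ/i pairing."""
    text = text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase with the Turkish i/İ and ı/I pairing."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


city = "D\u0130YARBAKIR"            # DİYARBAKIR
default_lowered = city.lower()      # default mapping, not Turkish-aware
turkish_lowered = turkish_lower(city)

print(len(city), len(default_lowered), len(turkish_lowered))
print(turkish_upper(turkish_lowered) == city)
```

With the default mapping, 'İ' lowercases to 'i' followed by a combining dot (U+0307), so the lowered string is one code point longer than the original and the dotless 'ı' is lost. The language has to be supplied from outside the string; the code points alone do not carry it.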

With regards to its representation, of course, a lot of other
parameters need to be taken into account --such as the display
width or display height, font size, forecolor, backcolor, style
(bold, italic, underline) etc. of a given code point.

Other aspects are file or line/column positions, independently from
the visual representation. When something restricts a field width,
maybe in a database field or some entry form, the limit may be
expressed in bytes of the fixed-size string field, in the required
character encoding, or in the average or precise display width of the
characters, when e.g. Chinese graphemes require the use of a larger
font size, or when ligatures etc. will affect the number of character
cells in an entry form. Unicode does not guarantee a uniform
character width, even in monospaced fonts. Whenever the length of a
string is restricted, by whatever measure, calculations must be based
on that measure, which is not necessarily the number of code points
in a string.
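The different measures of "length" mentioned above can be compared side by side. This is a rough Python sketch: `measures` and `cell_width` are hypothetical helpers, and the display-width estimate (0 cells for combining marks, 2 for East Asian wide or fullwidth characters, 1 otherwise) is a simplifying assumption, not what any particular font or terminal does.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate character cells for one code point (an assumption)."""
    if unicodedata.combining(ch):
        return 0  # combining marks stack on the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def measures(text: str) -> dict:
    """Return several possible 'lengths' of the same string."""
    return {
        "code_points": len(text),
        "code_points_nfc": len(unicodedata.normalize("NFC", text)),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf32_bytes": len(text.encode("utf-32-le")),
        "cells": sum(cell_width(ch) for ch in text),
    }


for sample in ("e\u0301", "\u6f22\u5b57", "\U0001D11E"):
    print(repr(sample), measures(sample))
```

Even the code point count is not stable: "e" plus a combining acute accent is two code points before NFC normalization and one after, while occupying a single cell either way.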

True. But, since I am mostly interested in in-memory operations I see
these as the scope of discussion in another level --GUI and widget
design and implementation.

How you decide to handle these is up to you --the developer. You
might want to do that sequentially, or establish a direct-mapping
between the data and the representation/display.

We should leave the onscreen representation to appropriate controls,
and forget about text attributes, and concentrate on the general
string handling stuff instead. Or would you bother with inlining text
attributes, block marks or the like, into your strings?

No. Of course not.

Trouble with UTF-16 (IMO) is that, you will never have a
direct-access option.
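To make the direct-access point concrete, here is a small Python sketch that treats a string as a list of UTF-16 code units and looks up the n-th code point. The function names `utf16_units` and `code_point_at` are made up for this illustration. Finding the n-th code point means scanning from the start, because a surrogate pair occupies two units.

```python
def utf16_units(text: str) -> list:
    """Encode text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_at(units: list, n: int) -> int:
    """Return the n-th code point (0-based) by scanning from the start: O(n)."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # High surrogate followed by low surrogate: one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = unit
            step = 1
        if n == 0:
            return cp
        n -= 1
        i += step
    raise IndexError("code point index out of range")


text = "a\U0001D11Eb"          # 'a', MUSICAL SYMBOL G CLEF, 'b'
units = utf16_units(text)
print(len(text), len(units))    # code points vs. code units
print(hex(units[2]))            # naive indexing lands on a low surrogate
print(hex(code_point_at(units, 2)))
```

With a fixed 4-byte representation the same lookup is a plain index, which is the O(1) access being asked for; whether that is worth the extra memory is the point under dispute in this thread.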

Dealing with strings on a character-entity level may not be
appropriate in all languages or notations, which are covered by
Unicode. Think in patterns of an arbitrary size, not only of exactly
1 character, and all your problems go away.

This is like assuming and accepting that there's no gravity or that it
varies around the globe. :P

It would be good engineering practice to use GPS location and look up
the relevant gravity value --or better yet, carry along a device that
calculates it-- every single time, but would it be worth the trouble
--and a good enough reason to produce sloppy performance? <G>

Suppose you're designing a DB schema. You have a 'name' field. You
make an executive decision to restrict the number of code points
(characters in the old speak) to 30.

Unless you implement the database engine, which is very unlikely <g>,
you have to use what your database offers for this case. Unless you
decide to use a blob, and manage the entire storage representation
yourself, you don't have to bother with the internals of your
database.

We're constrained with the available tools --unfortunately.

AFAIK, blob fields are not indexable. So you would have to forgo a lot
of speed.

Since the length of an entry is not normally fixed to exactly 30
characters, your database must offer means to determine the length of
an actual (possibly shorter) entry, either explicitly (counted
string) or implicitly (zero terminated string). In the case of
terminated strings, you'll have to find out the actual string size
yourself, in O(n), regardless of the string encoding. Otherwise the
string length has to be given explicitly, both on entry or retrieval
of the field.
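The counted versus zero-terminated distinction can be sketched in a few lines of Python over raw bytes. The layout used here (a 4-byte little-endian length prefix for the counted form) and the function names are assumptions for illustration, not the format of any particular database.

```python
def terminated_length(buf: bytes) -> int:
    """Zero-terminated: scan until the first 0 byte, O(n) in the length."""
    n = 0
    while buf[n] != 0:   # raises IndexError if the terminator is missing
        n += 1
    return n


def counted_length(buf: bytes) -> int:
    """Counted: read the assumed 4-byte little-endian prefix, O(1)."""
    return int.from_bytes(buf[:4], "little")


payload = "Adem".encode("utf-8")
terminated = payload + b"\x00" + b"padding"
counted = len(payload).to_bytes(4, "little") + payload

print(terminated_length(terminated), counted_length(counted))
```

Both functions return a byte count. Turning that into a number of code points is a separate step whose cost depends on the encoding.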

Trouble is, O(n) is quite a number against O(1)..

Now, how many bytes would you specify for that field in the DB in
order not to let it truncate the entered text.

You understand now, that this question doesn't make sense?

Since UTF-16 chars can be 2 or 4 bytes, one has to be always alert
about it --I can see a lot of bugs due to this. Whereas, UCS-4 is 4
bytes all the time. One less thing to watch out for.

And you see that a fixed size field requires 30*4 bytes, for either
UTF-16 or UTF-32?

Yes. And no. UTF-16 is either 2 bytes or 4 bytes.
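The byte budget for a 30 code point field can be checked directly. This Python sketch uses only `str.encode`; `encoded_size` and the sample strings are made up for the illustration.

```python
MAX_CODE_POINTS = 30  # the limit from the 'name' field example


def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


# 30 code points from the Basic Multilingual Plane ...
bmp_name = "a" * MAX_CODE_POINTS
# ... and 30 code points from outside it (each needs a surrogate pair).
astral_name = "\U0001D11E" * MAX_CODE_POINTS

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, encoded_size(bmp_name, encoding), encoded_size(astral_name, encoding))
```

Both positions hold at once: a typical name takes 60 bytes in UTF-16, but a field that must never truncate 30 code points needs 120 bytes in UTF-16 as well as in UTF-32.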

Any string operation will be much slower.

What kind of string operation? Concatenation, splitting,
searching, or what?


See the 'while do begin end' example I gave in the other/sister
thread.

Which I take for an inappropriate approach to string handling.

:)

Which is what we've all been doing since the beginning of time :P

Such world shattering revolutions should not be introduced so
silently...

I am no more expert than anyone here, but GNU C library's internal
representation is --according to various links (google)-- UCS-4.

Representation of what? The C (character) I/O routines work with
"int" values, not with "char".

And 'int' is a 4-byte numeric, isn't it?

Wikipedia claims that strings are
passed around in UTF-8 encoding, in Linux and alike systems and
libraries.

This, I would have no issues with. What I am pestering you all about is
the fact that there is (will be) no 4-byte native string in Delphi.

We, IMO, need a 4-byte reference counted string for internal
representation/use.

And, by 'internal representation/use' I mean 'within my/your Delphi code
itself'.

And for external representation/use, having UTF-16 (and UTF-8) is fine.

BTW, GNU C library does make a distinction between an internal
representation (in-memory text data) and external representation
(data for storage and transmission).

GNU C indeed assumes 4 bytes for the wchar_t type, with a note that
even some Unix systems prefer 2 bytes for the same type. This means
that conversions between the GNU libraries and the OS or other
libraries may be necessary, depending on the specific target
platform. The most important platform for Delphi developers is
Windows, which even in the .NET flavour uses UTF-16.
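The platform dependence of `wchar_t` (and of `int`, asked about earlier) is easy to check from Python through the standard `ctypes` module, which reports the sizes used by the C compiler that built the interpreter. The `sizes` dictionary is just a name chosen for this sketch.

```python
import ctypes

# Sizes in bytes as seen by the platform's C ABI.
sizes = {
    "wchar_t": ctypes.sizeof(ctypes.c_wchar),
    "int": ctypes.sizeof(ctypes.c_int),
}

for name, size in sizes.items():
    print(f"sizeof({name}) = {size}")
```

On a glibc-based Linux system `wchar_t` is normally 4 bytes, while on Windows it is 2 bytes, which is why conversions at the boundary may be needed.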

I consider the .NET to be external to my code --for which I would
naturally have to use whatever it expects.

What I object to is the fact that CG is missing this distinction and
forcing us to use something best suited for external representation
internally too.

The measure for the "best" suited representation is up to the user.

Yes. Yes. Yes. <G>

And, for that we need to have alternatives to choose from. ATM, we
don't have them (not enough of them, anyway).

All I am asking is that we have these 3 options:

-- AnsiString (1 byte per 'character')
-- UTF16String (2 or 4 bytes per 'character')

*AND* one more:

-- UCS4String (4 bytes per 'character')

CG simply decided that Unicode is supported by the OS (Windows),
eliminating the need for maintaining another proprietary library.

No. CG missed the fact that 4-byte strings would be needed.

Everybody is free to use another library of his choice, be based on
string classes or on other character types.

Libs and data types are completely different things. If CG supplied the
4-byte reference counted strings, writing/adapting a 3rd party lib for
it would be possible. Right now it isn't.

Using a C style library, like glibc, of course will throw you back
into an API without built-in string types or sets - every string
operation requires an explicit function call, even the determination
of the length of a string <BG>.

We may need to face glibc sooner than we fathom. QT4 is going to be
available in Windows too..

I'm not willing to downgrade existing code to that level, nor to
use it in new code. I'd be more comfortable with a (polymorphic)
string class, and never bother with its internals.

I could write a polymorphic class in half an hour and be done with it
once and for all.

But... things like RegEx need to be adapted to it too. This is the
hard part.

Thanks anyway for this interesting discussion. Even if we may end up
without a consensus, I find the discussion of the various aspects of
the use of Unicode very instructive :-)

:)

Me too :)


