Re: Unicode question



Adem wrote:

While (Index1 < SOME_MAX_INDEX) and (Length(AString) < SOME_NUMBER)
do begin

Unless SOME_MAX_INDEX means the end of the string, how would you have determined a specific position inside the string?

Char1 := AString[Index1];
if Char1 in [CHARS_TO_REPLACE] then begin

Very unhandy with sets of UCS-4.

AString[Index1] := SOME_RANDOM_USC4CHAR;
end else if Char1 in [CHARS_NOT_DESIRED] then begin
System.Delete(AString, Index1, 1);

Did you notice that this deletion tends to invalidate your ending index?

While you take that side effect as an argument for fixed-size elements, I take it as an argument against indexing strings at all.


Here, everything related to the new UTF-16 string is likely to change.
Index, size, and length.

Given that sets of UCS-4 will not arrive in the near future, you'll have to redesign your algorithm. I'd suggest one or two lists containing the character or string codes and what to do with each, then calling something like StringReplace for every list element. Alternatively, use UTF-8 and sets, splitting your decision into multiple steps with an appropriate set for each consecutive byte.
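
A minimal sketch of the list-driven approach, assuming Delphi or Free Pascal in Delphi mode. The record type TReplaceRule, the Rules table and the function ApplyRules are hypothetical names made up for illustration; only StringReplace (from SysUtils) is a real library routine. The patterns are plain ASCII here to keep the source file encoding out of the picture, but the same calls accept patterns of any length.

```pascal
program ReplaceByList;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // Hypothetical rule: what to look for and what to put in its place.
  // An empty Replacement means "delete the pattern".
  TReplaceRule = record
    Pattern: string;
    Replacement: string;
  end;

const
  // Illustrative rules only; real code would list the actual codes.
  Rules: array[0..2] of TReplaceRule = (
    (Pattern: '&';  Replacement: 'and'),
    (Pattern: '--'; Replacement: '-'),
    (Pattern: '#';  Replacement: '')
  );

// Applies every rule in turn to a copy of S; no index is ever tracked.
function ApplyRules(const S: string): string;
var
  I: Integer;
begin
  Result := S;
  for I := Low(Rules) to High(Rules) do
    Result := StringReplace(Result, Rules[I].Pattern,
      Rules[I].Replacement, [rfReplaceAll]);
end;

begin
  WriteLn(ApplyRules('a & b -- c#'));  // prints: a and b - c
end.
```

The rules are applied in order, so a later rule also sees the output of an earlier one; whether that is wanted depends on the task.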


And, all of this needs to be handled iteratively, with no direct access; I
am not sure if you can safely System.Delete() or System.Move()
anymore.

You are too fixated on the idea that strings are arrays of characters. There are clever algorithms and tools (RegExp) for searching and manipulating strings that take less time than linear algorithms or consume less memory than sets. Try to express your intended task at a higher level of abstraction, and leave the work to the workers.


What size to assign to the various spaces, tabs or other control
characters and whitespace?

Had they been UCS-4, the answer would be simple: 4 {or 42 if you push
me hard enough :) }

That's the size in memory only; it is not related to string operations. Why do you try to track the length of a string yourself? It is safer, shorter and faster to obtain the new string size after every operation that will or may affect the length of the string. The memory management operations involved cost enough time that you cannot make up for them by local optimization.
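
If an index loop is kept anyway, the point about re-reading the length can be sketched as follows, assuming Delphi or Free Pascal in Delphi mode with 1-based strings. StripChar is a hypothetical helper, and it compares single code units (Char), not code points, so it is only an illustration of the loop bound, not a Unicode-aware routine.

```pascal
program DeleteWhileScanning;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: removes every occurrence of Unwanted from S.
function StripChar(const S: string; Unwanted: Char): string;
var
  I: Integer;
begin
  Result := S;
  I := 1;
  // Length(Result) is evaluated again on every pass, so the bound
  // shrinks together with the string; no cached end index can go stale.
  while I <= Length(Result) do
    if Result[I] = Unwanted then
      Delete(Result, I, 1)  // the next element slides into position I
    else
      Inc(I);               // advance only when nothing was removed
end;

begin
  WriteLn(StripChar('a-b--c', '-'));  // prints: abc
end.
```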


Forget about indexing code units or code points; that is too close to the pointer arithmetic used in C code. You rarely have to deal with exactly the fifth character in a string. In most cases you already obtain the position by pattern matching, even in your example! Then use pattern (substring) manipulation methods, which work with patterns of any size and with any kind of character encoding (even with MBCS ;-).
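
As a rough illustration of "position by pattern matching, then substring manipulation", here is a sketch assuming Delphi or Free Pascal in Delphi mode. ReplaceFirst is a hypothetical helper; Pos, Delete, Insert and Length are the standard System routines. The position and the size both come from the pattern itself, so nothing depends on how many code units one character occupies.

```pascal
program ReplaceByPattern;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: replaces the first occurrence of Pattern in S.
function ReplaceFirst(const S, Pattern, Replacement: string): string;
var
  P: Integer;
begin
  Result := S;
  P := Pos(Pattern, Result);  // position found by matching, not counting
  if P > 0 then
  begin
    Delete(Result, P, Length(Pattern));  // size taken from the pattern
    Insert(Replacement, Result, P);
  end;
end;

begin
  WriteLn(ReplaceFirst('one, two, three', ', ', ' / '));
  // prints: one / two, three
end.
```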

DoDi