Re: Unicode question
- From: Hans-Peter Diettrich <DrDiettrich1@xxxxxxx>
- Date: Wed, 06 Feb 2008 02:37:34 +0100
Adem wrote:
> While (Index1 < SOME_MAX_INDEX) and (Length(AString) < SOME_NUMBER)
> do begin
Unless SOME_MAX_INDEX means the end of the string, how would you have determined a specific position inside the string?
> Char1 := AString[Index1];
> if Char1 in [CHARS_TO_REPLACE] then begin
Very unwieldy with UCS-4: a Delphi set can hold at most 256 elements, so it cannot contain UCS-4 characters.
> AString[Index1] := SOME_RANDOM_UCS4CHAR;
> end else if Char1 in [CHARS_NOT_DESIRED] then begin
> System.Delete(AString, Index1, 1);
Do you notice that this deletion tends to invalidate your ending index?
While you take that side effect as a pro for fixed size elements, I take it as a con for indexing strings at all.
> Here, everything related to the new UTF-16 string is likely to change.
> Index, size, and length.
Given that sets of UCS-4 will not arrive in the near future, you'll have to redesign your algorithm. I'd suggest one or two lists containing the character or string codes and what to do with each of them, then applying something like StringReplace for every list element. Alternatively, use UTF-8 and sets, splitting your decisions into multiple steps with appropriate sets for the consecutive bytes.
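The list-driven approach suggested above could look roughly like the following. This is only a minimal sketch: the record type TReplaceRule, the Rules table, and the function ApplyRules are hypothetical names, and the rules themselves are invented for illustration. Only StringReplace and rfReplaceAll come from SysUtils.

```pascal
program ReplaceByTable;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // Hypothetical rule: what to look for and what to put in its place.
  // An empty NewText means "delete every occurrence".
  TReplaceRule = record
    OldText: string;
    NewText: string;
  end;

const
  // Illustrative rules only; the original post names no concrete characters.
  Rules: array[0..2] of TReplaceRule = (
    (OldText: 'a';  NewText: 'A'),
    (OldText: '-';  NewText: ''),
    (OldText: 'ss'; NewText: 'S')
  );

// Applies every rule in table order to the whole string.
function ApplyRules(const S: string): string;
var
  I: Integer;
begin
  Result := S;
  for I := Low(Rules) to High(Rules) do
    Result := StringReplace(Result, Rules[I].OldText, Rules[I].NewText,
      [rfReplaceAll]);
end;

begin
  WriteLn(ApplyRules('pass-a-glass'));  // prints pASAglAS
end.
```

Note that the rules run one after another over the whole string, so a later rule sees the output of the earlier ones; the order of the table matters. No index or length is tracked by the caller at any point.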
> And, all of this needs to be handled iteratively --no direct access; I
> am not sure if you can safely use System.Delete() or System.Move()
> anymore.
You are too fixated on the idea that strings are arrays of characters. There exist clever algorithms and tools (RegExp) for searching and manipulating strings that are more efficient in time than linear algorithms, or that consume less memory than sets. Try to express your intended task at a higher level of abstraction, and leave the work to the workers.
> What size to assign to the various spaces, tabs or other control
> characters and whitespace?
> Had they been UCS-4, the answer would be simple: 4 {or 42 if you push
> me hard enough :) }
That's only the size in memory, unrelated to string operations. Why do you try to track the length of a string yourself? It is safer, shorter and faster to obtain the new string size after every operation that will or may affect the length of the string. The memory management operations involved cost enough time that you cannot compensate for them by local optimization.
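To illustrate re-reading the length instead of tracking it, here is a sketch of the quoted loop for a plain single-byte AnsiString, where sets of characters still work. The function name Clean, the two set constants, and the '?' replacement character are assumptions made for the example, not anything from the original code.

```pascal
program RequeryLength;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

const
  // Hypothetical character classes, single-byte characters only.
  CharsToReplace  = ['x', 'y'];
  CharsNotDesired = ['-', '_'];

function Clean(const S: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := S;
  I := 1;
  // Length(Result) is evaluated on every pass, so a deletion can never
  // leave a stale upper bound behind.
  while I <= Length(Result) do
  begin
    if Result[I] in CharsToReplace then
    begin
      Result[I] := '?';
      Inc(I);
    end
    else if Result[I] in CharsNotDesired then
      Delete(Result, I, 1)  // do not advance: the next char moved to I
    else
      Inc(I);
  end;
end;

begin
  WriteLn(Clean('ax-b_y'));  // prints a?b?
end.
```

This keeps the indexing style of the quoted code and only removes the cached ending index. It does not carry over to UTF-8 or UTF-16 text, where one character may span several code units.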
Forget about indexing code units or code points; that's too close to the pointer arithmetic used in C code. You rarely have to deal with exactly the fifth character in a string; in most cases you already obtain the position by pattern matching, even in your example! Then use pattern (substring) manipulation methods, which work with patterns of any size and any kind of character encoding (even with MBCS ;-).
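A small sketch of that idea: the position comes from a pattern search, and the edit is expressed in terms of the pattern's own length, so the caller never counts characters. ReplaceFirst is a hypothetical helper name; Pos, Delete and Insert are the standard System routines.

```pascal
program PatternEdit;

{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

// Replaces the first occurrence of Pattern in S with NewText.
// Returns S unchanged if Pattern is empty or not found.
function ReplaceFirst(const S, Pattern, NewText: string): string;
var
  P: Integer;
begin
  Result := S;
  if Pattern = '' then
    Exit;
  P := Pos(Pattern, Result);            // position found by matching
  if P = 0 then
    Exit;
  Delete(Result, P, Length(Pattern));   // sizes come from the pattern itself
  Insert(NewText, Result, P);
end;

begin
  WriteLn(ReplaceFirst('size=4;', '4', '42'));  // prints size=42;
end.
```

Because the pattern and the replacement are whole substrings, the same code works whether a "character" occupies one code unit or several. For legacy multi-byte code pages a plain Pos can match in the middle of a character, so SysUtils.AnsiPos may be the safer search there.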
DoDi