Re: Fast UTF-8 strlen function



randyhyde@xxxxxxxxxxxxx wrote:
>
> Is there a fast UTF-8 string length function floating around?

What's the meaning of "string length" for a UTF-8 (or other
Unicode) string? Length in bytes, or length in characters?
As NoDot points out, if we want length in bytes, a "regular"
strlen ought to work.

Beth raised this issue on the luxasm-devel list, too
(although not asking for an optimized function). IIRC, her
example sent a string to stdout, and the question was "what
goes in edx". I'm pretty sure we need length in bytes here,
and in "most" cases (allocating memory, e.g.), but sometimes
we'd want the length in characters (plus "font metrics" to
determine where the "next" print position would be, for
example).
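For the stdout case, a minimal C sketch, assuming the example in question used the Linux write system call (whose count argument is what lands in edx under the int 0x80 convention). The names utf8_byte_len and put_utf8 are made up for illustration:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Byte count of a NUL-terminated UTF-8 string: a plain strlen,
   since UTF-8 never uses a zero byte inside a character. */
size_t utf8_byte_len(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Send the string to stdout. The third argument to write() is a
   count in BYTES, not characters. */
ssize_t put_utf8(const char *s)
{
    return write(STDOUT_FILENO, s, utf8_byte_len(s));
}
```

So "h\xC3\xA9llo" (that's "héllo") is five characters but needs a count of six.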

I suppose... if we encounter a byte with the high bit clear,
we just count "one". If we encounter a byte with the high
bit set, we determine how many leading bits are set (look-up
table?); that's the length of the whole sequence in bytes,
so we skip that many bytes, counting "one" for the whole
mess... I don't see an "optimized" version of this working
out very well...
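A rough C sketch of the scheme just described, not a tuned one. It assumes well-formed, NUL-terminated UTF-8 and counts code points (which is not always the same as what a user sees as one "character"). The function names are invented here. The second function is a different technique, shown only for comparison: continuation bytes always look like 10xxxxxx, so counting every byte that is *not* one of those gives the same answer with no skipping.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the sequence introduced by lead byte b, judged
   only from its leading bits. Stray continuation bytes (10xxxxxx) and
   invalid lead bytes are treated as length 1 so the scan always
   moves forward. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII   */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx          */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx          */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx          */
    return 1;
}

/* Count code points by skipping from lead byte to lead byte. */
size_t utf8_strlen_skip(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s != '\0') {
        size_t n = utf8_seq_len((unsigned char)*s);
        /* Stop early if the string ends in mid-sequence. */
        while (n > 0 && *s != '\0') { s++; n--; }
        count++;
    }
    return count;
}

/* Alternative: count every byte that is not a continuation byte. */
size_t utf8_strlen_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

Either one returns 5 for "h\xC3\xA9llo", where strlen returns 6. Neither validates the input; a real routine would have to decide what to do with malformed sequences.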

Beth says Nasm accepts UTF-8 strings in quoted strings (and
comments) "by accident". I don't think it's "by accident", I
think it's by "careful design"... not *Nasm's* design, but
UTF-8's. As Betov observed, the risk is that we'd encounter
a "false end-quote" (or false EOL, in a comment). As long as
that doesn't happen (and Beth's explanation assures us it
won't), it's "just bytes", and the assembler doesn't need to
care what it represents.
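To illustrate why a bytewise scan is safe: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can equal an ASCII quote (0x22) or newline (0x0A). A small C sketch of an end-quote search that knows nothing about UTF-8 (find_end_quote is a hypothetical name, not anything from Nasm's source):

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of the first '"' in s, or -1 if there is
   none. The scan is purely bytewise; bytes >= 0x80 are just passed
   over, since they can never compare equal to '"'. */
long find_end_quote(const char *s)
{
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++)
        if (s[i] == '"')
            return (long)i;
    return -1;
}
```

Given the bytes of café followed by a quote ("caf\xC3\xA9\""), this returns 5: three ASCII bytes, two bytes for the é, then the real end-quote.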

LuxAsm, since it'll include an editor, *may* need to
determine length in characters, too... My big question is,
if a user presses the key for "King Tut", what in hell kind
of an "event" do we get???

Best,
Frank