Re: Null-terminated strings: the final analysis.



Mark McIntyre <markmcintyre@xxxxxxxxxxxxxxxxxxx> writes:
On 12/04/09 21:32, Keith Thompson wrote:
Tab and newline characters are non-printable; can a text file contain
those?

Indeed, I left that as so obvious it was unsaid - I'd forgotten I was
in the land of the pedants!

Subtle distinctions are at the core of what we're discussing here.
Let's not ignore such distinctions for the sake of avoiding pedantry.

[...]

On the systems I use, if I write a '\a' character (ASCII BEL) to a
text file, I can reasonably expect to see a '\a' character when I read
it back. The same is not true of '\0' if I use fgets() to read it

What would you expect to "see"? I would hope that nothing is displayed
on your VDU or printed on paper for instance. So in the context of
"text", how can it be meaningful?

By "see", I meant that I could write something like:

    c = fgetc(my_file);
    if (c == '\a') {
        puts("Yes, it's a '\\a' character");
    }

with the expectation that the puts statement would be executed
sometimes.

I left that as so obvious it was unsaid. 8-)}

Incidentally, I do have at least one text file with an embedded ASCII
BEL character. I have a perfectly valid reason for doing this, and
it's never been a serious problem.
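
The fgetc()/fgets() difference mentioned above can be sketched as follows. This is only an illustration, not code from the thread: the function names count_with_fgetc and first_line_strlen are made up for the example, and it assumes a stream that preserves every byte written to it (a binary stream, or a text stream on a system such as those described above).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count occurrences of ch by reading the stream one
   character at a time. fgetc() hands back every character, so '\a' and
   '\0' are both visible to the caller. */
size_t count_with_fgetc(FILE *fp, int ch)
{
    size_t n = 0;
    int c;

    rewind(fp);
    while ((c = fgetc(fp)) != EOF) {
        if (c == ch)
            n++;
    }
    return n;
}

/* Hypothetical helper: read the first line with fgets() and report its
   length via strlen(). fgets() stores an embedded '\0' in the buffer like
   any other character, but strlen() stops at it, so the reported length
   is shorter than the number of characters actually read. */
size_t first_line_strlen(FILE *fp, char *buf, int size)
{
    rewind(fp);
    if (fgets(buf, size, fp) == NULL)
        return 0;
    return strlen(buf);
}
```

With a line holding 'x', '\a', 'y', '\0', 'z', '\n', the fgetc() loop finds one '\a' and one '\0', while strlen() on the fgets() buffer reports 3 even though six characters were read into it.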

In any case, the distinction between text files and non-text files is
irrelevant to a discussion of C strings. Clearly C strings can
contain any characters other than '\0', including non-printable
characters. If I want to construct a sequence of characters
containing a control sequence for a VT100-style terminal, for example,
a string is a perfectly sensible thing to use. And if any such
sequences include null characters (I don't know whether they do or
not), then the fact that I can't store embedded null characters in
strings is an inconvenience.

And anyway, if you want char arrays containing nulls, C can do those, no
problem.

Yes, but you can't store a null character in the middle of a string,

But again that's a circular argument.

Not at all.

If, because of some requirement outside the C language, I want to
store arbitrary character sequences, I can use C strings only if I can
guarantee that I don't need to store any null characters.

[...]

So I contend that it's not a useful point. If you want to transport
elephants, use a crate, not a box. If you want to transport nulls, use
an array, not a string - or use some language that allows internal
nulls in its string type.

Right. So C strings impose a limitation, and I might have to work
around that limitation in some circumstances. That seems to me to be
a very useful thing to be aware of.

For example, if I'm reading chunks of data from a binary file, I can
store those chunks in character arrays, but I can't safely use the
language's built-in string processing functions on them. For example,
I can't use strstr() to search for a pattern in the data. If C had
been designed differently, that wouldn't be an issue.
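
A minimal sketch of the usual workaround, assuming the caller tracks the length of each chunk separately. The name find_bytes is hypothetical; it is not a standard library function (some systems offer a similar non-standard memmem()).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the first occurrence of needle (nlen bytes)
   in hay (hlen bytes). Lengths are passed explicitly, so an embedded '\0'
   is treated like any other byte. Returns a pointer to the match, or NULL
   if there is none. An empty needle matches at the start. */
const unsigned char *find_bytes(const unsigned char *hay, size_t hlen,
                                const unsigned char *needle, size_t nlen)
{
    if (nlen == 0)
        return hay;
    if (nlen > hlen)
        return NULL;

    for (size_t i = 0; i + nlen <= hlen; i++) {
        if (memcmp(hay + i, needle, nlen) == 0)
            return hay + i;
    }
    return NULL;
}
```

For a chunk holding the bytes 'a', 'b', 0, 'c', 'd', searching for "cd" with strstr() finds nothing, because strstr() stops at the first null byte; find_bytes() with the chunk's real length finds the match at offset 3.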

which makes char arrays containing nulls more difficult to deal with.
I'm not saying it's a fatal flaw in the language, but it is a slight
inconvenience.

I can't recall /ever/ having found it so, in 20+ years of
programming. It's surely just a matter of interface design: if you
expect to be fed non-strings, then don't use a string to contain
them. Alternatively, document the interface appropriately.

Ok, so it's a *potential* inconvenience.

And there are languages whose native strings *can* contain embedded
null characters. In C, strlen("foo\0bar") returns 3; in Perl,
length("foo\0bar") returns 7, and there's nothing particularly special
about the 4th character.
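
The C half of that comparison can be checked with a short sketch. The names sample and show_lengths are made up for the illustration.

```c
#include <stdio.h>
#include <string.h>

/* The literal has 7 written characters; the compiler appends a
   terminating '\0', so the array occupies 8 bytes. */
static const char sample[] = "foo\0bar";

/* Print what strlen() reports next to the array's real size. */
void show_lengths(void)
{
    printf("strlen: %zu\n", strlen(sample)); /* stops at the embedded '\0' */
    printf("sizeof: %zu\n", sizeof sample);  /* counts every byte stored */
}
```

The bytes after the embedded null are still in the array (sample[4] is 'b'), but nothing in the string library will look at them.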

Apart from being a nul, which isn't a common character in real-world
strings. For instance, find me a place or person with a nul in their
name, or a word in any language, including Klingon.

Strings aren't just used to store names of places or people. And if C
strings *could* store embedded null characters, they might be
*slightly* more useful than they are without that ability.

In the design of the language, a tradeoff was made between the
convenience of null termination vs. the *slightly* greater flexibility
of being able to store embedded null characters. I do not suggest
that the choice was the wrong one, merely that it was a tradeoff with
a non-zero cost. And if you've never run into it, that doesn't change
the point.

--
Keith Thompson (The_Other_Keith) kst-u@xxxxxxx <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"