Re: An Example for Discussion

From: Arthur J. O'Dwyer (ajo_at_nospam.andrew.cmu.edu)
Date: 03/04/05


Date: Fri, 4 Mar 2005 01:23:52 -0500 (EST)


On Fri, 4 Mar 2005, Stan Milam wrote:
>
> I've been away from C programming since 1999 and I've recently taken it up
> again. Below is a string function I wrote this morning that will be going
> into my home-grown library of string functions. I brought the idea of it
> from a function of the same name in PL/SQL.

   This being a C group, and not an Oracle group, one shouldn't expect the
readers to know the PL/SQL library. I went and tracked it down. It turns
out that the 'TRANSLATE' function in PL/SQL does basically what the 'tr'
command in *nix does by default: given a list of "from" characters and a
matching list of "to" characters, replace each "from[i]" with "to[i]" in
the given string.

> Working with C strings has always been problematic because you can crash a
> program faster than you can say "Jack's you're uncle!" if you are not
> careful.

   True enough.

> I thought I would post the code and see what the experts had to say. For
> reference "strtools.h" is my header file for my homegrown string functions,
> and chrsubst() is used to substitute characters in a string, and
> str_rmchars() removes a set of characters from a string.

   That's not very descriptive. Moreover, I think your aggressive
"modularization" ends up hurting both clarity and performance --- and
not just because of the weird function names!

   (All criticism above and below is intended in the best of spirits, of
course. I hope it's useful to you. Now, on with the dissection!)

> /**FUNCTION************************************************************/
> /* Name: */
> /* translate(). */

   Seriously, if the reader can't tell this much from the code, do you
/really/ want him reading it? This entire comment block is much too
verbose; it's almost longer than the function, and it /will/ be longer
than the function once I get through with it!

> /* */
> /* Synopsis: */
> /* #include <strings.h> */

   You never explained what the above header is. Is it a typo, perhaps?
And why should the reader care what dependencies this function has? That's
the implementor's job --- to make sure that the source file that defines
this function #includes <strings.h> (or <string.h>) at the top. There's
no reason to repeat any of this stuff here.

> /* #include "strtools.h" */
> /* char *translate(char *src, char *fromstr, char *tostr); */

   This is doubly weird: first, because again the reader ought to be able
to read a prototype himself without seeing it twice; and second, because
the prototype above uses completely different variable names from the
one you actually use in the source code. They should match; there's no
reason they shouldn't.

> /* */
> /* Description: */
> /* The translate function will translate characters that appear */
> /* in both the src and fromstr strings. If the fromstr character */
> /* has a corresponding value in the tostr string all characters */
> /* in the src string will be translated to that of the */
> /* corresponding tostr character. If there is no corresponding */
> /* character in the tostr string the character will be removed */
> /* from the source string. */

   This is both dense and vague. I didn't understand what it was trying
to say until I read the appropriate page in the SQL/PL manual. Surely
there's a better way to describe the procedure! (In fact, part of your
problem is that the reader can't immediately tell from the source code
what's going on. Once we clarify the code, the algorithm will be
obvious, and thus will need less prose explanation.)

> /* Arguments: */
> /* char *source - The string to be examined and translated. */
> /* char *fromstr- The characters to translate in the source. */
> /* char *tostr - The corresponding characters the source */
> /* characters are translated to. */
> /* */
> /* Return Value: */
> /* The starting address of the source string. */
> /* */
> /************************************************************FUNCTION**/
>
> char *
> translate ( char *p_source, char *p_from_string, char *p_to_string )

   Hungarian notation. And done wrong. Since when does "p" stand for
"string"? None of those three parameters is a "pointer" to any data
type in the abstract sense; they're all strings. If you were going to
use any Hungarian prefix, you'd be using "s": 's_source', 's_from',
and so on. ('s_from_string' would be a stupid name, as I'm sure you can
see.)
   Drop the obfuscatory notation.

> {
> unsigned x_sub = 0, l_len = strlen( p_to_string );
>
> while ( *p_from_string ) {

   (I disagree with your whitespace style, too, but that's a personal
religious issue. I mention it only because I'll be using my own style
later on, so if you want to see my suggested style, you should scroll
down now.)

> if ( strchr( p_source, *p_from_string ) != NULL ) {
> if ( x_sub < l_len )
> chrsubst( p_source, *p_from_string, p_to_string[x_sub] );

   I have no idea what this is doing, but I bet it scans through the
entire string 'p_source', replacing each instance of '*p_from_string'
with 'p_to_string[x_sub]'. And you do this 'l_len' times. That's a
lot of looping for little reason. And I've already mentioned the poor
naming scheme: 'chrsubst'!?

> else {
> char l_ch[] = "";

   This is equivalent to
                  char l_ch[1] = {0};
The following two lines lead me to believe that perhaps this is another
typo, that you meant 'char l_ch[2]' instead. Otherwise, why bother making
a copy of a single character?

> l_ch[0] = *p_from_string;
> str_rmchars( p_source, l_ch );
> }
> }
>
> x_sub++;
> p_from_string++;

   Any time you see a loop, in C, which ends with an increment or a
pointer-chasing assignment, that should send up warning flags that
something hasn't been thought out properly. Here, it indicates that
you're missing a 'for' loop. See below.

> }
>
> return p_source;
> }

So that was the code. Now here's my overall analysis.
   You end a 'while' loop with an increment. That's the sign of a 'for'
loop struggling to get out. So we rewrite the loop:

[...]
> unsigned x_sub;
>
> for (x_sub=0; *p_from_string; ++x_sub, ++p_from_string) {
> if ( strchr( p_source, *p_from_string ) != NULL ) {
[...]
> }
> }
[...]

The next thing I noticed was that now we have a 'for' loop with two
increments in the increment clause. That's a sign that we have dependent
loop variables that need refactoring --- and indeed, we are really using
'x_sub' as a loop counter, and merely incrementing 'p_from_string' to keep
it pointing at the 'x_sub'th element of that array. So we merge the two
dependent variables; that yields

> char *translate(char *p_source, char *p_from_string, char *p_to_string)
> {
> unsigned l_len = strlen( p_to_string );
> unsigned x_sub;
>
> for (x_sub=0; p_from_string[x_sub]; ++x_sub) {
> if (strchr(p_source, p_from_string[x_sub]) != NULL) {
> if ( x_sub < l_len )
> chrsubst(p_source, p_from_string[x_sub], p_to_string[x_sub]);
> else {
> char l_ch[1];
> l_ch[0] = p_from_string[x_sub];
> str_rmchars(p_source, l_ch);
> }
> }
> }
>
> return p_source;
> }

Suddenly the parallel structure in the call to 'chrsubst' stands out!
We have de-obfuscated a critical part of the algorithm: We now clearly
see that characters from 'p_from_string' are matched only with the
corresponding characters from 'p_to_string'.

   Next, I prefer to remove one level of indentation by changing that
first gigantic 'if' to a conditional 'continue'; at the same time, to
conserve even more space, I'll switch to a saner naming convention.
The two input-only parameters are properly const-qualified. This yields
what I consider the "clean" version of the procedure you wrote:

/*
    Translates characters that appear in the |from| string to their
    corresponding characters in the |to| string. If |to| is shorter
    than |from|, the unmatched |from| characters are deleted from |src|.
    Overwrites |src| with the output string, and returns the new |src|.
*/
char *translate(char *src, const char *from, const char *to)
{
      size_t len_to = strlen(to);
      size_t i;

      for (i=0; from[i] != '\0'; ++i)
      {
          if (strchr(src, from[i]) == NULL)
            continue;
          if (i < len_to) {
              chrsubst(src, from[i], to[i]);
          }
          else {
              char ch[1]; /* Should probably be char ch[2] = {0}; */
              ch[0] = from[i];
              str_rmchars(src, ch);
          }
      }

      return src;
}

   But we're not done! Remember, I wanted to get rid of those weird
library-routine calls for two reasons: first, for clarity, and second,
for efficiency. There's no reason to be scanning so many times through
the input string! What's more, we can really elucidate the algorithm
much better with "simple" code like this:

char *translate(char *src, const char *from, const char *to)
{
      size_t in, out;
      size_t len_to = strlen(to);
      for (in=out=0; src[in] != '\0'; ++in)
      {
          /* Is the current character in the |from| array? */
          const char *t = strchr(from, src[in]);
          /* No; just echo it to the output string */
          if (t == NULL)
            src[out++] = src[in];
          /* Yes; if it has a match in |to|, then echo that entry. */
          else if (t - from < len_to)
            src[out++] = to[t - from];
      }

      src[out] = '\0';
      return src;
}

   A version of this routine using the 'strpbrk' library function
might be interesting to see, also, though I doubt it would be any
cleaner or quicker.

   Anyway, I hope you get the idea --- "clean and simple" beats pretty
much any other strategy, any day. I highly recommend "The Elements of
Programming Style" (Kernighan & Plauger) to anyone who wants to learn
to write (or critique!) source code.

HTH,
-Arthur



Relevant Pages

  • Re: Get the path and namefile in run time
    ... function Get_Path_Only return String; ... -- This returns the first N characters of the program name. ... end loop; ...
    (comp.lang.ada)
  • Re: Help a beginner - simple lowercase to uppercase and so on function
    ... And then one to loop across the string calling that function ... copying at to a new array. ... are any characters other than lowercase letters */ ...
    (comp.lang.c)
  • Re: Fast string operations
    ... Looping over characters like that can't slow things down that much. ... iteration through the string. ... a loop is going to start from the beginning of the ... the original issue was the amount of memory that is being consumed ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Why the following code is faster executed when placed in the form?
    ... everything that is not changed inside the loop must not be calculated there. ... Some of the characters you use are predefined as ... It allocates memory for Chr/in your case you can add call to Chr/. ... Now imaging if you build extremely big string, ...
    (microsoft.public.vb.general.discussion)
  • Re: Altering for loop index
    ... Say I am processing characters within a string and if I ... find a series that needs to be skipped I would increment the index by ... that, at the top of the loop, the index will be incremented again by the ...
    (microsoft.public.vc.language)