Kernighan and Pike's "Beautiful" Code
- From: spinoza1111 <spinoza1111@xxxxxxxxx>
- Date: Thu, 3 Jan 2008 02:16:12 -0800 (PST)
Nonsense prevails, modesty fails
Grace and virtue turn into stupidity...
If something you missed didn't even exist
It was just an ideal -- is it such a surprise?
What shall we do, what shall we do with all this useless beauty?
- Elvis Costello
Richard Heathfield and Randy Howard have vandalized the thread "Brian
Kernighan, maybe I'm not worthy, maybe I'm scum". It was intended to
be a technical discussion of whether the code snippet, authored by Rob
Pike in 1998 on behalf of Princeton computer science professor Brian
Kernighan and reproduced in Beautiful Code (Ozam and Wilson, eds.,
O'Reilly Media, Sebastopol CA 2007) is as Kernighan says, "beautiful".
I have adapted the snippet mostly unchanged to Visual Studio C++ and
written a comparative version in C Sharp to examine Kernighan's claim
that Java or .Net strings would be less "efficient". I am posting my
results in the original thread and in this new thread so that the
community is spared the cyberbullying, cyberstalking, harassment, and
outright vandalism carried out by Richard Heathfield and Randy Howard.
These consist of an intro with the results of my first benchmark, a
list of what I consider the flaws in the Pike code, the C sharp
solution, and the C++ version of the code in the book.
Edward G. Nilges (author, Build Your Own .Net Language and Compiler,
Apress, 2004, Berkeley CA)
Hong Kong 1-3-2007 2:21 PM
I've implemented Rob Pike's algorithm in C Sharp, replacing its use
of
addresses with Strings to provide internationalization and above all
safety as the code is changed. But I've preserved most of its
features...including something I noted while doing the conversion:
Pike's bad implementation of the compare for a character match
(including the match of dot) in two places: in matchhere() and in
matchstar(). This is something which should have been done with a
function call or a macro in the text presented as any code, that is
supposed to be Beautiful code.
I'm still doing validation of my comparision and plan to write a
wrapper for both the C Sharp and C++/C executables to test them
thoroughly for both comparative correctness and comparative
performance. But preliminary results indicate that for the regular
expression C.t, and a string consisting of 38 "d"s and Cat, both
return the same results.
The Pike code takes on my contemporary machine about .6 seconds for
1000000 iterations. The C Sharp code takes about 2.7 seconds.
This is such a minimal gain in efficiency emerging after ONE MILLION
iterations, that, in my view, Pike's code isn't Beautiful, and the
use
of C to point at "strings" is bad practice today as is the use of C,
or C embedded within C++.
Brian, please don't praise Rob for taking only an hour. This culture
of "doing it fast" cannot produce Beautiful or correct code, and it
has resulted at many companies in the destruction of programmer
morale, and their family life, as they are continually bullied to
repeatedly do, and redo, bad code in inadequate languages, "fast"
based on role models such as you and Pike are, deservedly enough.
There are no bugs in Rob's code: bugs in the code presented here are
my fault if they exist in fact.
Flamers may flame, and be damned. Cyberstalkers and cyberbullies may
cyberstalk and cyberbully, and be damned. I do not have the time at
this moment to see if they are afoot, and will reply to them as
needed
at a later time.
The following consists of:
(1) My summary of the flaws in the Pike code in Beautiful Code
(O'Reilly Media, 2007)
(2) The only file you need to create a C Sharp command line
executable
that uses strings to execute the C sharp version of the Pike code in
Visual Studio C Sharp
(3) The only file you need to create a C++ command line executable
that implements the Pike code, repeated from "elsethread"
FLAWS IN THE PIKE CODE
It doesn't implement a regular expression interpreter insofar as
"regular expressions" have a formal, mathematical meaning: it makes
no
provision for a character which must occur at least once and it makes
no provision for nesting as in (a+b)*, and, it returns the shortest
conformant string. In a mathematical sense, the longest conformant
string can be, possibly, transformed into the SET of conformant
strings but the shortest conformant string is only one member. The
code attains its micro-efficiency by not doing a thorough job.
It is idiomatic C, which uses a parameter passed by value as the
changeable index into the string it points-at. This teaches bad
practice unusable in other languages and unsafe in C from the
standpoint of program maintenance.
While advertised as a string processor, it does not handle modern
Unicode or double-byte strings containing international input.
Kernighan does express, in the text of the essay, the concern that
the
code, which does heavy recursion proportional to the string length,
might run slowly or overrun stack limits.
But no consideration exists in the code or Brian's essay, that a
Beautiful way exists to avoid most recursion and speed up the code.
If
the regular expression does not start with a string start carat or
string end dollar sign, its first character that is not followed by
an
asterisk is a "handle" which which can be scanned-for in a
nonrecursive loop or by a strcspn function.
Of course, starter code that searches the regular expression for a
fixed handle (such as is c in cd*) might not be considered Beautiful
in a minimalist sense.
I plan to implement this later in the C Sharp code.
The metacharacters carat, asterisk and dollar sign outside of
expected
positions are treated as ordinary data in an undocumented bug/
feature, and no escape character is supported.
The matching of a specific character against a specific character in
the regular expression is coded in two separate places in two
separate
procedures. This is very poor programming practice, and in itself
makes this Ugly Code.
THE C SHARP CODE FILE
using System;
using System.Collections.Generic;
using System.Text;
namespace kernighanRegexCsharp
{
class Program
{
static void Main(string[] args)
{
const int REPCOUNT_BACKUP = 1000000;
const string REGEX_BACKUP = "C.t";
const string TESTSTRING_BACKUP =
"ddddddddddddddddddddddddddddddddddddddCat";
System.Diagnostics.Stopwatch stopWatch = new
System.Diagnostics.Stopwatch();
int repeats, reps;
int result = 0;
string regex;
string testString;
int satisfierIndex = 0;
int satisfierLength = 0;
int check = 0;
if (args.GetUpperBound(0) < 2)
{
System.Console.Write("My syntax is
kernighanRegexCsharp
<repeats> <regularExpression> <string> [c]\nUse the c at the end of
the command line to check results from repetition for identity to
previous results\n");
System.Console.Write("I'll use my backup test
values\n");
repeats = REPCOUNT_BACKUP;
regex = REGEX_BACKUP;
testString = TESTSTRING_BACKUP;
} else
{
repeats = System.Convert.ToInt32(args[0]);
regex = args[1];
testString = args[2];
if (args.GetUpperBound(0) >= 3)
check = (args[3].ToUpper() == "C") ?
1 : 0;
}
System.Console.Write("Applying the regular expression
" +
"\"" + regex + "\" " +
"to the string " +
"\"" + testString + "\" " +
repeats.ToString() + " " +
"time(s)\n");
int result1 = -1;
int satisfierIndex1 = 0;
int satisfierLength1 = 0;
stopWatch.Start();
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexCsharp.match(regex,
testString,
ref satisfierIndex,
ref satisfierLength);
if (check == 1 && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result
== 1);
result1 = result;
satisfierIndex1 = satisfierIndex;
satisfierLength1 =
satisfierLength;
} else assertx(result == result1
&&
satisfierIndex ==
satisfierIndex1
&&
satisfierLength ==
satisfierLength1);
}
stopWatch.Stop();
double duration =
(double)stopWatch.ElapsedMilliseconds /
1000;
if (reps > 0)
{
System.Console.WriteLine("Satisfier string was
" +
(result == 0 ? "not " : "")
+
"found");
if (result == 1)
{
System.Console.WriteLine("Zero-origin
index of satisfier
string: " +
satisfierIndex.ToString());
System.Console.WriteLine("Length of
satisfier string: " +
satisfierLength.ToString());
}
System.Console.WriteLine("Time taken was about
" +
duration.ToString() + " " +
"seconds");
if (check == 1 && reps > 1)
System.Console.WriteLine("\nThe first
return value was
checked to see if it was 1 or zero,\nand each set of results starting
with result 2 was checked against\nthe first set of results to see if
they matched: all of these tests succeeded\n");
}
return;
}
static void assertx(bool condition)
{
if (!condition)
{
throw new Exception("Assertion has failed
\n");
}
}
}
public static class kernighanRegexCsharp
{
/* match: search for regexp anywhere in text */
public static int match(string regex, string text, ref
int start,
ref int length)
{
start = 0;
length = 0;
int textIndex = 0;
int regexIndex = 0;
if (regex[0] == '^')
{
regexIndex++;
if (matchhere(regex, ref regexIndex,
text, ref textIndex) == 1)
{
start = 0;
length = textIndex;
return 1;
} else return 0;
}
int textIndex0;
do { /* must look even if string is empty
*/
textIndex0 = textIndex;
if (matchhere(regex, ref regexIndex,
text, ref textIndex) == 1)
{
start = textIndex0;
length = textIndex -
textIndex0;
return 1;
}
} while (textIndex++ < text.Length);
return 0;
}
/* matchhere: search for regexp at beginning of text
*/
private static int matchhere(string regex, ref int
regexIndex,
string text, ref int textIndex)
{
if (regexIndex >= regex.Length) return 1;
int index;
if ((index = regexIndex + 1) < regex.Length &&
regex[index] == '*')
{
regexIndex += 2;
return matchstar(regex[regexIndex],
regex, ref regexIndex, text,
ref textIndex);
}
if (regex[regexIndex] == '$' && index >=
regex.Length)
return (textIndex >= text.Length) ?
1 : 0;
if (textIndex < text.Length && (regex[regexIndex] == '.'
|| regex[regexIndex] == text[textIndex]))
{
regexIndex += 1; textIndex += 1;
return matchhere(regex, ref
regexIndex, text, ref textIndex);
}
return 0;
}
/* matchstar: search for c*regex at beginning of text
*/
static int matchstar(char c, string regex, ref int
regexIndex,
string text, ref int textIndex)
{
do { /* a * matches zero or more instance
*/
if (matchhere(regex, ref regexIndex,
text, ref textIndex) == 1)
return 1;
} while (textIndex < text.Length &&
(text[textIndex++] == c || c ==
'.'));
return 0;
}
};
}
THE C++ CODE FILE
// kernighanRegexC.cpp : main project file.
#include "stdafx.h"
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <iostream>
#include <fstream>
using namespace std;
#define REPCOUNT_BACKUP ("1")
#define REGEX_BACKUP ("")
#define TESTSTRING_BACKUP ("dddCat")
using namespace System;
public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text, char
**start, int
*length)
{
*length = 0;
*start = text;
if (regexp[0] == '^')
{
return matchhere(regexp+1, &text)
?
(*length = text - *start,
1)
:
(*start = 0, 0);
}
do { /* must look even if string is empty
*/
*start = text;
if (matchhere(regexp, &text))
{
*length = text - *start;
return 1;
}
} while (*text++ != '\0');
*start = 0;
return 0;
}
private:
/* matchhere: search for regexp at beginning of text
*/
static int matchhere(char *regexp, char **textp)
{
if (regexp[0] == '\0') return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2,
textp);
if (regexp[0] == '$' && regexp[1] == '\0')
return **textp == '\0';
if (**textp!='\0' && (regexp[0]=='.' ||
regexp[0]==**textp))
return matchhere(regexp+1, (++
(*textp), textp));
return 0;
}
/* matchstar: search for c*regexp at beginning of text
*/
static int matchstar(int c, char *regexp, char
**textp)
{
do { /* a * matches zero or more instance
*/
if (matchhere(regexp, textp))
return 1;
} while (**textp != '\0' && (*(*textp)++ == c
|| c == '.'));
return 0;
}
};
void assertx(int condition)
{
if (!condition)
{
printf("Assertion has failed\n");
abort();
}
}
int main(int argc, char *argv[] )
{
clock_t start;
int repeats, reps;
int result = 0;
char *regexAddr;
char *stringAddr;
char *satisfierAddr;
int satisfierLength;
int check = 0;
if (argc < 4)
{
printf("My syntax is kernighanRegexC <repeats>
<regularExpression>
<string> [c]\nUse the c at the end of the command line to check
results from repetition for identity to previous results\n");
printf("I'll use my backup test values\n");
repeats = atoi((REPCOUNT_BACKUP));
regexAddr = (REGEX_BACKUP);
stringAddr = (TESTSTRING_BACKUP);
} else
{
repeats = atoi(argv[1]);
regexAddr = (argv[2]);
stringAddr = (argv[3]);
if (argc > 4)
check = (argv[4][0] == 'c' && argv[4][1] ==
'\0');
}
printf( "Applying the regular expression '%s' to the string
'%s' %i
time(s)\n",
regexAddr, stringAddr, repeats);
start = clock();
int result1 = -1;
int satisfierIndex1;
int satisfierLength1;
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexC::match(regexAddr,
stringAddr,
&satisfierAddr,
&satisfierLength);
if (check && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result == 1);
result1 = result;
satisfierIndex1 = satisfierAddr -
stringAddr;
satisfierLength1 = satisfierLength;
} else assertx(result == result1
&&
satisfierIndex1 ==
satisfierAddr - stringAddr
&&
satisfierLength ==
satisfierLength1);
}
double duration = (double)( clock() - start) /
CLOCKS_PER_SEC;
if (reps > 0)
{
printf( "Satisfier string was %sfound\n", result==0 ?
"not " : "");
if (result == 1)
{
printf("Zero-origin index of satisfier string:
%i\n", satisfierAddr
- stringAddr);
printf("Length of satisfier string: %i\n",
satisfierLength );
printf("Starting address of test string: %x
\n", stringAddr);
printf("Starting address of satisfier string:
%x\n",
satisfierAddr);
}
printf( "Time taken was about %2.8f seconds\n",
duration );
if (check && reps > 1)
printf("\nThe first return value was checked
to see if it was 1 or
zero,\nand each set of results starting with result 2 was checked
against\nthe first set of results to see if they matched: all of
these
tests succeeded\n");
}
return 0;
}
.
- Follow-Ups:
- Re: Kernighan and Pike's "Beautiful" Code
- From: moi
- Re: Kernighan and Pike's "Beautiful" Code
- From: Walter Banks
- Re: Kernighan and Pike's "Beautiful" Code
- From: Stephen Howe
- Re: Kernighan and Pike's "Beautiful" Code
- Prev by Date: Re: Brian Kernighan, maybe I'm not worthy, maybe I'm scum
- Next by Date: Re: Brian Kernighan, maybe I'm not worthy, maybe I'm scum
- Previous by thread: Array average algorithms
- Next by thread: Re: Kernighan and Pike's "Beautiful" Code
- Index(es):
Relevant Pages
|
|