Re: Brian Kernighan, maybe I'm not worthy, maybe I'm scum
- From: spinoza1111 <spinoza1111@xxxxxxxxx>
- Date: Wed, 2 Jan 2008 22:26:48 -0800 (PST)
On Dec 28 2007, 12:55 am, spinoza1111 <spinoza1...@xxxxxxxxx> wrote:
Here is a summary of what I think are the flaws in the code Rob Pike
wrote for Brian Kernighan, and which Brian presents as Beautful Code
in a book of that name, published this year by O'Reilly, on page 3 (an
unindented copy of Rob's code as keyed by me, and wrapped in a C++
class, is at my developerDotStar blog). I read C but have long since
abandoned it because I don't think Beautiful Code can be written in C:
the Pike code, I think, is an example of code that only seems
Beautiful.
Hong Kong 1-3-2007 2:21 PM
I've implemented Rob Pike's algorithm in C Sharp, replacing its use of
addresses with Strings to provide internationalization and above all
safety as the code is changed. But I've preserved most of its
features...including something I noted while doing the conversion:
Pike's bad implementation of the compare for a character match
(including the match of dot) in two places: in matchhere() and in
matchstar(). This is something which should have been done with a
function call or a macro in the text presented as any code, that is
supposed to be Beautiful code.
I'm still doing validation of my comparision and plan to write a
wrapper for both the C Sharp and C++/C executables to test them
thoroughly for both comparative correctness and comparative
performance. But preliminary results indicate that for the regular
expression C.t, and a string consisting of 38 "d"s and Cat, both
return the same results.
The Pike code takes on my contemporary machine about .6 seconds for
1000000 iterations. The C Sharp code takes about 2.7 seconds.
This is such a minimal gain in efficiency emerging after ONE MILLION
iterations, that, in my view, Pike's code isn't Beautiful, and the use
of C to point at "strings" is bad practice today as is the use of C,
or C embedded within C++.
Brian, please don't praise Rob for taking only an hour. This culture
of "doing it fast" cannot produce Beautiful or correct code, and it
has resulted at many companies in the destruction of programmer
morale, and their family life, as they are continually bullied to
repeatedly do, and redo, bad code in inadequate languages, "fast"
based on role models such as you and Pike are, deservedly enough.
There are no bugs in Rob's code: bugs in the code presented here are
my fault if they exist in fact.
Flamers may flame, and be damned. Cyberstalkers and cyberbullies may
cyberstalk and cyberbully, and be damned. I do not have the time at
this moment to see if they are afoot, and will reply to them as needed
at a later time.
The following consists of:
(1) My summary of the flaws in the Pike code in Beautiful Code
(O'Reilly Media, 2007)
(2) The only file you need to create a C Sharp command line executable
that uses strings to execute the C sharp version of the Pike code in
Visual Studio C Sharp
(3) The only file you need to create a C++ command line executable
that implements the Pike code, repeated from "elsethread"
FLAWS IN THE PIKE CODE
It doesn't implement a regular expression interpreter insofar as
"regular expressions" have a formal, mathematical meaning: it makes no
provision for a character which must occur at least once and it makes
no provision for nesting as in (a+b)*, and, it returns the shortest
conformant string. In a mathematical sense, the longest conformant
string can be, possibly, transformed into the SET of conformant
strings but the shortest conformant string is only one member. The
code attains its micro-efficiency by not doing a thorough job.
It is idiomatic C, which uses a parameter passed by value as the
changeable index into the string it points-at. This teaches bad
practice unusable in other languages and unsafe in C from the
standpoint of program maintenance.
While advertised as a string processor, it does not handle modern
Unicode or double-byte strings containing international input.
Kernighan does express, in the text of the essay, the concern that the
code, which does heavy recursion proportional to the string length,
might run slowly or overrun stack limits.
But no consideration exists in the code or Brian's essay, that a
Beautiful way exists to avoid most recursion and speed up the code. If
the regular expression does not start with a string start carat or
string end dollar sign, its first character that is not followed by an
asterisk is a "handle" which which can be scanned-for in a
nonrecursive loop or by a strcspn function.
Of course, starter code that searches the regular expression for a
fixed handle (such as is c in cd*) might not be considered Beautiful
in a minimalist sense.
I plan to implement this later in the C Sharp code.
The metacharacters carat, asterisk and dollar sign outside of expected
positions are treated as ordinary data in an undocumented bug/
feature, and no escape character is supported.
The matching of a specific character against a specific character in
the regular expression is coded in two separate places in two separate
procedures. This is very poor programming practice, and in itself
makes this Ugly Code.
THE C SHARP CODE FILE
using System;
using System.Collections.Generic;
using System.Text;
namespace kernighanRegexCsharp
{
class Program
{
static void Main(string[] args)
{
const int REPCOUNT_BACKUP = 1000000;
const string REGEX_BACKUP = "C.t";
const string TESTSTRING_BACKUP =
"ddddddddddddddddddddddddddddddddddddddCat";
System.Diagnostics.Stopwatch stopWatch = new
System.Diagnostics.Stopwatch();
int repeats, reps;
int result = 0;
string regex;
string testString;
int satisfierIndex = 0;
int satisfierLength = 0;
int check = 0;
if (args.GetUpperBound(0) < 2)
{
System.Console.Write("My syntax is kernighanRegexCsharp
<repeats> <regularExpression> <string> [c]\nUse the c at the end of
the command line to check results from repetition for identity to
previous results\n");
System.Console.Write("I'll use my backup test values\n");
repeats = REPCOUNT_BACKUP;
regex = REGEX_BACKUP;
testString = TESTSTRING_BACKUP;
} else
{
repeats = System.Convert.ToInt32(args[0]);
regex = args[1];
testString = args[2];
if (args.GetUpperBound(0) >= 3)
check = (args[3].ToUpper() == "C") ? 1 : 0;
}
System.Console.Write("Applying the regular expression " +
"\"" + regex + "\" " +
"to the string " +
"\"" + testString + "\" " +
repeats.ToString() + " " +
"time(s)\n");
int result1 = -1;
int satisfierIndex1 = 0;
int satisfierLength1 = 0;
stopWatch.Start();
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexCsharp.match(regex,
testString,
ref satisfierIndex,
ref satisfierLength);
if (check == 1 && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result == 1);
result1 = result;
satisfierIndex1 = satisfierIndex;
satisfierLength1 = satisfierLength;
} else assertx(result == result1
&&
satisfierIndex == satisfierIndex1
&&
satisfierLength == satisfierLength1);
}
stopWatch.Stop();
double duration = (double)stopWatch.ElapsedMilliseconds /
1000;
if (reps > 0)
{
System.Console.WriteLine("Satisfier string was " +
(result == 0 ? "not " : "") +
"found");
if (result == 1)
{
System.Console.WriteLine("Zero-origin index of satisfier
string: " +
satisfierIndex.ToString());
System.Console.WriteLine("Length of satisfier string: " +
satisfierLength.ToString());
}
System.Console.WriteLine("Time taken was about " +
duration.ToString() + " " +
"seconds");
if (check == 1 && reps > 1)
System.Console.WriteLine("\nThe first return value was
checked to see if it was 1 or zero,\nand each set of results starting
with result 2 was checked against\nthe first set of results to see if
they matched: all of these tests succeeded\n");
}
return;
}
static void assertx(bool condition)
{
if (!condition)
{
throw new Exception("Assertion has failed\n");
}
}
}
public static class kernighanRegexCsharp
{
/* match: search for regexp anywhere in text */
public static int match(string regex, string text, ref int start,
ref int length)
{
start = 0;
length = 0;
int textIndex = 0;
int regexIndex = 0;
if (regex[0] == '^')
{
regexIndex++;
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
{
start = 0;
length = textIndex;
return 1;
} else return 0;
}
int textIndex0;
do { /* must look even if string is empty */
textIndex0 = textIndex;
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
{
start = textIndex0;
length = textIndex - textIndex0;
return 1;
}
} while (textIndex++ < text.Length);
return 0;
}
/* matchhere: search for regexp at beginning of text */
private static int matchhere(string regex, ref int regexIndex,
string text, ref int textIndex)
{
if (regexIndex >= regex.Length) return 1;
int index;
if ((index = regexIndex + 1) < regex.Length && regex[index] == '*')
{
regexIndex += 2;
return matchstar(regex[regexIndex], regex, ref regexIndex, text,
ref textIndex);
}
if (regex[regexIndex] == '$' && index >= regex.Length)
return (textIndex >= text.Length) ? 1 : 0;
if (textIndex < text.Length && (regex[regexIndex] == '.'
|| regex[regexIndex] == text[textIndex]))
{
regexIndex += 1; textIndex += 1;
return matchhere(regex, ref regexIndex, text, ref textIndex);
}
return 0;
}
/* matchstar: search for c*regex at beginning of text */
static int matchstar(char c, string regex, ref int regexIndex,
string text, ref int textIndex)
{
do { /* a * matches zero or more instance */
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
return 1;
} while (textIndex < text.Length && (text[textIndex++] == c || c ==
'.'));
return 0;
}
};
}
THE C++ CODE FILE
// kernighanRegexC.cpp : main project file.
#include "stdafx.h"
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <iostream>
#include <fstream>
using namespace std;
#define REPCOUNT_BACKUP ("1")
#define REGEX_BACKUP ("")
#define TESTSTRING_BACKUP ("dddCat")
using namespace System;
public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text, char **start, int
*length)
{
*length = 0;
*start = text;
if (regexp[0] == '^')
{
return matchhere(regexp+1, &text)
?
(*length = text - *start, 1)
:
(*start = 0, 0);
}
do { /* must look even if string is empty */
*start = text;
if (matchhere(regexp, &text))
{
*length = text - *start;
return 1;
}
} while (*text++ != '\0');
*start = 0;
return 0;
}
private:
/* matchhere: search for regexp at beginning of text */
static int matchhere(char *regexp, char **textp)
{
if (regexp[0] == '\0') return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, textp);
if (regexp[0] == '$' && regexp[1] == '\0')
return **textp == '\0';
if (**textp!='\0' && (regexp[0]=='.' || regexp[0]==**textp))
return matchhere(regexp+1, (++(*textp), textp));
return 0;
}
/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, char *regexp, char **textp)
{
do { /* a * matches zero or more instance */
if (matchhere(regexp, textp))
return 1;
} while (**textp != '\0' && (*(*textp)++ == c || c == '.'));
return 0;
}
};
void assertx(int condition)
{
if (!condition)
{
printf("Assertion has failed\n");
abort();
}
}
int main(int argc, char *argv[] )
{
clock_t start;
int repeats, reps;
int result = 0;
char *regexAddr;
char *stringAddr;
char *satisfierAddr;
int satisfierLength;
int check = 0;
if (argc < 4)
{
printf("My syntax is kernighanRegexC <repeats> <regularExpression>
<string> [c]\nUse the c at the end of the command line to check
results from repetition for identity to previous results\n");
printf("I'll use my backup test values\n");
repeats = atoi((REPCOUNT_BACKUP));
regexAddr = (REGEX_BACKUP);
stringAddr = (TESTSTRING_BACKUP);
} else
{
repeats = atoi(argv[1]);
regexAddr = (argv[2]);
stringAddr = (argv[3]);
if (argc > 4)
check = (argv[4][0] == 'c' && argv[4][1] == '\0');
}
printf( "Applying the regular expression '%s' to the string '%s' %i
time(s)\n",
regexAddr, stringAddr, repeats);
start = clock();
int result1 = -1;
int satisfierIndex1;
int satisfierLength1;
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexC::match(regexAddr,
stringAddr,
&satisfierAddr,
&satisfierLength);
if (check && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result == 1);
result1 = result;
satisfierIndex1 = satisfierAddr - stringAddr;
satisfierLength1 = satisfierLength;
} else assertx(result == result1
&&
satisfierIndex1 == satisfierAddr - stringAddr
&&
satisfierLength == satisfierLength1);
}
double duration = (double)( clock() - start) / CLOCKS_PER_SEC;
if (reps > 0)
{
printf( "Satisfier string was %sfound\n", result==0 ? "not " : "");
if (result == 1)
{
printf("Zero-origin index of satisfier string: %i\n", satisfierAddr
- stringAddr);
printf("Length of satisfier string: %i\n", satisfierLength );
printf("Starting address of test string: %x\n", stringAddr);
printf("Starting address of satisfier string: %x\n",
satisfierAddr);
}
printf( "Time taken was about %2.8f seconds\n", duration );
if (check && reps > 1)
printf("\nThe first return value was checked to see if it was 1 or
zero,\nand each set of results starting with result 2 was checked
against\nthe first set of results to see if they matched: all of these
tests succeeded\n");
}
return 0;
}
.
- Prev by Date: Re: Oversight and Programmer productivity
- Next by Date: Re: Oversight and Programmer productivity
- Previous by thread: Re: Brian Kernighan, maybe I'm not worthy, maybe I'm scum
- Next by thread: Re: Brian Kernighan, maybe I'm not worthy, maybe I'm scum
- Index(es):
Relevant Pages
|