Re: Standard C Library regex performance issue



igor.kulkin@xxxxxxxxx wrote:

That might be true. Still, the regexp, in spite of being very long, should be
very straightforward.
Here is the regexp (I would understand if no one read it):

^([[:alpha:]]{3} +[[:digit:]]{1,2} +[[:digit:]]{1,2}:[[:digit:]]{1,2}:
[[:digit:]]{1,2}) +([^ ]+)-([[:digit:]]+)-([[:alnum:]]+)\\[([[:digit:]]
+)\\] +([^ ]+)( +(([^ ]+)\\(([[:digit:]]*)\\)))?: (.*)\\n?$

This is hideous.

It simply matches the line in the log file and has no fancy stuff
involved.

Lots of fancy stuff involved for dubious reasons.

The way I read the file should not matter, as I've tried running the
read-file-line-by-line code separately (I commented out the regexp stuff) and
it ran really fast.

So you're saying it runs very fast when you read a file line by line but
doesn't run fast when you don't? Well yes then of course it matters.

Here is the Python code I've benchmarked:

import re

PATTERN = re.compile(
    r"^(\w{3}\s*\d{1,2}\s*\d{1,2}\:\d{1,2}\:\d{1,2})"
    r"\s*(([^\s]+)\[(\d*)\])\s*([^\s]+)\s*(([^\s]+)\((\d*)\))\:\s(.*?)\n?$")

----
You can set regex matching modes by specifying a special constant as a third
parameter to re.search(). re.I or re.IGNORECASE applies the pattern case
insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or
re.MULTILINE makes the caret and dollar match after and before line breaks in
the subject string. There is no difference between the single-letter and
descriptive options, except for the number of characters you have to type in.
To specify more than one option, "or" them together with the | operator:
re.search("^a", "abc", re.I | re.M).

By default, Python's regex engine only considers the letters A through Z, the
digits 0 through 9, and the underscore as "word characters". Specify the flag
re.L or re.LOCALE to make \w match all characters that are considered letters
given the current locale settings. Alternatively, you can specify re.U or
re.UNICODE to treat all letters from all scripts as word characters. The
setting also affects word boundaries.
----

The above implies that Python's newline mode is *ON* by default. POSIX
regcomp() does NOT have newline mode on by default.
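
To make the regcomp() side of that concrete, here is a minimal sketch. The helper name matches() is made up for illustration and is not from the original post. As far as I read POSIX, without REG_NEWLINE a newline is an ordinary character: "." matches it and "$" only matches at the very end of the string. With REG_NEWLINE, "." stops matching newlines and "^" and "$" also match around them.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` as an extended regex with the
 * given extra flags and test it against `text`.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 */
static int matches(const char *pattern, const char *text, int extra_cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: we only care whether it matches, not where. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this, matches("end$", "end\n", 0) should report no match, while matches("end$", "end\n", REG_NEWLINE) should report a match, which is why a line that still carries its trailing newline behaves differently under the two modes.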

fp = open("some.log", "r")

for line in fp:
    mo = PATTERN.match(line)

fp.close()

Which is line by line.

And here is the C code:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>


// Just a helper function.
char * read_line(FILE * in) {
    size_t line_len;
    char * buf;
    buf = fgetln(in, &line_len);

    if (!buf) return NULL;

    while (line_len > 0 && (buf[line_len - 1] == (char) 10 ||
           buf[line_len - 1] == (char) 13)) line_len--;

Remove that. There's no reason you should have to do any of that if fgetln()
is doing what it's told.

DESCRIPTION
The fgetln() function returns a pointer to the next line from the stream
referenced by stream. This line is not a C string as it does not end
with a terminating NUL character. The length of the line, including the
final newline, is stored in the memory location to which len points.
(Note, however, that if the line is the last in a file that does not end
in a newline, the returned text will not contain a newline.)

Looks like some BSD4.4 function. Also looks like it operates on a FILE stream
and uses a static pointer of some sort. In short, please just avoid this
function altogether and use fgets().

    char *line = malloc(line_len + 1);

    strncpy(line, buf, line_len);

    line[line_len] = (char) 0;

    return line;
}

Right. So the issue here is that you don't know your maximum line length, which
is probably what led you to find a function like fgetln() in the first place.
This is one area where you've got to either establish a reasonable boundary
size and use that as the size of your temporary buffer, or use fread() and do
the buffer management yourself. In other words, if you do not foresee any line
being longer than, let's say, 1024 characters, use a simple temporary buffer of
1024 char and throw an error when you hit the maximum line length.
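
A rough sketch of that fixed-buffer approach with fgets() follows. MAX_LINE and this read_line() signature are my own placeholders, not anything from the original code, and the 1024 limit is only the example figure used above. The line is left untouched, newline included, so the caller can tell a complete line from a truncated one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on line length */

/*
 * Reads the next line into buf (capacity cap), keeping any trailing newline.
 * Returns 1 if a whole line was read, 0 at end of file, and -1 if the line
 * is too long for the buffer.
 */
static int read_line(FILE *in, char *buf, size_t cap)
{
    size_t len;
    int c;

    assert(in != NULL && buf != NULL && cap > 1);

    if (fgets(buf, (int)cap, in) == NULL)
        return 0;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return 1;               /* complete line, newline included */
    if (len < cap - 1)
        return 1;               /* short read: last line had no newline */

    /* Buffer is full with no newline: peek to see if the file ends here. */
    c = getc(in);
    if (c == EOF)
        return 1;
    ungetc(c, in);
    return -1;                  /* line exceeds the buffer */
}
```

A caller would declare char buf[MAX_LINE], loop while read_line(fp, buf, sizeof buf) returns 1, and report an error and stop on -1. No malloc() per line is needed, which also removes the copy that the original helper made.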

int main(int argc, char **argv) {
    regex_t regex;
    int errc = regcomp(&regex,
        "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:"
        "[0-9]{1,2}:[0-9]{1,2}) +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\] +"
        "([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)\\n?$",
        REG_EXTENDED | REG_ICASE);

Add REG_NEWLINE.
Please remove "\\n?" from your regex.

Also, your regex:
^<0 or 1 matches of a formatted string>: <0 or more chars>$
is not the most efficient use of regex, and you should probably examine your
logfile format as well.

However the problem to me is that you did not set REG_NEWLINE.
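
For reference, a sketch of what the regcomp() call might look like with both changes applied: REG_NEWLINE added and the trailing "\\n?" dropped. The pattern is otherwise the one from the post. The names LOG_PATTERN, LOG_NGROUPS and compile_log_regex() are invented here, and I have not measured whether this changes the timing.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define LOG_NGROUPS 12   /* whole match plus the 11 parenthesized groups */

/* The pattern from the post, split across literals, without "\\n?". */
static const char LOG_PATTERN[] =
    "^([a-zA-Z_]{3} +[0-9]{1,2} +[0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2})"
    " +([^ ]+)-([0-9]+)-([a-zA-Z_]+)\\[([0-9]+)\\]"
    " +([^ ]+)( +(([^ ]+)\\(([0-9]*)\\)))?: (.*)$";

/*
 * Compiles LOG_PATTERN into *re with REG_NEWLINE set.
 * Returns 0 on success; otherwise returns the regcomp() error code and,
 * if errbuf is non-NULL, stores the regerror() message in it.
 */
static int compile_log_regex(regex_t *re, char *errbuf, size_t errlen)
{
    int rc;

    assert(re != NULL);

    rc = regcomp(re, LOG_PATTERN, REG_EXTENDED | REG_ICASE | REG_NEWLINE);
    if (rc != 0 && errbuf != NULL && errlen > 0)
        regerror(rc, re, errbuf, errlen);
    return rc;
}
```

After a successful compile, regexec(&re, line, LOG_NGROUPS, m, 0) with a regmatch_t m[LOG_NGROUPS] array can be run directly on a line that still ends in "\n"; the final (.*) group should then stop before the newline. Compile once before the read loop and call regfree() once after it.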