Re: Searching for byte string in a binary file.

Jens.Toerring_at_physik.fu-berlin.de
Date: 05/16/04


Date: 16 May 2004 11:39:57 GMT


"Walter Dnes (delete the 'z' to get my real address)" <wzaltdnes@waltdnes.org> wrote:
> I took a C course some time ago, but I'm only now beginning to use it,
> for a personal pet project. My current stumbling-block is finding an
> efficient way to find a match between the beginning of a "counted"
> string and data in a binary file. Given...

> #include <stdio.h>

Also include <stdlib.h>, see below.

> int main(int argc, char *argv[])
> {
> char bstring[255];
> int bstring_length, match_length;
> long match_location;
> FILE *input_file, *output_file;

> if(argc != 3){
> printf(" Correct usage:\n %s input_filename output_filename\n", argv[0]);
> return 1;

Make that

     return EXIT_FAILURE;

not all OSes return 1 for failure and 0 for success, and with the macros
EXIT_FAILURE and EXIT_SUCCESS, defined in <stdlib.h>, you're always on
the safe side.

Another common convention is to print error messages to stderr and not
mix them with the normal output of your program. So better use

 fprintf( stderr, "Correct usage:\n %s input_filename output_filename\n",
          argv[0]);

> }
> if((input_file = fopen(argv[1], "rb")) == NULL){
> printf("Error opening %s for input\n", argv[1]);
> return 1;
> }
> if((output_file = fopen(argv[2], "wb")) == NULL){
> printf("Error opening %s for output\n", argv[2]);
> return 1;
> }
> ...

> Later on, bstring and bstring_length get initialized. bstring may
> contain \0's, so the "logical length" (which will not exceed 255) is
> actually stored in bstring_length. What I'm trying to do is to find the
> location and length of the longest match from the left side of bstring.
> If that's not clear, here's an example...

> bstring[0] = 'a';
> bstring[1] = '\0';
> bstring[2] = 'b';
> bstring[3] = 'c';
> bstring[4] = 'd';
> bstring_length = 5;

> Assume that the sequence is not matched anywhere in the file. However,
> assume that the sequence 'a', '\0', 'b', 'c' does exist in the file.
> What's the most efficient search that returns match_length ( 4 ) and
> ftell() of the beginning of the match? If it's a function, should I
> declare in main() like so?

> int bstring_length, *match_length;
> long *match_location;

> I assume that the function prototype would be something like...

> int findmatch( bstring[255], bstring_length, match_length, match_location);

Sorry, that's not a prototype.

> By passing pointers, I assume that both match_location and
> match_length will be modified and available to main(). Or is my
> understanding of pointers lacking. findmatch() should return 0 if a
> match is found, and 1 if no match.

When you want to set 'match_location' within find_match() to a long
value that's visible in main() you need to change that a bit. You
must define 'match_location' in main() as long, i.e. have there

long match_location;

and then pass a pointer to that variable to the function. So your
prototype for the function would look like

int find_match( char buf[ 255 ], int blen, int mlen, long *loc );

(I intentionally changed the names to be used within the function
from the ones you defined in main() to make it more obvious what
belongs to main() and what belongs to find_match().) If you don't
intend to change the char array within the function it probably
would be better to make the first argument "const char buf[ 255 ]".

You now would call that function like

  find_match( bstring, bstring_length, match_len, &match_location );

(please note the '&' in front of 'match_location', it tells that
you pass the address of 'match_location' and not its value).
Within find_match() you would then assign the position of the match
to '*loc', i.e. you write it into the memory location 'loc' points
to.

By the way, using names like 'bstring' and 'bstring_length' is a bit
misleading since 'bstring' isn't a string (a real string can't have
embedded '\0' characters), it's just an array of chars. So it might
be prudent to give it a name that doesn't make people assume that
it's going to be a string.

> Would memory-mapped files help here?

The rest of your questions on how to do the matching in the fastest,
most effective way are not really C questions but more about what
kind of algorithm to use. That's better discussed in groups like
comp.programming since it really doesn't is is about C. And a google
search for e.g. "Boyer-Moore algorithm" (just to name one) will
probably come up with a lot of interesting links (and show you
how many possible algorithms have been developed for that problem
over the years. E.g.

http://www-igm.univ-mlv.fr/~lecroq/string/

looks rather interesting). Finally, questions about memory-mapping
files are off-topic here because that can only be done with system-
specific extensions to C.
                                        Regards, Jens

-- 
  \   Jens Thoms Toerring  ___  Jens.Toerring@physik.fu-berlin.de
   \__________________________  http://www.toerring.de


Relevant Pages

  • Re: [newbie] Best way to search for binary data
    ... [string searching ... ... > searching random bytes. ... algorithms are usually preferred to the Knuth-Morris-Pratt algorithm. ... void init_search(const char *string) ...
    (microsoft.public.vc.language)
  • Re: Mimic AES_ENCRYPT and AES_DECRYPT functions in Ruby
    ... def mysql_key ... # The algorithm just creates a 16 byte buffer set to all zero, ... # Runs bitwise XOR for each char on string ...
    (comp.lang.ruby)
  • Re: permutations
    ... make a list of the permutation of the string you have left. ... The algorithm is recursive -- making a list of permutations is done from a ... void swap(char *s, char *t) ...
    (comp.lang.c)
  • Re: permutations
    ... make a list of the permutation of the string you have left. ... The algorithm is recursive -- making a list of permutations is done from ... void swap(char *s, char *t) ...
    (comp.lang.c)
  • Re: How to add thousand separators
    ... First, this code is obsolete as written, because char is a dead data type and should not ... Note that both of these should be stored as string resources since they might need to be ... 18 digits for any reason. ... you have made a VERY SERIOUS DESIGN ERROR. ...
    (microsoft.public.vc.mfc)