Re: Can anyone explain Bayesian filtering/calculations?

From: Richard Heathfield (dontmail_at_address.co.uk.invalid)
Date: 10/02/03


Date: Thu, 2 Oct 2003 06:49:16 +0000 (UTC)

Corey Murtagh wrote:

> Ok, so I'm messing around with writing a spam/virus filter for my
> mailbox. After reading up on the methods used by the various existing
> packages (which I'm choosing not to use because it's fun to try :>), I
> think I've got a fairly good handle on what I want to do.
>
> Where I'm stuck is in calculating the effects of combining the various
> signatures' probabilities. It seems that the popular method at the
> moment is to use Bayes' Rule to handle the probability calcs... but
> statistics is most definitely /not/ my strong suit, and there's way too
> much high-level stuff out there for easy googling.

No, it's really easy, honest! They just make it look hard. It's one of those
conspiracy things (like STL, or ncurses).

> So... can anyone help with an explanation?

Yeah, but never forget that I'm long-winded. This is easy, but I have a
tendency to stretch things out to make them sound even easier. So don't be
put off by the size of the article. :-)

If we take spam as our text: any word might appear in spam, and that same
word might appear in not-spam (ham). In a "training" phase, we ask the user
to identify all emails as either ham or spam (or, possibly, as "don't know"
in which case we ignore them completely for our purposes). We tokenise the
emails, identifying words (or N-graphs, or whatever you think is best -
words are fine) and adding counts to a database. (I use a text file, with
each line consisting of a tab-separated line:

word <tab> hamcount <tab> spamcount

So, over time, we build up a picture of how many times a word is used in
ham, compared with how many times it's used in spam.

We can therefore calculate a probability (careful! Probability can trap you
easily if you're not watching):

  p = hamcount / (hamcount + spamcount)

What does p represent? It is an approximate measure of the probability that
an email is spam, **given that it contains a particular word**. So - if the
word is "medication", my database might look like this:

dollars 6 1907

so the probability that an email to me, containing the word "dollars", is
spam is very high. In fact, it's 0.996864, which is pretty close to 1
(certainty).

For some words (and if I were to give you an example, we'd be involved in a
contradiction!), it might actually /be/ 1, but we don't actually let any
one word dominate proceedings like that, so we'd probably frob it down to
0.999999. You'll see why in a bit.

Okay, that's just one word. Here's another:

exiting 28 14

Here, we see that there is a 0.333333 probability that an email with the
word "exiting" in it is spam.

So - what if an email contains the word "exiting" AND the word "dollars"?
What is the probability of it being spam? Well, it turns out to be:

a = 0.333333
b = 0.996864
p = (a * b) / ((a * b) + ((1 - a) * (1 - b))))

which is 0.332288 / (0.332288 + 0.209067) or 0.613808. This is the
probability (if we have no other relevant information) that the email is
spam.

If we had three words, we'd use:

(a * b * c) / (a * b * c)((1 - a)(1 - b)(1 - c))

instead. And so on.

That's it - the whole thing.

To get a good idea of whether an email is spam, then, just parse it, check
it against your database, find the N most interesting items (that is, those
items with a high value of fabs(0.5 - p)), and do the calcs. (Some people
suggest that 20 is a good value of N. The same people also suggest that you
might want to double your ham count when calculating p, because otherwise
the ham can tend to be drowned out by the sheer volume of spam.)

> Or a site? Once I've got a
> handle on the method I'm sure I can code it, but an example in code form
> (pref. something I can recognize... ie: no perl :>) would go a long way.

How about the lingua franca? Here's my own code, which you are very welcome
to use under the terms of the GPL:

/* bayes.h */
#ifndef BAYES_H_
#define BAYES_H_ 1
#include <stddef.h>

double CalcBayesFilterValue(double *p, size_t n);
double GetBayesDistance(double d);
void RankBayesDistance(double *p, size_t numelems, double d);
#endif

/* bayes.c */

#include <stddef.h>
#include <math.h>

#include "bayes.h"

double CalcBayesFilterValue(double *p, size_t n)
{
  double m = 1.0;
  double b = 0.0;
  double d = 1.0;
  double epsilon = 0.0001;
  double upepsilon = 1.0 - epsilon;
  size_t i = 0;
  while(i < n)
  {
    m *= (p[i] > epsilon) ? p[i] : epsilon;
    d *= (p[i] > upepsilon) ? epsilon : (1.0 - p[i]);
    ++i;
  }
  b = m;
  m += d;
  b /= m;
  return b;
}

double GetBayesDistance(double d)
{
    return fabs(0.5 - d);
}

/* rankings are in terms of distance from 0.5.
 * Element 0 is the least far from 0.5. An
 * equal ranking at [0] is always replaced.
 */
void RankBayesDistance(double *p, size_t numelems, double d)
{
  double dist = GetBayesDistance(d);
  size_t i = 0;
  int done = 0;

  while(!done && i < numelems)
  {
    if(dist < GetBayesDistance(p[i]))
    {
        done = 1;
    }
    else
    {
      if(i > 0)
      {
        p[i - 1] = p[i];
      }
      p[i] = d;
    }
    ++i;
  }
}

#if TEST
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
  double d[] = {0.6, 0.72, 0.9, 0.1, 0.33, 0.7};
  double rank[sizeof d / sizeof d[0]] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
  size_t n = sizeof d / sizeof d[0];
  double b = CalcBayesFilterValue(d, n);
  int i;
  printf("%.6f\n", b);
  for(i = 0; i < 20; i++)
  {
    size_t place;
    double r = rand() / (RAND_MAX + 1.0);
    RankBayesDistance(rank, n, r);
    for(place = 0; place < n; place++)
    {
      printf(" %.5f", rank[place]);
    }
    printf("\n");
  }
  return 0;
}
#endif

-- 
Richard Heathfield : binary@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton


Relevant Pages

  • bayesian spam filter
    ... I have been looking at bayesian based spam filters and have a question about ... where we read this as the probability of the message, containing the word w, ... of choosing that word out of the entire corpus of known words. ... are spam, and 5,000 emails you know are not-spam and you use this to train ...
    (sci.math)
  • Re: Extended 5xx code (was Re: Drop UCE instead of forwarding off-site?)
    ... message's forwarding path. ... the higher the probability that it will span an update cycle of the ... recipients report spam. ... The overall probability of the race condition occuring for any service is ...
    (comp.mail.sendmail)
  • address harvesting (was: For sale: HP Integrity rx2600 parts (various))
    ... anyone harvests usenet for email addresses. ... that they respond positively to spam is much lower than in the general ... I get typically a one digit or low two digit number of ... Real names enhance the probability of getting real answers. ...
    (comp.os.vms)
  • Re: combining probabilities
    ... down pat but I'm having trouble understanding the following problem. ... If word A indicates an e-mail is 90% likely to be spam and word B indicates ... derivation or proof to full understand how this combined probability works. ... You'd like the conditional probability Pthat the email ...
    (sci.math)
  • Re: Motorhall Automobile Trade
    ... Int: +46 705474830 ... International Spam Trade ... eggs and spam, spam and toast, spam, spam and eggs, old spam, stupid ... spam, spam, spam eggs and ham, ham and spam, spam in a can, and of ...
    (rec.autos.4x4)