Re: Can anyone explain Bayesian filtering/calculations?
From: Richard Heathfield (dontmail_at_address.co.uk.invalid)
Date: 10/02/03
- Next message: Richard Heathfield: "Re: SIMPLE algorithm that took me by surprise."
- Previous message: Alf P. Steinbach: "Re: Interfacing java with microsoft word"
- In reply to: Corey Murtagh: "Can anyone explain Bayesian filtering/calculations?"
- Next in thread: TLOlczyk: "Re: Can anyone explain Bayesian filtering/calculations?"
- Reply: TLOlczyk: "Re: Can anyone explain Bayesian filtering/calculations?"
- Reply: emerth: "Re: Can anyone explain Bayesian filtering/calculations?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 2 Oct 2003 06:49:16 +0000 (UTC)
Corey Murtagh wrote:
> Ok, so I'm messing around with writing a spam/virus filter for my
> mailbox. After reading up on the methods used by the various existing
> packages (which I'm choosing not to use because it's fun to try :>), I
> think I've got a fairly good handle on what I want to do.
>
> Where I'm stuck is in calculating the effects of combining the various
> signatures' probabilities. It seems that the popular method at the
> moment is to use Bayes' Rule to handle the probability calcs... but
> statistics is most definitely /not/ my strong suit, and there's way too
> much high-level stuff out there for easy googling.
No, it's really easy, honest! They just make it look hard. It's one of those
conspiracy things (like STL, or ncurses).
> So... can anyone help with an explanation?
Yeah, but never forget that I'm long-winded. This is easy, but I have a
tendency to stretch things out to make them sound even easier. So don't be
put off by the size of the article. :-)
If we take spam as our text: any word might appear in spam, and that same
word might appear in not-spam (ham). In a "training" phase, we ask the user
to identify all emails as either ham or spam (or, possibly, as "don't know"
in which case we ignore them completely for our purposes). We tokenise the
emails, identifying words (or N-graphs, or whatever you think is best -
words are fine) and adding counts to a database. (I use a text file, with
each line consisting of a tab-separated line:
word <tab> hamcount <tab> spamcount
So, over time, we build up a picture of how many times a word is used in
ham, compared with how many times it's used in spam.
We can therefore calculate a probability (careful! Probability can trap you
easily if you're not watching):
p = hamcount / (hamcount + spamcount)
What does p represent? It is an approximate measure of the probability that
an email is spam, **given that it contains a particular word**. So - if the
word is "medication", my database might look like this:
dollars 6 1907
so the probability that an email to me, containing the word "dollars", is
spam is very high. In fact, it's 0.996864, which is pretty close to 1
(certainty).
For some words (and if I were to give you an example, we'd be involved in a
contradiction!), it might actually /be/ 1, but we don't actually let any
one word dominate proceedings like that, so we'd probably frob it down to
0.999999. You'll see why in a bit.
Okay, that's just one word. Here's another:
exiting 28 14
Here, we see that there is a 0.333333 probability that an email with the
word "exiting" in it is spam.
So - what if an email contains the word "exiting" AND the word "dollars"?
What is the probability of it being spam? Well, it turns out to be:
a = 0.333333
b = 0.996864
p = (a * b) / ((a * b) + ((1 - a) * (1 - b))))
which is 0.332288 / (0.332288 + 0.209067) or 0.613808. This is the
probability (if we have no other relevant information) that the email is
spam.
If we had three words, we'd use:
(a * b * c) / (a * b * c)((1 - a)(1 - b)(1 - c))
instead. And so on.
That's it - the whole thing.
To get a good idea of whether an email is spam, then, just parse it, check
it against your database, find the N most interesting items (that is, those
items with a high value of fabs(0.5 - p)), and do the calcs. (Some people
suggest that 20 is a good value of N. The same people also suggest that you
might want to double your ham count when calculating p, because otherwise
the ham can tend to be drowned out by the sheer volume of spam.)
> Or a site? Once I've got a
> handle on the method I'm sure I can code it, but an example in code form
> (pref. something I can recognize... ie: no perl :>) would go a long way.
How about the lingua franca? Here's my own code, which you are very welcome
to use under the terms of the GPL:
/* bayes.h */
#ifndef BAYES_H_
#define BAYES_H_ 1
#include <stddef.h>
double CalcBayesFilterValue(double *p, size_t n);
double GetBayesDistance(double d);
void RankBayesDistance(double *p, size_t numelems, double d);
#endif
/* bayes.c */
#include <stddef.h>
#include <math.h>
#include "bayes.h"
double CalcBayesFilterValue(double *p, size_t n)
{
double m = 1.0;
double b = 0.0;
double d = 1.0;
double epsilon = 0.0001;
double upepsilon = 1.0 - epsilon;
size_t i = 0;
while(i < n)
{
m *= (p[i] > epsilon) ? p[i] : epsilon;
d *= (p[i] > upepsilon) ? epsilon : (1.0 - p[i]);
++i;
}
b = m;
m += d;
b /= m;
return b;
}
double GetBayesDistance(double d)
{
return fabs(0.5 - d);
}
/* rankings are in terms of distance from 0.5.
* Element 0 is the least far from 0.5. An
* equal ranking at [0] is always replaced.
*/
void RankBayesDistance(double *p, size_t numelems, double d)
{
double dist = GetBayesDistance(d);
size_t i = 0;
int done = 0;
while(!done && i < numelems)
{
if(dist < GetBayesDistance(p[i]))
{
done = 1;
}
else
{
if(i > 0)
{
p[i - 1] = p[i];
}
p[i] = d;
}
++i;
}
}
#if TEST
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
double d[] = {0.6, 0.72, 0.9, 0.1, 0.33, 0.7};
double rank[sizeof d / sizeof d[0]] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
size_t n = sizeof d / sizeof d[0];
double b = CalcBayesFilterValue(d, n);
int i;
printf("%.6f\n", b);
for(i = 0; i < 20; i++)
{
size_t place;
double r = rand() / (RAND_MAX + 1.0);
RankBayesDistance(rank, n, r);
for(place = 0; place < n; place++)
{
printf(" %.5f", rank[place]);
}
printf("\n");
}
return 0;
}
#endif
-- Richard Heathfield : binary@eton.powernet.co.uk "Usenet is a strange place." - Dennis M Ritchie, 29 July 1999. C FAQ: http://www.eskimo.com/~scs/C-faq/top.html K&R answers, C books, etc: http://users.powernet.co.uk/eton
- Next message: Richard Heathfield: "Re: SIMPLE algorithm that took me by surprise."
- Previous message: Alf P. Steinbach: "Re: Interfacing java with microsoft word"
- In reply to: Corey Murtagh: "Can anyone explain Bayesian filtering/calculations?"
- Next in thread: TLOlczyk: "Re: Can anyone explain Bayesian filtering/calculations?"
- Reply: TLOlczyk: "Re: Can anyone explain Bayesian filtering/calculations?"
- Reply: emerth: "Re: Can anyone explain Bayesian filtering/calculations?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|