Having seen numerous articles on Bayesian Spam Filtering, I decided to take a leap and see what it could do for me.
I snagged a copy of POPFile, followed the incredibly simple install instructions and downloaded my mail. During installation I'd given it the names of four "buckets" I wanted mail put into: spam, friends, livejournal and general.
I also told it not to put the name of the bucket at the start of the subject line (which would leave every spam mail with [spam] in its subject), but instead to put the bucket name into the email headers, where it wouldn't be seen. My email reader, The Bat, happily filters on header information, and I didn't see why other people should have to put up with mangled subject lines in replies.
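For anyone whose mail client can't filter on headers, the same routing is easy to sketch in a few lines of Python. This is only an illustration, not what The Bat does internally: the header name `X-Text-Classification` is an assumption about what the filter writes (check the headers of a message you've actually received), and `folder_for` is a made-up helper name.

```python
from email import message_from_string

# Assumed header name; look at a real classified message to confirm it.
BUCKET_HEADER = "X-Text-Classification"


def folder_for(raw_message: str, default: str = "general") -> str:
    """Return the folder a message belongs in, based on its bucket header."""
    msg = message_from_string(raw_message)
    bucket = msg.get(BUCKET_HEADER)
    # Fall back to a default folder when the header is missing.
    return bucket.strip().lower() if bucket else default


sample = (
    "From: someone@example.com\n"
    "Subject: Cheap watches\n"
    "X-Text-Classification: spam\n"
    "\n"
    "Buy now!\n"
)
print(folder_for(sample))
```

The subject line is never touched, so anyone replying sees it exactly as it was sent.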
Of course, the first time I ran the software it had no idea where the emails should go, as I hadn't trained it at all. It therefore filed them all in bucket number one, which happened to be "spam". I then opened the user interface (web based and very easy to use) and went through all of the non-spam emails, telling it which bucket they should belong in. I also set up filters in The Bat to put emails from each bucket into their own folders.
After initial training on about 20 emails, it has managed to spot livejournal posts perfectly (they obviously have enough in common to be easy to spot), spam near perfectly (no false negatives, only the occasional false positive) and friends fairly well (Joe and Mike are now both recognised instantly as friends, and it'll learn the rest of you as I get emails from you).
It's not perfect (yet), but it's doing a darn good job and I'm confident that as I feed it more email it'll get better and better. Apparently I should only touch it when it categorises something incorrectly, so I happily predict that by this time next week I'll pretty much have forgotten it's there.
Oh, and for all you weirdos that don't use Windows - it's Perl based and installs on Macs and Linux...
Highly recommended
no subject
Date: 2003-07-03 05:57 am (UTC)
I upgraded to a newer version on June 11, and reset my stats then . . . so, my current stats as of June 11 are:
Messages classified: 3,334
Classification errors: 40
Accuracy: 98.8%
Not bad, not bad at all.
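As a quick sanity check on those figures, the accuracy is just the share of messages that were not misclassified. A minimal sketch (the function name is only illustrative):

```python
def accuracy(classified: int, errors: int) -> float:
    """Fraction of classified messages that landed in the right bucket."""
    return (classified - errors) / classified


# Figures quoted above: 3,334 messages classified, 40 errors.
print(f"{accuracy(3334, 40):.1%}")  # prints 98.8%
```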
no subject
Date: 2003-07-03 06:05 am (UTC)
Are you just separating into Spam/not spam or are you being pickier than that?
no subject
Date: 2003-07-03 06:09 am (UTC)
I've got a bucket for personal mail, and one for spam; then, in between, I've got one for livejournal and one for lists. I was impressed that it was able to learn to differentiate the 'lists', as it's just bulk mail that might be of interest, from companies or sites I've subscribed to, but POPFile seems to be able to work highly accurately, even if the lines between buckets are thinly drawn.
no subject
Date: 2003-07-03 06:14 am (UTC)
I'd have to read my mail at home to filter spam.
no subject
Date: 2003-07-03 06:22 am (UTC)
Alternatively, if you're using a generalised webmail reader (i.e. one that can be pointed at any web address) you could run the filtering proxy on your home machine and bounce all email through it. You're on dialup though which would be both slow and expensive :->
no subject
Date: 2003-07-03 06:51 am (UTC)
no subject
Date: 2003-07-03 06:58 am (UTC)
I didn't realise it was now that advanced.
no subject
Date: 2003-07-03 07:05 am (UTC)
OK, so you can get a decent firewall for free, and less decent AV software for free, but Norton IS integrates everything I want/need for £30. Hardly bank-breaking.
Outlook does allow folders, and condition-based filtering - so with Outlook and Norton IS, I think I have the same functionality you do, yeah?
Also, because Outlook allows condition-based filtering, you can train it to filter out spam after a while too.
Don't get me wrong, this wee package sounds fine, it just doesn't strike me as anything particularly ground-breaking. Or am I missing something?
no subject
Date: 2003-07-03 07:14 am (UTC)
What's special about POPFile (and other Bayesian filtering systems) is that they start off with no knowledge and you train them to do what you want.
So, for instance, I train it to recognise my friends by just saying "this email came from a friend", not by giving it an address to recognise. It works out how to recognise that an email is 'from a friend' all by itself.
Similarly, it recognises what I count as spam based on my categories, not according to rules I give it.
Outlook certainly allows you to filter stuff into folders based on your own criteria, but it's the recognition of what type of mail you've just received which is the clever bit.
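To make the "starts with no knowledge, learns from what you tell it" idea concrete, here's a toy naive Bayes classifier. It's a rough sketch of the general technique rather than the filter's actual code: the class name, the tokeniser and the add-one smoothing are all choices made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy multinomial naive Bayes over words; not the real filter's code."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        """Tell the classifier that this text belongs in this bucket."""
        self.message_counts[bucket] += 1
        self.word_counts[bucket].update(self.tokenize(text))

    def classify(self, text):
        """Return the most probable bucket, or None if nothing is trained."""
        if not self.message_counts:
            return None
        words = self.tokenize(text)
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, n in self.message_counts.items():
            counts = self.word_counts[bucket]
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word (add-one smoothing).
            score = math.log(n / total_messages)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


clf = BucketClassifier()
clf.train("spam", "cheap pills buy now limited offer")
clf.train("friends", "fancy a pint after work on friday")
clf.train("livejournal", "new comment posted to your journal entry")
print(clf.classify("buy cheap pills"))
print(clf.classify("pint on friday"))
```

Notice there are no addresses or hand-written rules anywhere: the only input is "this message belongs in that bucket", and correcting a mistake is just another call to `train`.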
If Norton's anti-spam works perfectly then yes, you've got effectively the same facilities.
Although I, personally, don't like Outlook very much.
no subject
Date: 2003-07-03 08:00 am (UTC)
What do you have against Outlook?
no subject
Date: 2003-07-03 08:06 am (UTC)
Outlook's indentation leaves a fair bit to be desired (it doesn't do the whole "line breaks with a '>' at the start" thing very well).
It doesn't reflow text at all.
It won't allow me to have multiple rotating signatures.
I think that's it :->
Re:
Date: 2003-07-03 12:01 pm (UTC)
Not sure what version of Outlook you have, but I've been running two accounts in Outlook ever since I got broadband, and it's a doddle. Replies via either account are selected via a drop-down. Actually, it couldn't be any simpler.
As for your other three points, I'll give you the first two, but signatures are just a vanity that I can live without.
no subject
Date: 2003-07-03 01:18 pm (UTC)
I'll admit I could live without signatures :->
no subject
Date: 2003-07-03 09:00 am (UTC)