email filter for repeat offenders


Mike

I would like some suggestions and/or references to already-written code
to assist with writing an email filter. I use a web-based email client
(questmail.futurequest.net) and want to write an intelligent filter to
(hopefully) discourage some of the repeat junk mail that I receive (>80
messages/day). I'm fairly new to Perl and am still sifting through the
standard packages. Here's a description of what I want this filter to
do:

1. Anyone on my 'white-list' will have their message allowed through.
If they have included a binary attachment (rare), I can scan it
manually for viruses.
2. I want to have a dynamic 'off-white-list' that will consist of
addresses and subject lines of recently sent email. If I can somehow
read my "Sent" folder, that will essentially be the off-white-list.
3. I want a 'black-list' that not only holds the actual email address
(not the display "From") but also holds the date of the last email sent
from that address and a counter of the number of sends.

The date information in #3 will be used to remove entries from the
black-list once they've cleaned up their act and haven't sent an email
in, say, one month. The 'send-count' will be used to send progressively
more obnoxious email kickbacks to the sender. On the first three
occurrences, the kickback will consist of either an email notifying the
sender why their email was rejected or a mockup of the message returned
from email servers for non-existing addresses (I haven't yet decided
which). Beginning with offense number four, N-minus-3 emails will be
sent (N being the number of occurrences stored in the black-list) and
each will include a 1 MB junk attachment file. So on the fifth offense,
the sender will receive two identical email kickbacks, each with a 1 MB
attachment, on the sixth they will receive three, and so on. Even if
the actual sender ignores the kickbacks, eventually someone should
notice their email server filling up.
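
The bookkeeping side of the black-list (last-seen date, send counter, expiry after a quiet month) could be sketched roughly as follows. This is an illustration in Python rather than Perl, and it covers only the record keeping, not the kickbacks. The names `record_offense` and `prune`, the in-memory dictionary, and the 30-day window are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: "say, one month" is approximated as 30 days.
EXPIRY = timedelta(days=30)

def record_offense(blacklist, address, when):
    """Add or update an entry; return the new send count for that address."""
    entry = blacklist.setdefault(address.lower(), {"last_seen": when, "count": 0})
    entry["count"] += 1
    # Keep the most recent date even if messages are processed out of order.
    entry["last_seen"] = max(entry["last_seen"], when)
    return entry["count"]

def prune(blacklist, now):
    """Remove entries that have been quiet for longer than EXPIRY."""
    stale = [addr for addr, e in blacklist.items() if now - e["last_seen"] > EXPIRY]
    for addr in stale:
        del blacklist[addr]
    return stale

if __name__ == "__main__":
    bl = {}
    record_offense(bl, "Spam@Example.com", datetime(2004, 1, 1))
    record_offense(bl, "spam@example.com", datetime(2004, 1, 5))
    print(bl)
    print(prune(bl, datetime(2004, 3, 1)))
```

In practice the dictionary would have to be saved between runs (a flat file or a DBM file would do); that part is left out here.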

I will need to handle the (hopefully) rare case of the black-list email
entry being a "No Reply" type of address, lest my code get into an
infinite loop. For now, the only thing that I can think to do is
remove such addresses (if I can identify them) from the black-list. If
anyone has any slicker suggestions for this, please let me know.
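
One possible way to spot "No Reply" style addresses is a pattern match on the part before the `@`. The sketch below is in Python and the pattern list is a guess at common conventions, not a standard; `is_no_reply` is a hypothetical helper name.

```python
import re

# Assumed list of local parts that usually mean "nobody reads this mailbox".
NO_REPLY = re.compile(
    r"^(no[-_.]?reply|do[-_.]?not[-_.]?reply|mailer-daemon|postmaster|bounces?)\b",
    re.IGNORECASE,
)

def is_no_reply(address):
    """Return True if the local part looks like an automated sender."""
    local_part = address.split("@", 1)[0]
    return bool(NO_REPLY.match(local_part))
```

An address that matches could be kept off the black-list entirely, which avoids any chance of two automated systems mailing each other in a loop.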

Incoming email not already on the black-list will be sent there based
on the following criteria:
1. One or more keywords (from a list) in the subject line; this will
help quickly throw out porn.
2. A subject line of the format "RE: {something}" where the {something}
is NOT a subject of any of my sent messages (from the off-white-list).
3. A subject line containing those 'filter-defeaters', words with
zeroes in place of the letter "O", dollar signs in place of the letter
"S", and so on. I'm not sure how I'm going to write this logic yet, but
it sounds like a complex exercise in regular expressions.
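
The three criteria above might be sketched like this. It is written in Python for brevity; the substitution table, the keyword list, and every function name are placeholders invented for the example. The idea for criterion 3 is to map look-alike characters back to letters first, so that the keyword check and the obfuscation check both work on the normalized text instead of needing one regular expression per trick.

```python
import re

# Assumed look-alike table; which characters to include is a judgment call.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a", "!": "i"}
)
KEYWORDS = {"viagra", "mortgage"}  # placeholder keyword list

def normalize(subject):
    """Lower-case the subject and undo common character substitutions."""
    return subject.lower().translate(LOOKALIKES)

def has_keyword(subject):
    """Criterion 1: any listed keyword appears after normalization."""
    return any(w in KEYWORDS for w in re.findall(r"[a-z]+", normalize(subject)))

def is_unsolicited_reply(subject, sent_subjects):
    """Criterion 2: 'RE: something' where something was never sent by me."""
    m = re.match(r"\s*re:\s*(.*)", subject, re.IGNORECASE)
    if not m:
        return False
    sent = {s.strip().lower() for s in sent_subjects}
    return m.group(1).strip().lower() not in sent

def looks_obfuscated(subject):
    """Criterion 3: a token mixing letters with look-alikes that becomes a
    plain word once normalized. Expect false positives (e.g. "mp3")."""
    for token in subject.split():
        core = token.strip(".,:;!?\"'()")
        if (core and not core.isalpha()
                and re.search(r"[A-Za-z]", core)
                and normalize(core).isalpha()):
            return True
    return False
```

For example, `looks_obfuscated("M0RTGAGE rate$")` is true while `looks_obfuscated("Meeting at 10:30")` is not, since a token with no letters at all is ignored.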

Any feedback would be appreciated. I'm particularly looking for tips
on parsing the incoming (and outgoing) email and/or reusable code from
someone who has done anything remotely similar.
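
As a rough idea of what the parsing step involves, here is a sketch using Python's standard `email` package; a Perl version would follow the same outline with a mail-parsing module. The sample message and the helper name `sender_and_subject` are made up for the example. The point is that the display name and the actual address arrive in one `From` header and have to be separated before anything is looked up in a list.

```python
from email import message_from_string
from email.utils import parseaddr

# Made-up sample message for illustration.
RAW = """From: "Best Deals" <offers@example.com>
To: mike@example.org
Subject: RE: your order

Body text here.
"""

def sender_and_subject(raw_message):
    """Return (actual address in lower case, subject line) for one message."""
    msg = message_from_string(raw_message)
    _display_name, address = parseaddr(msg.get("From", ""))
    return address.lower(), msg.get("Subject", "")

if __name__ == "__main__":
    print(sender_and_subject(RAW))
```

Keep in mind that the `From` header is supplied by whoever sent the message, so the address recovered this way is only as trustworthy as the sender.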

Thanks in advance.

Mike McIntyre
 

Mike

Thanks for the reply, Scott. I asked for advice, and "Don't do this"
is advice. I hadn't thought about my host's TOS, but you're probably
correct. Defensive spamming is still spamming, and I don't want to be
guilty of the very problem that I'm battling.

If my original idea is bad, I may just use the black-list logic to
bounce the emails normally, or keep doing what I currently do with the
built-in filters, which is to forward the rejects to another email
account for sifting through later. This catches about 85-90% of the
spam, but I get about 4000 messages per month in my "suspect" email
account, and they're a pain to wade through.

I would welcome any other suggestions. Thanks.

Mike
 

Rick Scott

Mike said:
...
3. I want a 'black-list' that not only holds the actual email address
(not the display "From") but also holds the date of the last email
sent from that address and a counter of the number of sends.

The date information in #3 will be used to remove entries from the
black-list once they've cleaned up their act and haven't sent an email
in, say one month. The 'send-count' will be used to progressively
send more obnoxious email kickbacks to the sender. On the first
three occurrences, the kickback will consist of either an email
notifying the sender why their email was rejected or a mockup of the
message returned from email servers for non-existing addresses (I
haven't yet decided which). Beginning with offense number four,
N-minus-3 emails will be sent (N being the number of occurrences
stored in the black-list) and will include a 1Mb junk attachment
file. ...

This is a very bad idea, since the 'From' address in most spam is
either nonexistent (meaning a bounce back to you) or that of an
innocent party. Have a look at the spam FAQ; there are also a number
of references to different email filters there.

http://spamfaq.net/spamfighting.shtml




Rick
 

Wes Groleau

Mike said:
I would welcome any other suggestions. Thanks.

I had _excellent_ results with "spamprobe" after its
initial training. 'bogofilter' is another such tool.

If you are not familiar with "Bayesian classifiers",
do a web search for the terms "spam" and "Bayesian",
or click the 'spam' link at http://www.paulgraham.com

I ran it on a server and then 'procmail' would forward
only the good stuff to my real, secret address.

What's great about Bayesian classifiers is that they
learn the spammers' new tricks as fast as the spammers
come up with them. I never had to tinker with mail-bot
recipes for months, once spamprobe's probability data
was sufficient.
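
For anyone who wants to see the idea in miniature, here is a toy naive Bayes classifier in Python. It is not how spamprobe or bogofilter are implemented; the class name `TinyBayes`, the tokenizer, and the add-one smoothing are choices made for this sketch only. It counts words in messages you have labeled, then scores a new message by how much more likely its words are under "spam" than under "ham".

```python
import math
import re
from collections import Counter

class TinyBayes:
    """Toy naive Bayes spam scorer with add-one (Laplace) smoothing."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record one message under label 'spam' or 'ham'."""
        self.docs[label] += 1
        self.words[label].update(self.tokens(text))

    def spam_probability(self, text):
        """Return an estimate in (0, 1) that text is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"])) or 1
        total_docs = sum(self.docs.values())
        score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.words[label].values())
            # Log prior plus the sum of log word likelihoods.
            s = math.log((self.docs[label] + 1) / (total_docs + 2))
            for w in self.tokens(text):
                s += math.log((self.words[label][w] + 1) / (total_words + vocab))
            score[label] = s
        return 1 / (1 + math.exp(score["ham"] - score["spam"]))

if __name__ == "__main__":
    nb = TinyBayes()
    nb.train("cheap pills buy now", "spam")
    nb.train("buy cheap mortgage now", "spam")
    nb.train("lunch on friday", "ham")
    nb.train("meeting notes attached", "ham")
    print(nb.spam_probability("buy cheap pills"))
    print(nb.spam_probability("lunch meeting friday"))
```

Because the word counts come from your own mail, retraining on whatever slips through is what lets this kind of filter keep up with new tricks.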

--
Wes Groleau

I've noticed lately that the paranoid fear of computers becoming
intelligent and taking over the world has almost entirely disappeared
from the common culture. Near as I can tell, this coincides with
the release of MS-DOS.
-- Larry DeLuca
 

Iain Chalmers

Wes Groleau said:
I had _excellent_ results with "spamprobe" after its
initial training. 'bogofilter' is another such tool.

If you are not familiar with "Bayesian classifiers",
do a web search for the terms "spam" and "Bayesian",
or click the 'spam' link at http://www.paulgraham.com

I ran it on a server and then 'procmail' would forward
only the good stuff to my real, secret address.

What's great about Bayesian classifiers is that they
learn the spammers' new tricks as fast as the spammers
come up with them. I never had to tinker with mail-bot
recipes for months, once spamprobe's probability data
was sufficient.

Yeah, I'll add my vote for Bayesian filtering, and suggest you look at
POPFile <http://popfile.sourceforge.net/> - it's working very nicely for me
with over 500,000 messages filtered at around 99.8% accuracy. Good enough
for me...

big
 
