Text Classification - Spam Filter

Corn Holio

I'm currently writing a POP3 proxy to act as a spam filter in Java.
Does anyone know of any good Java text classification tools? I want
to start with basic spam filtering.

I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.

The problem is that Classifier4J now thinks EVERY email that comes
through is spam, and it filters it into a separate spam inbox.

Anyone have any suggestions? I'm looking for a Java API
for accomplishing this.

Thanks
 
Joona I Palaste

Corn Holio said:
I'm currently writing a POP3 proxy to act as a spam filter in Java.
Does anyone know of any good java text classification tools? I want
to start with basic spam filtering.
I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.
Problem is.. now Classifier4J thinks EVERY email that comes
through is spam and it filters it into a different spam inbox.
Anyone have any suggestions? I'm looking for a Java API
for accomplishing this.

I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.
 
GaryM

Joona I Palaste said:
I have a related problem. It is well known that spammers mask
critical words to escape filtering. For example "Gen.er.ic
v1agra". I would like to develop some kind of filter rules that
would classify *EVERY* masked word as spam, regardless of what the
word is intended to mean. But I can't think of a rule that would
catch masked words and still let normal punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

I have found the Soundex method (http://en.wikipedia.org/wiki/Soundex)
useful in deriving obscured words. Developed for genealogy (think
surname corruption), it produces a code for a word based on its
leading character and phonemes. I have tested it on words with numbers in
place of letters and it works more often than not.
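
For anyone who wants to experiment with this, here is a minimal sketch of American Soundex in Java. Plain Soundex is defined only for letters, so the digit-to-letter table in normalize() (0 to O, 1 to I, and so on) is an assumption added for illustration, not part of the algorithm, and the class and method names are made up for this example.

```java
import java.util.Locale;

/** Minimal American Soundex sketch with a hypothetical digit-to-letter table. */
class SoundexSketch {
    // Soundex digit for each letter A..Z ('0' means "not coded": vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    /** Maps look-alike digits to letters (an assumption) and drops anything that is not A-Z. */
    static String normalize(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case '0': c = 'O'; break;
                case '1': c = 'I'; break;
                case '3': c = 'E'; break;
                case '4': c = 'A'; break;
                case '5': c = 'S'; break;
                default: break;
            }
            if (c >= 'A' && c <= 'Z') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns the four-character Soundex code, or "" if no letters remain. */
    static String soundex(String word) {
        String s = normalize(word);
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char last = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code after it is kept
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

With this sketch, soundex("v1agra") and soundex("viagra") both give V260, so a filter could compare codes instead of raw words. Soundex always keeps the first letter, so a word whose first character is masked would still slip through.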
 
Jim Sculley

Joona said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

On my Linux box, I use a tool called SpamAssassin. It has the
capability to learn from existing messages. Out of the box, it caught
about 50% of the SPAM I received. After a few weeks of 'teaching' it
misses only 1 to 2% of the junk. Perhaps the source can give you some tips:

http://useast.spamassassin.org/downloads.html


Jim S.
 
Kai Grossjohann

Corn Holio said:
I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.

Problem is.. now Classifier4J thinks EVERY email that comes
through is spam and it filters it into a different spam inbox.

Did you give it both spam and ham to learn from, or did you just give
it spam? This was not clear from your description.

I use the bogofilter program, myself. Works well. But it's written
in C, not Java.

Kai
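
To illustrate why the spam/ham question matters, here is a small self-contained sketch of a Graham-style word-probability classifier. It is not Classifier4J's implementation and does not use its API; the TinyBayes class, its method names, and the 0.01/0.99 clamps are assumptions chosen for the example. When only spam has been taught, every known word (including ordinary ones like "the") looks maximally spammy, so nearly any message scores as spam.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical Graham-style classifier, for illustration only. */
class TinyBayes {
    private final Map<String, Integer> spamCounts = new HashMap<String, Integer>();
    private final Map<String, Integer> hamCounts = new HashMap<String, Integer>();
    private int spamDocs = 0;
    private int hamDocs = 0;

    void teachSpam(String text) { spamDocs++; count(spamCounts, text); }

    void teachHam(String text) { hamDocs++; count(hamCounts, text); }

    /** Combines per-word probabilities: prod(p) / (prod(p) + prod(1 - p)), computed in log space. */
    double spamProbability(String text) {
        double logOdds = 0.0;
        for (String token : tokens(text)) {
            double p = wordProbability(token);
            logOdds += Math.log(p) - Math.log(1.0 - p);
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    /** Unknown words are neutral (0.5); known words are clamped to [0.01, 0.99]. */
    private double wordProbability(String token) {
        int s = get(spamCounts, token);
        int h = get(hamCounts, token);
        if (s + h == 0) {
            return 0.5;
        }
        double spamRate = (double) s / Math.max(1, spamDocs);
        double hamRate = (double) h / Math.max(1, hamDocs);
        double p = spamRate / (spamRate + hamRate);
        return Math.min(0.99, Math.max(0.01, p));
    }

    private static int get(Map<String, Integer> counts, String token) {
        Integer n = counts.get(token);
        return n == null ? 0 : n;
    }

    private static void count(Map<String, Integer> counts, String text) {
        for (String token : tokens(text)) {
            counts.put(token, get(counts, token) + 1);
        }
    }

    /** Lower-cases the text and splits on anything that is not a-z or 0-9. */
    private static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }
}
```

Teaching a couple of spam messages and then scoring an ordinary note that shares words like "the" or "for" gives a score close to 1. After teaching a few ham messages as well, the shared words become neutral and the same note scores close to 0.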
 
William Brogden

Joona I Palaste said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

There are some text analysis newsgroups, but they are not very active
(if anybody knows of an active one, please let me know).

Let's see, the following occur to me. For the spam that breaks up words
with periods and the like, I think you have to look at message
statistics rather than detect single words. Try:
A high proportion of periods or commas embedded in the text (with no
following space). Lots of short "words" not in the normal English vocabulary.

For the nonsense word sequences that appear to be randomly generated
and are typically hidden by HTML tags:
Extended sequences of words with none of the usual a, an, the, and
small words found in normal text. Also sequences with no verbs.
Unrecognizable HTML-like tags.

Bill (Also interested in text analysis)
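
Two of the statistics above can be sketched in a few lines of Java. This is only a rough illustration: the MessageStats class, the short function-word list, and the idea of expressing each signal as a ratio are assumptions, and any thresholds would have to be tuned on real mail. Legitimate tokens such as URLs, "e.g." or "3.14" also count as embedded punctuation here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical per-message statistics, for illustration only. */
class MessageStats {
    // A period or comma with a letter or digit directly on both sides.
    private static final Pattern EMBEDDED = Pattern.compile("\\p{Alnum}[.,]\\p{Alnum}");

    // A deliberately tiny sample of English function words (an assumption).
    private static final Set<String> FUNCTION_WORDS = new HashSet<String>(Arrays.asList(
            "a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
            "for", "on", "with", "that", "this"));

    /** Fraction of whitespace-separated tokens that contain embedded punctuation. */
    static double embeddedPunctuationRatio(String text) {
        int hits = 0;
        int total = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (EMBEDDED.matcher(token).find()) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Fraction of alphabetic words that appear in the function-word list. */
    static double functionWordRatio(String text) {
        int hits = 0;
        int total = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            total++;
            if (FUNCTION_WORDS.contains(word)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

A high embeddedPunctuationRatio points at dot-masked words, while a functionWordRatio near zero over a long stretch of text points at randomly generated word sequences.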
 
Pavel Tonkov

Joona said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

Take a look at the java.util.regex package in the 1.4 SDK.

I've been thinking of writing a filtering proxy for a while using
regular expression matching to "see through" masking like that.

You'd still have to define spam words to catch spam, but you could match
"v.i.a.g.r.a" (or any variant with punctuation in between) using

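import java.util.regex.Pattern;
// \W matches one non-word character; backslashes are doubled in a Java string literal.
Pattern p = Pattern.compile("v\\W[i|]\\Wa\\Wg\\Wr\\Wa",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);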

The [i|] class matches either an "i" or a "|" mark, which "they" use to
hinder detection.
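
The same idea can be generalized so the pattern is built from any word on the spam list. The sketch below is only one way to do it: the MaskedWordPattern class and the substitution table (such as "1", "|" and "!" for "i") are assumptions for illustration, and allowing any run of non-word characters between letters will also produce false positives on phrases like "via grand".

```java
import java.util.regex.Pattern;

/** Hypothetical builder for "see-through" patterns, for illustration only. */
class MaskedWordPattern {

    /** Builds a case-insensitive pattern that tolerates separators and look-alike characters. */
    static Pattern forWord(String word) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                regex.append("[\\W_]*"); // any run of punctuation, spaces or underscores
            }
            regex.append(classFor(Character.toLowerCase(word.charAt(i))));
        }
        return Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Returns a character class for letters with common substitutions (an assumed table). */
    private static String classFor(char c) {
        switch (c) {
            case 'a': return "[a@4]";
            case 'e': return "[e3]";
            case 'i': return "[i1|!]";
            case 'o': return "[o0]";
            case 's': return "[s5$]";
            default:  return Pattern.quote(String.valueOf(c));
        }
    }
}
```

For example, MaskedWordPattern.forWord("viagra").matcher(subject).find() is true for "v.i.a.g.r.a", "v1agra" and "V|AGRA".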

You'd have to do some kind of Unicode translation to detect Valï(u)m (I
received that in a subject line today - notice the umlaut). I'm not sure
how to do that, but I imagine there are algorithms out there to strip
diacritic marks.
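
One way to strip diacritic marks is Unicode decomposition. The sketch below assumes java.text.Normalizer, which was added in Java 6 and so is not available in the 1.4 SDK mentioned above; the Diacritics class name is made up for the example.

```java
import java.text.Normalizer;

/** Hypothetical helper that removes combining marks, for illustration only. */
class Diacritics {

    /** Decomposes accented characters (NFD) and deletes the combining marks. */
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Running the text through Diacritics.strip before any regex or classifier step turns "Val\u00efum" (with the umlaut) into "Valium". Characters that have no decomposition, such as "ø", are left unchanged.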

It's an interesting problem, and I'd love to see a solution!!!! Death to
spammers!

Nige
 
