Text Classification - Spam Filter

Corn Holio · Jan 3, 2004

I'm currently writing a POP3 proxy to act as a spam filter in Java.
Does anyone know of any good java text classification tools? I want
to start with basic spam filtering.

I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.

Problem is.. now Classifier4J thinks EVERY email that comes
through is spam and it filters it into a different spam inbox.

Anyone have any suggestions? I'm looking for a Java API
for accomplishing this.

Thanks

Joona I Palaste · Jan 3, 2004

Corn Holio said:
I'm currently writing a POP3 proxy to act as a spam filter in Java.
Does anyone know of any good java text classification tools? I want
to start with basic spam filtering.

I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.

Problem is.. now Classifier4J thinks EVERY email that comes
through is spam and it filters it into a different spam inbox.

Anyone have any suggestions? I'm looking for a Java API
for accomplishing this.

I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

GaryM · Jan 3, 2004

Joona I Palaste said:
I have a related problem. It is well known that spammers mask
critical words to escape filtering. For example "Gen.er.ic
v1agra". I would like to develop some kind of filter rules that
would classify *EVERY* masked word as spam, regardless of what the
word is intended to mean. But I can't think of a rule that would
catch masked words and still let normal punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

I have found the Soundex (http://en.wikipedia.org/wiki/Soundex) method
useful for deriving obscured words. It comes from genealogy (think
surname corruption) and produces a code for a word based on its
leading character and the consonant sounds that follow. I have tested
it on words with numbers in place of letters and it works more often
than not.
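
For readers who want to try this, here is a minimal sketch of American Soundex in Java. It is not GaryM's code: the class name SoundexSketch and the choice to skip every non-letter character (which is what lets "v1agra" and "Gen.er.ic" line up with the plain words) are assumptions made for illustration.

```java
class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' marks vowels (and Y),
    // '-' marks H and W, which are ignored without separating codes.
    private static final String CODES = "0123012-02245501262301-202";

    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = 0;
        for (char raw : word.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // assumption: digits and punctuation are simply dropped
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);          // the first letter is kept as-is
                lastCode = code;
            } else if (code == '0') {
                lastCode = code;        // a vowel separates repeated codes
            } else if (code != '-') {
                if (code != lastCode) {
                    out.append(code);   // collapse adjacent identical codes
                }
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0');            // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("v1agra"));    // V260
        System.out.println(soundex("viagra"));    // V260
        System.out.println(soundex("Gen.er.ic")); // G562
    }
}
```

A filter could compare the Soundex code of each token against the codes of a list of known spam words. Many unrelated words share a code, so a match is probably better treated as one signal among several than as proof.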

Jim Sculley · Jan 3, 2004

Joona said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

On my Linux box, I use a tool called SpamAssassin. It has the
capability to learn from existing messages. Out of the box, it caught
about 50% of the SPAM I received. After a few weeks of 'teaching' it
misses only 1 to 2% of the junk. Perhaps the source can give you some tips:

http://useast.spamassassin.org/downloads.html

Jim S.

Kai Grossjohann · Jan 3, 2004

Corn Holio said:
I tried using Classifier4J's BayesianClassifier. I tested this
by exporting about 6300 emails from Outlook and used the subjects
and bodies of these messages to "teach" the classifier what spam looks
like.

Problem is.. now Classifier4J thinks EVERY email that comes
through is spam and it filters it into a different spam inbox.

Did you give it both spam and ham to learn from, or did you just give
it spam? This was not clear from your description.

I use the bogofilter program, myself. Works well. But it's written
in C, not Java.
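
To illustrate why the spam/ham question matters, here is a small naive Bayes sketch in Java. It is not Classifier4J's implementation, and nothing here is confirmed about how that library works internally; TinyBayes and its methods are hypothetical names. It only shows that a Bayesian classifier trained on spam alone has no evidence that anything could be ham.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TinyBayes {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    // Lower-case the text and split on anything that is not a letter or digit.
    private static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    void train(String text, boolean isSpam) {
        for (String t : tokens(text)) {
            if (isSpam) {
                spamCounts.merge(t, 1, Integer::sum);
                spamTokens++;
            } else {
                hamCounts.merge(t, 1, Integer::sum);
                hamTokens++;
            }
        }
        if (isSpam) {
            spamDocs++;
        } else {
            hamDocs++;
        }
    }

    // Multinomial naive Bayes with add-one smoothing and document-count priors.
    double spamProbability(String text) {
        int docs = spamDocs + hamDocs;
        if (docs == 0) {
            return 0.5; // untrained: no opinion
        }
        Set<String> vocab = new HashSet<>(spamCounts.keySet());
        vocab.addAll(hamCounts.keySet());
        int v = Math.max(vocab.size(), 1);
        double logSpam = Math.log((double) spamDocs / docs);
        double logHam = Math.log((double) hamDocs / docs); // -Infinity if no ham seen
        for (String t : tokens(text)) {
            logSpam += Math.log((spamCounts.getOrDefault(t, 0) + 1.0) / (spamTokens + v));
            logHam += Math.log((hamCounts.getOrDefault(t, 0) + 1.0) / (hamTokens + v));
        }
        return 1.0 / (1.0 + Math.exp(logHam - logSpam));
    }

    public static void main(String[] args) {
        TinyBayes b = new TinyBayes();
        b.train("cheap pills buy now", true);
        // Spam-only training: the ham prior is zero, so this prints 1.0
        System.out.println(b.spamProbability("meeting agenda for monday"));
        b.train("meeting agenda for monday attached", false);
        // With one ham example the same kind of message now scores below 0.5
        System.out.println(b.spamProbability("monday meeting agenda"));
    }
}
```

If the 6300 exported messages were all spam, the fix is probably to train on a comparable set of legitimate mail as well, using whatever the library provides for non-matching examples.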

Kai

William Brogden · Jan 3, 2004

Joona I Palaste said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

There are some text analysis news groups but they are not very active
(if anybody knows of an active one, please let me know.)

Let's see, the following occur to me.

For the spam that breaks up words with periods, etc., I think you have
to look at message statistics, not detect single words. Try:
- A high proportion of periods or commas embedded in text (no
  following space).
- Lots of short "words" not in the normal English vocabulary.

For the nonsense word sequences that appear to be randomly generated
and are typically hidden by HTML tags:
- Extended sequences of words with none of the usual small words
  (a, an, the, and) found in normal text.
- Sequences with no verbs.
- Unrecognizable HTML-like tags.
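
Two of these statistics are easy to sketch in Java: the share of tokens with a period or comma embedded between letters, and the share of small function words. The class name, the method names, and the exact small-word list below are assumptions for illustration; any thresholds would have to be tuned on real mail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class MessageStats {
    // A period or comma with a letter on both sides, e.g. "Gen.er.ic".
    private static final Pattern EMBEDDED = Pattern.compile("\\p{L}[.,]\\p{L}");

    // Assumed list of small words; extend as needed.
    private static final Set<String> SMALL_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "and", "of", "to", "in", "is", "it"));

    // Fraction of whitespace-separated tokens containing embedded punctuation.
    static double embeddedPunctuationRatio(String text) {
        String[] tokens = text.trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String t : tokens) {
            if (EMBEDDED.matcher(t).find()) {
                hits++;
            }
        }
        return (double) hits / tokens.length;
    }

    // Fraction of alphabetic words that are common small words.
    static double smallWordRatio(String text) {
        int total = 0;
        int hits = 0;
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (t.isEmpty()) {
                continue;
            }
            total++;
            if (SMALL_WORDS.contains(t)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println(embeddedPunctuationRatio("Gen.er.ic v1agra here"));
        System.out.println(smallWordRatio("the cat and a dog"));
    }
}
```

Legitimate text such as "e.g." or a bare host name also counts as embedded punctuation, which is one reason to treat these as proportions over the whole message rather than per-word rules.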

Bill (Also interested in text analysis)

Pavel Tonkov · Jan 4, 2004

Joona said:
I have a related problem. It is well known that spammers mask critical
words to escape filtering. For example "Gen.er.ic v1agra". I would like
to develop some kind of filter rules that would classify *EVERY* masked
word as spam, regardless of what the word is intended to mean. But I
can't think of a rule that would catch masked words and still let normal
punctuation through.
This is not a Java question as such - so if I'm off-topic, please
suggest another group.

Take a look at the java.util.regex package in the 1.4 SDK.

I've been thinking of writing a filtering proxy for a while using
regular expression matching to "see through" masking like that.

You'd still have to define spam words to catch spam, but you could match
"v.i.a.g.r.a" (or any variant with punctuation in between) using

// Backslashes are doubled in the Java string; the flags live on Pattern.
Pattern p = Pattern.compile("v\\W[i|]\\Wa\\Wg\\Wr\\Wa",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The character class [i|] matches either an "i" or the "|" mark that
"they" use in its place to hinder detection.
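
The same idea can be generalised so the pattern is built from a plain keyword instead of being written by hand. The sketch below is an illustration only: MaskedWordPattern, forWord, and the look-alike substitutions it allows are assumptions, not part of java.util.regex.

```java
import java.util.regex.Pattern;

class MaskedWordPattern {
    // Builds a pattern that tolerates any run of non-word characters
    // (or underscores) between letters, plus a few look-alike characters.
    static Pattern forWord(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toLowerCase(word.charAt(i));
            if (i > 0) {
                sb.append("[\\W_]*"); // optional separators such as '.', '-', ' '
            }
            switch (c) {
                case 'i': sb.append("[i1|!]"); break; // assumed substitutions
                case 'a': sb.append("[a@4]");  break;
                case 'o': sb.append("[o0]");   break;
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    public static void main(String[] args) {
        Pattern p = forWord("viagra");
        System.out.println(p.matcher("Gen.er.ic v1agra").find());    // true
        System.out.println(p.matcher("v.i.a.g.r.a").find());         // true
        System.out.println(p.matcher("a perfectly normal sentence").find()); // false
    }
}
```

Because separators may include spaces, a phrase like "via grand" would also match, so a pattern like this likely needs word-boundary checks or a scoring step before it is used to reject mail outright.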

You'd have to do some kind of Unicode translation to detect Valï(u)m (I
received that in a subject line today; notice the umlaut). I'm not sure
how to do that, but I imagine there are algorithms out there to strip
diacritic marks.
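
One way to strip diacritics in later Java versions is java.text.Normalizer, which was added in Java 6 and so was not available in the 1.4 SDK mentioned above. The sketch below decomposes each character and removes the combining marks; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class DiacriticStripper {
    static String strip(String s) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00ef" is i with diaeresis, as in the subject line quoted above.
        System.out.println(strip("Val\u00ef(u)m")); // Vali(u)m
    }
}
```

This only handles characters that decompose into a base letter plus marks. Look-alikes such as "ø" or "ß" have no such decomposition and would need an explicit mapping table.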

It's an interesting problem, and I'd love to see a solution!!!! Death to
spammers!

Nige
