There are intermediate levels between just the header and the whole thing
that are useful in spam detection, namely:
Just the first X characters of the body.
I think that if I am at the body level doing a lexical analysis,
then the whole body will always perform better than anything less.
Don't forget that the numero uno spam feature, to wit, the unsubscribe
message in all of its permutations, is always near the end.
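
To make the point concrete, here is a rough Python sketch (not a real filter). The pattern, the function names and the 500-character window are all made up for illustration; a production rule set would carry many more unsubscribe variants.

```python
import re

# Hypothetical pattern: a handful of unsubscribe-style phrases only.
UNSUBSCRIBE = re.compile(
    r"unsubscribe|opt[\s-]?out|remove\s+me|be\s+removed", re.IGNORECASE
)

def unsubscribe_in_head(body: str, head_chars: int = 500) -> bool:
    """Look only at the first head_chars characters of the body."""
    return bool(UNSUBSCRIBE.search(body[:head_chars]))

def unsubscribe_in_tail(body: str, tail_chars: int = 500) -> bool:
    """Look only at the last tail_chars characters of the body."""
    return bool(UNSUBSCRIBE.search(body[-tail_chars:]))

# A long sales pitch followed by the usual footer.
sample_body = "Buy now! " * 200 + "\nTo unsubscribe, click here."
```

With this sample, a check limited to the first 500 characters never sees the footer, while the tail check does.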
Just the body without the attachments.
In a multipart MIME message there is no distinction between 'body' and
'attachment'. These are all parts, and you can make an intelligent guess
by looking at each part's MIME type and its disposition. Sadly there are
no guarantees about where any of them will occur, so you must parse them
all or parse until some assumption is satisfied (like "the first text/*
part is the one I want to analyze").
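
As a sketch of that "first text/* part" assumption, here is how it might look with Python's standard email package (not JavaMail). The function name and the sample message are invented for illustration, and skipping parts marked as attachments is my own extra assumption.

```python
from email import message_from_string
from email.message import Message
from typing import Optional

def first_text_part(msg: Message) -> Optional[Message]:
    """Return the first text/* part that is not flagged as an attachment."""
    # walk() yields the message itself and then every nested part,
    # depth first; it also descends into embedded message/rfc822 parts.
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # multipart containers, images, binaries, ...
        if part.get_content_disposition() == "attachment":
            continue  # a text file attached to the mail, not the body
        return part
    return None

# Made-up message: the attachment comes first, the text second.
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="a.bin"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "AAEC\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Cheap pills here\n"
    "--XX--\n"
)

part = first_text_part(message_from_string(RAW))
```

Note that the text part is found even though it is not the first part of the message, which is exactly why you cannot rely on position.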
IMHO, body checks are definitely the most expensive, and you can obtain
excellent performance without them, but they are useful as a last
resort. If you are not using JavaMail, you can just read the stream and
abort when you've seen enough, but you will need to decipher MIME
boundaries on the fly and decode base64, quoted-printable, etc.
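
A minimal Python sketch of the "read and abort" idea, covering only the decoding half. It assumes you have already located a part's body lines and know its Content-Transfer-Encoding; boundary detection is left out, and the function name is hypothetical.

```python
import base64
import quopri
from typing import Iterable

def decode_prefix(lines: Iterable[bytes], encoding: str, limit: int) -> bytes:
    """Decode body lines one at a time and stop once limit bytes are decoded."""
    out = bytearray()
    pending = b""  # base64 characters not yet forming a complete 4-char group
    for line in lines:
        if encoding == "base64":
            pending += line.strip()
            usable = len(pending) - len(pending) % 4
            out += base64.b64decode(pending[:usable])
            pending = pending[usable:]
        elif encoding == "quoted-printable":
            # A trailing "=" is a soft line break and is dropped by the decoder.
            out += quopri.decodestring(line)
        else:
            out += line  # 7bit / 8bit: nothing to decode
        if len(out) >= limit:
            break  # seen enough; the rest of the stream is never read
    return bytes(out[:limit])

# Made-up input: an iterator standing in for the network stream.
stream = iter([b"SGVsbG8g\n", b"d29ybGQh\n"])
prefix = decode_prefix(stream, "base64", 5)
```

Because the loop stops as soon as the limit is reached, the second line of the sample stream is still unread when the function returns.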
Some other pointers for mail body checks:
Embedded RFC 822 messages require recursion to parse if they are
multipart. Here JavaMail is cumbersome but can be wrapped easily to do
the job.
Strip HTML or not? You may have seen Her<fhjhdfjhd>bal remedy, which
is rendered as Herbal in all mail clients that render HTML. In general
it is best to strip HTML, but sometimes the URLs in the body are more
incriminating than the domains in the headers.
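
One way to have it both ways is to strip the tags but keep the URLs they carry. Below is a rough Python sketch using the standard html.parser module; the class and function names are made up, and collecting only href and src attributes is an assumption about which URLs matter.

```python
from html.parser import HTMLParser
from typing import List, Tuple

class TextAndLinks(HTMLParser):
    """Collect the visible text and any href/src URLs from an HTML body."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.text: List[str] = []
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Bogus tags such as <fhjhdfjhd> land here too and are simply dropped.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

    def handle_data(self, data):
        self.text.append(data)

def strip_html(html: str) -> Tuple[str, List[str]]:
    """Return (text with tags removed, list of URLs found in the tags)."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return "".join(parser.text), parser.urls

text, urls = strip_html(
    'Her<fhjhdfjhd>bal remedy <a href="http://example.com/buy">here</a>'
)
```

The word-splitting tag disappears, so a lexical check sees "Herbal" again, and the link target is still available for a URL check.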
Just a few thoughts,
Gary