Recursively parsing through multipart messages using Mail::Box::Manager

Bloch

I've written a little script that uses Mail::Box::Manager to parse an mbox
file, strip off most of the headers, decode the body, and eventually print
the data that is encoded as text/plain. It works fine for messages that
are flat (i.e., multipart/alternative at the top level), where it can just
grab the plaintext attachments from one level down.

I run into problems when I hit multipart/mixed messages and have to
descend another level. I've been reading through the groups.google.com
archives and the man pages of these modules, and I see that applying
these items recursively is tricky for inexperienced programmers -- which
I claim to be. Can someone recommend a better way of getting to my desired
endpoint, or help me sort out how to get there using my existing approach?

I've attached the relevant portion of my code and the output of
printStructure to give a better idea of the problem domain.

#!/usr/local/bin/perl

use Mail::Box::Manager;
use Date::Parse;
use warnings;
use strict;

my $mgr = Mail::Box::Manager->new;
#my $folder_file = "/home/salvador/mail/releases";
my $folder_file = "/home/salvador/mail/releases.old";
my $folder = $mgr->open(folder => $folder_file)
    or die "Could not open folder $!\n";
my (@subject, @sender, @body, @time);
my $x = 0;
for ($folder->messages) {
    $subject[$x] = $_->subject;
    $sender[$x]  = $_->sender->address;
    $time[$x]    = $_->get('Date');
    #body[$x] = $decode = $_->decoded;
    #$_->printStructure;

    if ($_->isMultipart) {
        # Only walks the top-level parts; nested multiparts are not entered.
        foreach my $part ($_->body->parts) {
            my $attached_head = $part->head;
            my $attached_body = $part->decoded;
            if ($attached_head =~ /text\/plain/) {
                # print "$attached_head\n";
                # print "OK\n";
            } elsif ($attached_head =~ /multipart\/alternative/i) {
                print "$attached_head\n";
                print "Crap! How do I parse the next batch of headers?\n";
                print "$attached_body";
            }
        }
    }
    $x++;
}

PARTIAL OUTPUT OF MESSAGE STRUCTURES:

OK:
multipart/alternative: KENNEDY: AMERICANS DESERVE BETTER THAN A REPUBLICAN BUDGET THAT LEAVES THEM BEHIND (111850 bytes)
   text/plain (47689 bytes)
   text/html (62436 bytes)

OK:
multipart/alternative: Boxer Asks Legal Scholars on Dean's 'Impeachable Offense' Comment (10116 bytes)
   text/plain (2647 bytes)
   text/html (5495 bytes)

OK:
multipart/alternative: Sen. Jeffords' Statement on ANWR/Defense Spending Bill (8876 bytes)
   text/plain (1030 bytes)
   text/html (5864 bytes)

FAILS TO PARSE PROPERLY:
multipart/mixed: KENNEDY: REPUBLICANS BLOCK INTELLIGENCE BILL TO AVOID THE TRUTH OF THE WAR (202224 bytes)
   multipart/alternative (146945 bytes)
      text/plain (54877 bytes)
      text/html (91778 bytes)
   application/msexcel (53598 bytes)

....
 
Bloch

GEEEEEEYYYYYAAAARGH!!!

foreach my $part($_->body->parts('RECURSE'))

was the option that I was looking for. Missed it in the documentation
(several times, I might add).
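
For anyone who lands here later, here is a minimal sketch of how I'd
expect that option to be used. It does not touch my real mbox: the raw
message, the addresses, and the helper name plain_text_parts are all made
up for illustration. As far as I can tell from the docs, parts('RECURSE')
hands back the leaf parts of nested multiparts, so no hand-written
recursion is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mail::Message;

# Hypothetical message shaped like the failing case: multipart/mixed
# wrapping a multipart/alternative plus a binary attachment.
my $raw = <<'EOM';
From: sender@example.com
To: rcpt@example.com
Subject: nested demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset="us-ascii"

Hello in plain text.
--inner
Content-Type: text/html; charset="us-ascii"

<p>Hello in HTML.</p>
--inner--
--outer
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64

AAEC
--outer--
EOM

# Hypothetical helper: return the decoded text of every text/plain
# part of a message, however deeply it is nested.
sub plain_text_parts {
    my ($msg) = @_;
    my @texts;
    for my $part ($msg->parts('RECURSE')) {
        # contentType gives the MIME type without its parameters.
        next unless lc($part->contentType) eq 'text/plain';
        push @texts, $part->decoded->string;
    }
    return @texts;
}

# Parse the raw string into a message object and print its plain text.
my $msg = Mail::Message->read($raw);
print for plain_text_parts($msg);
```

In the original loop the same idea should only need the one-line change
shown above, with the text/plain check applied to each returned part.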

For what it's worth, I place the blame entirely on Mark Overmeer, who
spent godknowshowlong writing and documenting this excellent module.
Mark, if you hadn't been so thorough, I would never have missed such an
important, easily-spotted detail. No, no, this has nothing to do with the
fact that I'm an American, weaned on television and raised in the age of
instant gratification. Nor with the fact that my IQ is roughly 200 points
lower than a sponge -- and not one of those real sponges either, I'm
talking a sponge made by 3M or Dow or someone. No, it's your fault.

And that goes for the lot of you Perl mongers who have contributed to
developing Perl, and in so doing, have helped to build the modern
internet, or rather, "internets" as our President so eloquently puts it.
You owe me something. I could be using Smalltalk, or Eiffel, or Scheme or
Visual Basic or something, but I chose Perl. Okay, admittedly, Perl
*might* be slightly better than those languages for the problem domains
that I usually look at -- parsing textfiles and playing around with *nixy
stuff and so on. But many of my former CS professors *insist* that it's
ugly -- so it must be true -- so, again, you owe me for giving me such a
cool language to play with for free -- as in free beer and free
speech.

;-)

