B
Bloch
I've written a little script that uses Mail::Box::Manager to parse an mbox
file, strip off most of the headers, decode the body, and eventually print
the data that is encoded as text/plain. It works fine for messages that
are flat (i.e., multipart/alternative on the top level) and it can just
grab the plaintext attachments from 1 level down.
I run into problems when I hit multipart/mixed messages and I have to
descend another level. I've been reading through the groups.google.com
archives and the man pages of these modules and see that applying
these items recursively is tricky for inexperienced programmers -- which
I claim to be. Can someone recommend a better way of getting to my desired
endpoint, or help me sort out how to get there using my existing approach?
I've attached the relevant portion of my code and the output of
printStructure to give a better idea of the problem domain.
#!/usr/local/bin/perl
use Mail::Box::Manager;
use Date::Parse;
use warnings;
use strict;

my $mgr = Mail::Box::Manager->new;
#my $folder_file = "/home/salvador/mail/releases";
my $folder_file = "/home/salvador/mail/releases.old";
my $folder = $mgr->open(folder => $folder_file)
    or die "Could not open folder: $!\n";

my(@subject,@sender,@body,@time);
my $x = 0;
for ($folder->messages) {
    # Keep only the headers of interest for each message.
    $subject[$x] = $_->subject;
    $sender[$x] = $_->sender->address;
    $time[$x] = $_->get('Date');
    #body[$x] = $decode = $_->decoded;
    #$_->printStructure;
    if($_->isMultipart) {
        # Only the top-level parts are examined here.
        foreach my $part($_->body->parts) {
            my $attached_head = $part->head;
            my $attached_body = $part->decoded;
            if($attached_head =~ /text\/plain/) {
                # print "$attached_head\n";
                # print "OK\n";
            }elsif($attached_head =~ /multipart\/alternative/i) {
                # Nested multipart: this is where I get stuck.
                print "$attached_head\n";
                print "Crap! How do I parse the next batch of headers?\n";
                print "$attached_body";
            }
        }
    }
    $x++;
}
PARTIAL OUTPUT OF MESSAGE STRUCTURES:
OK:
multipart/alternative: KENNEDY: AMERICANS DESERVE BETTER THAN A
REPUBLICAN BUDGET THAT LEAVES THEM BEHIND (111850 bytes)
    text/plain (47689 bytes)
    text/html (62436 bytes)
OK:
multipart/alternative: Boxer Asks Legal Scholars on Dean's 'Impeachable
Offense' Comment (10116 bytes)
    text/plain (2647 bytes)
    text/html (5495 bytes)
OK:
multipart/alternative: Sen. Jeffords' Statement on ANWR/Defense Spending
Bill (8876 bytes)
    text/plain (1030 bytes)
    text/html (5864 bytes)
FAILS TO PARSE PROPERLY:
multipart/mixed: KENNEDY: REPUBLICANS BLOCK INTELLIGENCE BILL TO AVOID THE
TRUTH OF THE WAR (202224 bytes)
    multipart/alternative (146945 bytes)
        text/plain (54877 bytes)
        text/html (91778 bytes)
    application/msexcel (53598 bytes)
....
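For reference, here is a minimal sketch of the kind of recursive descent I think is needed, so that a text/plain part is found no matter how deeply it is nested. The subroutine name collect_plain_text is made up for illustration. It assumes each message or part answers isMultipart, body->parts, contentType and decoded->string, which is my reading of the Mail::Message documentation; I have not confirmed it against every message in the folder.

```perl
use strict;
use warnings;

# Hypothetical helper: recursively collect the decoded text/plain bodies
# found anywhere below a message or message part.
# Assumes $part responds to isMultipart, body, contentType and decoded,
# as Mail::Message and Mail::Message::Part objects are documented to do.
sub collect_plain_text {
    my ($part) = @_;

    # A multipart container carries no text of its own: descend into
    # each of its parts and gather whatever they return.
    if ($part->isMultipart) {
        return map { collect_plain_text($_) } $part->body->parts;
    }

    # Leaf part: keep it only when it is plain text.
    return $part->contentType =~ m{^text/plain\b}i
        ? ($part->decoded->string)
        : ();
}
```

Inside the message loop this would be called as `push @body, join("\n", collect_plain_text($_));`, which replaces the hand-written check of one level of parts. A message that is text/plain at the top level falls straight through to the leaf case, so flat and nested messages take the same path.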