Perl script to identify corrupt mbox messages?

Tuxedo

I'm trying to repair a very large mbox file that appears to be
corrupted: Mozilla displays only the 17 most recent of the 3087
messages actually contained in the file.

The mail application used is Mozilla on Windows. The exact same error
occurs in all Mozilla applications I tested the file with, i.e. the full
Mozilla suite, the most recent Seamonkey and the stand-alone Thunderbird.

The identical error also happens when testing the mbox in Mozilla
Thunderbird on a Linux system.

I've tried to fix the error by manually removing the message just before
the first message shown in the index, but that does not seem to be the
(only) culprit. The same mbox displays completely when viewed in, for
example, Mutt or the standard KDE mail client, KMail. In other words,
the error may be partly due to how Mozilla parses its own mbox files and
partly due to incorrectly formatted messages. Perhaps there are several
such messages, in the form of junk mail, which have been crafted,
intentionally or not, in a way that corrupts standard mbox files within
Mozilla.

Does anyone have or know of a Perl script that traverses an mbox
file and identifies incorrectly formatted mail messages in it?

Tuxedo
 
Tuxedo

Glenn Jackman wrote:

[...]
Make sure that supposedly empty lines are in fact empty (no stray spaces
or carriage returns). Check for the proper existence of /^From / lines
following a blank line (except for the first line of the mbox) which are
the indicator for a message in an mbox file.

Many thanks for those pointers!
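
For readers looking for the kind of Perl script asked about in the original question, the two checks Glenn describes could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the sub name check_mbox is made up for illustration, a message separator is taken to be any line starting with "From ", and nothing here is known to match what Mozilla's own parser accepts or rejects. It reads line by line, so memory use stays small even for a very large mbox.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# check_mbox (hypothetical name): scan an mbox line by line and collect
# structural oddities. Returns a list of strings, one per finding.
sub check_mbox {
    my ($fh) = @_;
    my @problems;
    my $prev;                       # previous line; undef before the first line
    while (my $line = <$fh>) {
        my $n = $.;                 # current line number
        $line =~ s/\n\z//;          # strip LF only, so a stray CR stays visible
        if ($line =~ /^[ \t\r]+\z/) {
            push @problems, "line $n: 'empty' line contains whitespace or CR";
        }
        if ($line =~ /^From /) {
            if (defined $prev && $prev ne '') {
                push @problems,
                    "line $n: From_ line not preceded by an empty line";
            }
        }
        elsif (!defined $prev) {
            push @problems, "line $n: file does not start with a From_ line";
        }
        $prev = $line;
    }
    return @problems;
}

# Command-line use: perl check_mbox.pl copy_of_the.mbox
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for check_mbox($fh);
    close $fh;
}
```

Note that a file still in DOS format will be reported on every blank line, and that a body line which happens to start with "From " is reported too, since the script cannot tell it apart from a misplaced separator. Run it on a copy, not on the original mailbox.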
 
Tuxedo

Petr Vileta wrote:

[...]
You can try my freeware Tbird2OE (use Google). This program is not
primarily meant for recovering mbox files, but if it works and creates
more than 17 files, then I can send you part of the Perl code for reading
and parsing an mbox file.

Ok, thanks. Your program converts folders from mbox format to separate .eml
messages. I will try it to see whether it recognizes all 3000+ messages in
the mbox. With a bit of luck, it might be possible to use Mozilla's built-in
import function to re-import the messages, and perhaps the new folder will
be fixed.
 
Dr.Ruud

Tuxedo schreef:
I'm trying to repair a gigantic mbox file that appears to have been
corrupted in that it displays only 17 of the most recent and total
3087 messages contained in the actual file.

Consider formail, it is on about every *ix system.
 
Tuxedo

Consider formail, it is on about every *ix system.

Sounds good but unfortunately the mbox belongs to a company which would not
consider anything better than their crap Windows system!
 
Dr.Ruud

Tuxedo schreef:
[attribution repaired] Ruud:
Consider formail, it is on about every *ix system.

Sounds good but unfortunately the mbox belongs to a company which
would not consider anything better than their crap Windows system!

Well, you could still move the file to a different system, or add
cygwin.
 
Tuxedo

Dr.Ruud said:
Tuxedo schreef:
[attribution repaired] Ruud:
Consider formail, it is on about every *ix system.

Sounds good but unfortunately the mbox belongs to a company which
would not consider anything better than their crap Windows system!

Well, you could still move the file to a different system, or add
cygwin.

True, but the file still needs to be parsed by the Mozilla application
running on a Windows system. In fact, the exact same error occurs on a
Linux system when placing the mbox in the Mozilla mail directory.
 
Dr.Ruud

Tuxedo schreef:
Dr.Ruud:
Tuxedo:
[attribution repaired] Ruud:
Consider formail, it is on about every *ix system.

Sounds good but unfortunately the mbox belongs to a company which
would not consider anything better than their crap Windows system!

Well, you could still move the file to a different system, or add
cygwin.

True, but the file still needs to be parsed by the Mozilla application
running on a Windows system. In fact, the exact same error occurs on a
Linux system when placing the mbox in the Mozilla mail directory.

Use formail to repair the file, is why I suggested it.
 
Tuxedo

Dr.Ruud wrote:

[...]
Use formail to repair the file, is why I suggested it.

My mistake, and thanks for the tip! I was not familiar with 'formail'
before. This interesting utility is indeed on my Linux system. But since I
have no idea what error(s) the original mailbox may contain and am not
familiar with formail, it is hard to guess how best to process it.
Nevertheless, I tried the following examples:

formail -ds <my_crappy_mbox >>reinvigorated_mbox

.... this certainly made some changes, in fact, 10 or so additional messages
appear in the Mozilla index which did not show up earlier, including a
couple without a valid sender which are now listed by Mozilla as from
foo@bar, but which appear to be file fragments, i.e. not real mail.

Most of the 3000+ messages, however, still do not show up in Mozilla.

So I tried: ...
formail -zds <my_crappy_mbox >>reinvigorated_mbox
... but this made the file no more readable in Mozilla than the previous try.

and ...
formail -rds <my_crappy_mbox >>reinvigorated_mbox
... but with the same result as the former try.

Naturally I removed the generated (.msf) index files as well as terminated
the Mozilla application between the tries, in case something would get
cached otherwise.

The Mozilla application simply appears to be choking on the mbox while
building the index. The progress bar is helplessly trying to move forward,
but then falls back, then forward a bit, and then back again, until it
finally gives up. In other words, the graphical indicator at the bottom
right of the application, which is meant to indicate the progress of
building the index, never reaches its maximum.

Perhaps the mbox contains some very odd characters, maybe part of some
attachment, which causes Mozilla but not other mail clients to choke.
Perhaps it is the result of some malformed mail circulating via zombie
machines, Outlook and whatever, that affects Mozilla on multiple platforms.
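
The "odd characters" guess above can at least be checked mechanically. The following is a minimal sketch, not a known fix: it assumes that NUL and other control bytes (everything below 0x20 except tab, LF and CR, plus 0x7F) are the kind of characters worth looking for, which is only a hypothesis about what upsets Mozilla. The sub name find_odd_bytes is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_odd_bytes (hypothetical name): report lines containing NUL or other
# control bytes. Tab, LF and CR are allowed. Returns a list of
# [message_number, line_number, description] records.
sub find_odd_bytes {
    my ($fh) = @_;
    my $msg = 0;                    # counts lines that start with "From "
    my @hits;
    while (my $line = <$fh>) {
        $msg++ if $line =~ /^From /;
        if ($line =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x7F])/) {
            push @hits,
                [$msg, $., sprintf('control byte 0x%02X', ord $1)];
        }
    }
    return @hits;
}

# Command-line use: perl find_odd_bytes.pl copy_of_the.mbox
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    printf "message %d, line %d: %s\n", @$_ for find_odd_bytes($fh);
    close $fh;
}
```

Only the first suspicious byte on each line is reported. If the list comes back empty, the characters themselves are probably not the cause and the message separators deserve a closer look instead.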
 
Peter J. Holzer

I'm trying to repair a gigantic mbox file that appears to have been
corrupted in that it displays only 17 of the most recent and total 3087
messages contained in the actual file.

The mail application used is Mozilla on Windows. [...]
The same mbox works entirely when viewed in for example MUTT or the
standard KDE mail client, Kmail.

Have you tried copying all messages to a new mbox file with mutt or
kmail and then reading the new mbox file with Mozilla?

hp
 
Mumia W.

[...]
formail -ds <my_crappy_mbox >>reinvigorated_mbox

.... this certainly made some changes, in fact, 10 or so additional messages
appear in the Mozilla index which did not show up earlier, including a
couple without a valid sender which are now listed by Mozilla as from
foo@bar, but which appear to be file fragments, i.e. not real mail.

Most of the 3000+ messages, however, still do not show up in Mozilla.

So I tried: ...
formail -zds <my_crappy_mbox >>reinvigorated_mbox
... but this made the file no more readable in Mozilla than the previous try.

and ...
formail -rds <my_crappy_mbox >>reinvigorated_mbox
... but with the same result as the former try.

Naturally I removed the generated (.msf) index files as well as terminated
the Mozilla application between the tries, in case something would get
cached otherwise.

The Mozilla application simply appears to be choking on the mbox while
building the index. The progress bar is helplessly trying to move forward,
but then falls back, then forward a bit, and then back again, until it
finally gives up. In other words, the graphical indicator at the bottom
right of the application, which is meant to indicate the progress of
building the index, never reaches its maximum.

Perhaps the mbox contains some very odd characters, maybe part of some
attachment, which causes Mozilla but not other mail clients to choke.
Perhaps it is the result of some malformed mail circulating via zombie
machines, Outlook and whatever, that affects Mozilla on multiple platforms.

Research the problem with the help of this website:
http://kb.mozillazine.org/

In particular, this article may (or may not) be of help:
http://kb.mozillazine.org/Inbox_stays_blank

Here is a script that might improve things a little:

use strict;
use warnings;
use FileHandle;
use Email::Folder;              # from CPAN
use Date::Parse qw(str2time);   # from CPAN; note the capital P
use POSIX qw(ctime);

# Source mailbox (only read) and the rewritten copy.
my $file = glob('~/tmp/mozmail/OldTests');
my $outfile = 'output.mbox';

my $fh = FileHandle->new($outfile, '>') or die("Stop: $!");
my $folder = Email::Folder->new($file);

my $count = 0;
while (my $msg = $folder->next_message) {
    # Build a Mozilla-style "From - <date>" separator from the Date header.
    # Fall back to the current time if the header is missing or unparsable.
    my $time = str2time($msg->header('Date') || '');
    $time = time unless defined $time;
    my $date = ctime($time);
    chomp $date;
    $fh->print("From - $date\n");
    $fh->print($msg->as_string() . "\n");
    $count++;
}
print "There are $count messages in the folder.\n";

$fh->close;

Email::Folder and Date::Parse are modules you can download from CPAN.
The other modules are standard parts of Perl. You should change $file
and $outfile as appropriate. You shouldn't modify the original mailbox file.

You probably won't need the script. Things should improve after you've
deleted the .msf (index) file and closed and reopened Mozilla.

(Followups set to alt.fan.mozilla)
 
Dr.Ruud

Tuxedo schreef:
Dr.Ruud:

This interesting utility is indeed on my Linux system. But
having neither a remote idea what error(s) the original mailbox may
contain, nor being familiar with formail, it is a bit complicated to
guess how to best process it. Nevertheless, I tried the following
examples:

formail -ds <my_crappy_mbox >>reinvigorated_mbox

... this certainly made some changes, in fact, 10 or so additional
messages appear in the Mozilla index which did not show up earlier,
including a couple without a valid sender which are now listed by
Mozilla as from foo@bar, but which appear to be file fragments, i.e.
not real mail.

I assume that you want to find out at which message the problem starts
and at which line in the mbox file that is, and start fixing from there.

Be careful not to introduce extra problems when moving the mbox
file from the Windows to the Linux system (maybe you should run
dos2unix on the file, and then again maybe you shouldn't; formail will
DWIM).

You can use formail together with procmail to convert from mbox to
maildir format (the one file per message in new/ cur/ tmp/ structure)
like this:

formail -defYz \
  -s procmail -m VERBOSE=yes DEFAULT="test_maildir/" /dev/null < crappy.mbx

(the "test_maildir/" will be created in the user's $HOME, include
MAILDIR="/some/path" to redirect)

The "-defYz" just lists all interesting options for this case, change at
will, see man formail.

From the maildir structure you should be able to find out at which
message the split up breaks.
 
Tuxedo

Peter said:
I'm trying to repair a gigantic mbox file that appears to have been
corrupted in that it displays only 17 of the most recent and total 3087
messages contained in the actual file.

The mail application used is Mozilla on Windows. [...]
The same mbox works entirely when viewed in for example MUTT or the
standard KDE mail client, Kmail.

Have you tried copying all messages to a new mbox file with mutt or
kmail and then reading the new mbox file with Mozilla?

Yes, that was the first thing I tried, but it didn't work :-(

I therefore assume that some crummy characters are contained within
one or more messages and/or headers, which somehow cause Mozilla to
choke, so whatever conversion is done simply carries them forward.

The file, coming from Windows, appeared to be in DOS format. I've
un-DOS'ed it, split it into individual messages with awk and recombined
them with the proper empty line followed by a ^From line, but without luck.

The file is around 150 MB and I wish I could post the entire mbox here, but
it's not mine to distribute, and it surely contains much private
communication. Maybe there is a way to encode the entire content into
Mozilla-safe characters, as this is obviously a Mozilla bug.
 
Tuxedo

Mumia said:
[...]
formail -ds <my_crappy_mbox >>reinvigorated_mbox

.... this certainly made some changes, in fact, 10 or so additional
messages appear in the Mozilla index which did not show up earlier,
including a couple without a valid sender which are now listed by
Mozilla as from foo@bar, but which appear to be file fragments, i.e. not
real mail.

Most of the 3000+ messages, however, still do not show up in Mozilla.

So I tried: ...
formail -zds <my_crappy_mbox >>reinvigorated_mbox
... but this made the file no more readable in Mozilla than the previous
try.

and ...
formail -rds <my_crappy_mbox >>reinvigorated_mbox
... but with the same result as the former try.

Naturally I removed the generated (.msf) index files as well as
terminated the Mozilla application between the tries, in case something
would get cached otherwise.

The Mozilla application simply appears to be choking on the mbox while
building the index. The progress bar is helplessly trying to move
forward, but then falls back, then forward a bit, and then back again,
until it finally gives up. In other words, the graphical indicator at
the bottom right of the application, which is meant to indicate the
progress of building the index, never reaches its maximum.

Perhaps the mbox contains some very odd characters, maybe part of some
attachment, which causes Mozilla but not other mail clients to choke.
Perhaps it is the result of some malformed mail circulating via
zombie machines, Outlook and whatever, that affects Mozilla on multiple
platforms.

Research the problem with the help of this website:
http://kb.mozillazine.org/

In particular, this article may (or may not) be of help:
http://kb.mozillazine.org/Inbox_stays_blank

Here is a script that might improve things a little:

use strict;
use warnings;
require FileHandle;
require Email::Folder;
require Date::Parse;
require POSIX;
Date::Parse->import('str2time');
POSIX->import('ctime');

my $file = glob('~/tmp/mozmail/OldTests');
my $outfile = 'output.mbox';

my $fh = FileHandle->new($outfile, '>') or die("Stop: $!");
my $folder = Email::Folder->new($file);

my $count = 0;
while (my $msg = $folder->next_message) {
my $date = $msg->header('Date');
$date = ctime(str2time($date)); chomp $date;
$fh->print("From - $date\n");
$fh->print($msg->as_string() . "\n");
$count++;
}
print "There are $count messages in the folder.\n";

$fh->close;

Email::Folder and Date::Parse are modules you can download from CPAN.
The other modules are standard parts of Perl. You should change $file
and $outfile as appropriate. You shouldn't modify the original mailbox
file.

You probably won't need the script. Things should improve after you've
deleted the .msf (index) file and closed and reopened Mozilla.

(Followups set to alt.fan.mozilla)

Excellent! However, the problem is not so easily solved. It was no
problem getting the above script running with up-to-date versions of the
two non-standard modules, and after saving the script as fixbox.pl, I
successfully tested it on a small mbox file containing only 3 messages.

However, with the real file, on a 2 GHz notebook with 512 MB of memory
and Perl 5.8.7, munching through the approximately 150 MB mbox, the
above script (or the shell) returned "Out of Memory!". The resulting
'output.mbox' file remained empty.

Personally, I'm not a fan of Mozilla mail, and looking a bit closer, I
could not find a solution to this particular issue on kb.mozillazine.org
either. The problematic mailbox is someone else's. I'm seriously
contemplating telling them to abandon 1) Windows and 2) Mozilla mail.
 
Tuxedo

Dr.Ruud said:
Tuxedo schreef:

I assume that you want to find out at which message the problem starts
and at which line in the mbox file that is, and start fixing from there.

Yes. If I only knew. I tried to split it up, delete sections and so on,
only to find the problem appearing in many sections, without knowing
exactly where. The file is just too big to locate the error manually.

Be careful not to introduce extra problems when moving the mbox
file from the Windows to the Linux system (maybe you should run
dos2unix on the file, and then again maybe you shouldn't; formail will
DWIM).

You can use formail together with procmail to convert from mbox to
maildir format (the one file per message in new/ cur/ tmp/ structure)
like this:

formail -defYz \
  -s procmail -m VERBOSE=yes DEFAULT="test_maildir/" /dev/null < crappy.mbx

(the "test_maildir/" will be created in the user's $HOME, include
MAILDIR="/some/path" to redirect)

The "-defYz" just lists all interesting options for this case, change at
will, see man formail.

From the maildir structure you should be able to find out at which
message the split up breaks.

Many thanks for all the tips, especially about formail. However, in this
particular case I will let the problematic mbox rest, because it is my
ambition not to deal with anything from the Windows user space. Realising
that it is 99% likely a Mozilla bug, my solution became to simply
transfer the file to a good old BSD mail server and let the file owner
access its content via Neomail - an exceptional Perl program which handles
mbox POP mail via a no-frills web interface. I just tested that and the
full mbox was read as well as it was in Mutt or any other native *nix
mailer.

Sooner or later the Mozilla developers will likely fix the bug, whatever
it is.
 
Peter J. Holzer

Peter said:
I'm trying to repair a gigantic mbox file that appears to have been
corrupted in that it displays only 17 of the most recent and total 3087
messages contained in the actual file.

The mail application used is Mozilla on Windows. [...]
The same mbox works entirely when viewed in for example MUTT or the
standard KDE mail client, Kmail.

Have you tried copying all messages to a new mbox file with mutt or
kmail and then reading the new mbox file with Mozilla?

Yes that was the first thing I tried, but it didn't work :-(

I assume therefore that that some crummy characters are contained within
one or more messages, or/and in headers, which somehow cause the Mozilla to
choke, and so whatever conversion is done is simply carried forward.

That sounds likely. I think the best way to identify the message(s) with
the crummy characters is to split your big mailbox into a small number
(at most 10 or so) of smaller mboxes. You can do that with formail, a
Perl (or awk) script, or any text editor you find convenient. Then try to
open each mbox. Split each mbox which Mozilla cannot open again, until
you have little mboxes with one bad message in each. Then you can check
what they have in common.
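
The splitting step could be sketched in Perl along these lines. This is a rough sketch under assumptions, not a tested tool: the sub name split_mbox, the argument order and the "prefix.NNN" output names are made up for illustration, and a new message is assumed to begin at every line starting with "From ". It copies line by line, so a 150 MB mbox should not exhaust memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_mbox (hypothetical name): copy messages from $in into numbered files
# "$prefix.000", "$prefix.001", ... and start a new file every $per_part
# messages. Returns the list of files written.
sub split_mbox {
    my ($in, $per_part, $prefix) = @_;
    my $count = 0;                  # messages seen so far
    my ($out, @files);
    while (my $line = <$in>) {
        if ($line =~ /^From / && $count++ % $per_part == 0) {
            close $out if $out;
            my $name = sprintf '%s.%03d', $prefix, scalar @files;
            open $out, '>:raw', $name or die "Cannot write $name: $!";
            push @files, $name;
        }
        # Anything before the first From_ line is dropped.
        print {$out} $line if $out;
    }
    close $out if $out;
    return @files;
}

# Command-line use: perl split_mbox.pl copy_of_the.mbox 300 part
if (@ARGV == 3) {
    open my $in, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for split_mbox($in, $ARGV[1], $ARGV[2]);
    close $in;
}
```

With about 3000 messages, a part size of 300 gives roughly ten files to try in Mozilla; any part that fails can be fed back in with a smaller part size. Because unescaped "From " lines in message bodies also count as separators here, the message counts are approximate.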

The file, coming from Windows appeared to have been in DOS format. I've
un-DOS'ed it, awk-splitted each individual message, re-combed with the
proper empty line, followed by a ^From occurance, but without luck.

The file is around 150 MB and I wish I could post the entire mbox here, but
its not mine to distribute, and it surely contains much private
communication. Maybe there is a way to encode the entire content into
Mozilla safe characters as this is obviously a Mozilla bug.

Parsing each message with MIME::Parser and then printing it out again
may help. But I would hate to do that for a whole folder of 3000
messages. It may or may not work, and you still won't know what the
problem was afterwards. Try to find the bad message(s) and convert only
these. Then you also have a test case for a bug report to Mozilla.

hp
 
Tuxedo

Peter said:
Peter said:
I'm trying to repair a gigantic mbox file that appears to have been
corrupted in that it displays only 17 of the most recent and total
3087 messages contained in the actual file.

The mail application used is Mozilla on Windows.
[...]
The same mbox works entirely when viewed in for example MUTT or the
standard KDE mail client, Kmail.

Have you tried copying all messages to a new mbox file with mutt or
kmail and then reading the new mbox file with Mozilla?

Yes that was the first thing I tried, but it didn't work :-(

I assume therefore that that some crummy characters are contained within
one or more messages, or/and in headers, which somehow cause the Mozilla
to choke, and so whatever conversion is done is simply carried forward.

That sounds likely. I think the best way to identify the message(s) with
the crummy characters is to split your big mailbox into a small number
(at most 10 or so) of smaller mboxes. You can do that with formail, a
Perl (or awk) script, or any text editor you find convenient. Then try to
open each mbox. Split each mbox which Mozilla cannot open again, until
you have little mboxes with one bad message in each. Then you can check
what they have in common.

The file, coming from Windows appeared to have been in DOS format. I've
un-DOS'ed it, awk-splitted each individual message, re-combed with the
proper empty line, followed by a ^From occurance, but without luck.

The file is around 150 MB and I wish I could post the entire mbox here,
but its not mine to distribute, and it surely contains much private
communication. Maybe there is a way to encode the entire content into
Mozilla safe characters as this is obviously a Mozilla bug.

Parsing each message with MIME::Parser and then printing it out again
may help. But I would hate to do that for a whole folder of 3000
messages. It may or may not work, and you still won't know what the
problem was afterwards. Try to find the bad message(s) and convert only
these. Then you also have a test case for a bug report to Mozilla.

That is certainly the best solution, but unfortunately it would take too
much time. I would, however, consider submitting the entire file, given
permission by the file owner and without making it public, to someone who
could identify the problem, file a bug report and potentially fix it.
 
Tuxedo

Petr said:
Tuxedo wrote:
What about my Tbird2OE? Did it help you? I want to know the result ;-)

Sorry, but unfortunately I did not get a chance to test your program,
mainly due to time constraints. Also, I have a feeling that converting to
maildir or eml and then back to mbox will not work, in that the cause
will simply survive the process and the problem will appear again in
Mozilla. Instead, my solution has been to move the mbox to a Unix system
where the user can access it via a webmail interface. The mbox is
completely readable in any mbox reader except Mozilla. Nevertheless, thanks
for your kind advice! I have bookmarked your program for another time.
 
Tuxedo

Tuxedo said:
Sorry, but unfortunately I did not get a chance to test your program,

Correction to the above: I've now tested Tbird2OE, with the following result.

All messages which existed in the mbox were successfully imported
into Outlook Express. After importing and converting the Outlook
(eml) files back into the combined Mozilla mbox format, all messages were
restored and show up in the newly created Mozilla index. In other words,
Tbird2OE effectively repaired the mbox and circumvented whatever problem
Mozilla, Seamonkey and Thunderbird have parsing an mbox containing certain
buggy messages, or perhaps inconsistencies in how they were separated.

Thanks a ton, Petr, for fixing the problem! I wish you all the best with
your application, even if my particular purpose is not exactly what
Tbird2OE was designed for.
 
