email parsing

ra9ftm · Aug 27, 2008

It is my first script on python. Don't know is it correctly uses
modules, but it is working fine with specially with russian code pages
and mime formated messages. Also quoted-printable and base64
encoded....

It will be very good if anybody post any comments on this script. Is
it good or bad...

import email
import mailbox
from email.Header import decode_header
from email.Header import make_header
import string
import sys

outEnc="cp866"
infile=sys.argv[1]

subStrObrez = []
subStrObrez.append("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
subStrObrez.append("""~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To UNSUBSCRIBE from this forum, send an email to:""")
subStrObrez.append("~~~~~~~~~~~~~~~~~~")

# Cut yahoo info at the end of message
def obrez(strMsg):
for s in subStrObrez:
n = string.rfind(strMsg,s)
if n != -1:
return strMsg[0:n]
return strMsg

# Convert message header
def my_get_header(str):
str2=""
for val,encoding in decode_header(str):
if encoding:
str2 = str2+ val.decode(encoding)+" "
else:
str2 = str2+ val+" "
return str2

# Process the message
def proc(msg):
print 'From : '+ my_get_header(msg['From']).encode(outEnc)
print 'To : '+ my_get_header(msg['To']).encode(outEnc)
print 'Subject: '+ my_get_header(msg['Subject']).encode(outEnc)
print

if msg.is_multipart():
for part in msg.walk():
if part.get_content_type() == "text/plain":
if part.get_content_charset():
print
obrez(part.get_payload(None,True).decode(part.get_content_charset()).encode(outEnc))
else:
print obrez(part.get_payload(None,True))

else:
if msg.get_content_type() == "text/plain":
if msg.get_content_charset():
print
obrez( (msg.get_payload(None,True)).decode(msg.get_content_charset()) ).encode(outEnc)
else:
print obrez( msg.get_payload(None,True) )
else:
if msg.get_content_type() == "text/html":
if msg.get_content_charset():
print
(msg.get_payload(None,True)).decode(msg.get_content_charset()).encode(outEnc)
else:
print msg.get_payload(None,True)

####################################################################################
# The main program

f = open(infile, "rb")
m1 = mailbox.UnixMailbox(f)

RubLst=[]
RubLst.append(["[contestru]","FOTSTR"])
RubLst.append(["[russiandx]","FORUDX"])

for msg in mailbox.UnixMailbox(f,email.message_from_file):
for rub in RubLst:
if string.find(my_get_header(msg['Subject']),rub[0]) != -1 :
print "SB "+rub[1]+"@FORUM < INET"
print my_get_header(msg['Subject']).encode(outEnc)
print
proc(msg)
print
print "powered by Python"
print "/EX"

Alan Franzoni · Aug 27, 2008

ra9ftm was kind enough to say:

It is my first script on python. Don't know is it correctly uses
modules, but it is working fine with specially with russian code pages
and mime formated messages. Also quoted-printable and base64
encoded....

Some hints:

1) don't write
"xxxxxxxxx",

use

"x" * 10

instead; it's more readable and editable.

def obrez(strMsg):
for s in subStrObrez:
n = string.rfind(strMsg,s)
if n != -1:
return strMsg[0:n]
return strMsg

In Python >= 2.5 you can probably use the partition() method to make the
former function much shorter.

# Convert message header
def my_get_header(str):
str2=""
for val,encoding in decode_header(str):
if encoding:
str2 = str2+ val.decode(encoding)+" "
else:
str2 = str2+ val+" "
return str2

I'm not 100% sure what you're doing there, BTW I'd suggest you to use as
many Unicode objects as you can while working in Python, and encoding them
just when you're outputting them. It'll save you many headaches.

--
Alan Franzoni <[email protected]>
-
Remove .xyz from my email in order to contact me.
-
GPG Key Fingerprint:
5C77 9DC3 BD5B 3A28 E7BC 921A 0255 42AA FE06 8F3E

Trouble with UnicodeEncodeError and email	0	Jan 8, 2014
I made a blockchain and want to make a cryptocurrency, but my code doesn't verify hash of each block	2	Jun 2, 2024
HOWTO: Parsing email using Python part1	2	Jul 3, 2011
HOWTO: Parsing email using Python part2	1	Jul 15, 2011
Twitter Bot for Series recommendations help please	1	Oct 2, 2024
Python client/server that reads HTML body from server	1	Apr 12, 2023
Web Page Parsing/Downloading	1	Nov 22, 2013
Parsing email attachments: get_payload() produces unsaveable data	6	Oct 4, 2009

email parsing

ra9ftm

Alan Franzoni

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads