Mail extraction problem (something's wrong with split methods)

Luka Milkovic · Sep 11, 2004

Hello,

I have a little problem and although it's little it's extremely difficult
for me to describe it, but I'll try.
I have written a program which extracts certain portions of my received
e-mail. The content of the e-mail is actually predictable, it has one very
long list of numbers, something looking like this:

[34234,35435,657789,6756735,12312378,09678567,23424]

Of course I cannot manipulate my mail while connected to the POP3 server,
so I decided to transfer the mail locally, write it to a file, and then
manipulate it. Another problem is that e-mails contain a lot of extra
output, garbage characters and all sorts of nasty things, but somehow I
managed to solve it (to download the e-mail and extract the interesting
parts), and here is how (I'll only show the "interesting parts" part):

temp = [mail.read()]
enc_txt = "\n".join(temp)
begin = enc_txt.find(", '[")+len(", '[")
ending = enc_txt.find("]', ")

enc_txt2 = (enc_txt[begin:ending])
mail.close()
lines = enc_txt2.splitlines()
enc_txt3 = ' '.join([line.strip() for line in lines])
split = re.split(",", enc_txt3)
enc = [int(elem) for elem in split]
enc = map(int, split)

And this code works! But there is a problem: when the list of numbers is
longer than 350 bytes, at the 350th place I don't get a number but some
quotes, commas and other strange things. When the list is longer than
700 bytes, the problem occurs twice (actually it does not get that far,
because the interpreter complains at the first one, but there are two
mistakes of this type). Is there something I'm missing? Can the split
methods handle more than 350 bytes of text? What's actually happening?

To make it clearer (because I think you will not understand it
completely) I could upload the errors, but the output is large, so I'll
shorten the log.

[6964, 7086, 3211, 7522, 9472, 3265, 3610, 104, 9729, 6706, 8035, 5439,
7142, 360, 677, 1667, 1382, 9417, 4493, 8289, 9613, 3470, 889, 1021, 3381,
3480, 2483, 6579, 8928, 3240, 4437, 5908, 2290, 9587, 866, 202, 859, 2184,
8328, ..........] - the list of numbers 705 bytes long.

When I run the program (with command print split inside my code, to see
what's going on):

['6964', ' 7086', ' 3211', ' 7522', ' 9472', ' 3265', ' 3610', ' 104', '
9729', ' 6706', ' 8035', ' 5439', ' 7142', ' 360', ' 677', ' 1667', '
1382', ' 9417', ' 4493', ' 8289', ' 9613', ' 3470', ' 889', ' 1021', '
3381', ' 3480', ' 2483', ' 6579', ' 8928', ' 3240', ' 4437', ' 5908', '
2290', ' 9587', ' 866', ' 202', ' 859', ' 2184', ' 8328', ..... " 6730'",
" '", ' 6793'...... , " '", " '6573", ' 869'...]

File "OTPAenc_dec.py", line 258, in decr
enc = [int(elem) for elem in split]
ValueError: invalid literal for int(): 6730'

Please help me, any help will be appreciated.

Thanks in advance.

Sorry for my bad English and my bad expression style, I really don't know
how to explain it more thoroughly.

Diez B. Roggisch · Sep 11, 2004

File "OTPAenc_dec.py", line 258, in decr

enc = [int(elem) for elem in split]
ValueError: invalid literal for int(): 6730'

The problem is the trailing ' in your number - that of course can't be
converted. And I see that the number 6573 has similar problems - it has a
leading '.

So your splitting code does not work, or your data is malformed - without
more information, I can't say anything about that, but it seems to me the
latter is the case.

Luka Milkovic · Sep 11, 2004

The problem is the trailing ' in your number - that of course can't be
converted. And I see that the number 6573 has similar problems - it has a
leading '.

Yes, I know that, but I don't understand why it works normally for lists
under 350 bytes? It works perfectly...

So your splitting code does not work, or your data is malformed -
without more information, I can't say anything about that, but it seems
to me the latter is the case.

Data is actually not malformed, because before splitting it looks normal
(I mean, no ' or double quotes or other strange characters). The splitting
code is the problem, and I don't know how to fix it. I mean, if it would
be wrong, the smaller lists wouldn't work either, but it seems the
problems occur with big lists.

Diez B. Roggisch · Sep 11, 2004

Yes, I know that, but I don't understand why it works normally for lists
under 350 bytes? It works perfectly...

That certainly has _nothing_ to do with the size of 350 - this snippet works
perfect:

len(",".join([str(i) for i in xrange(20000)]).split(','))

Data is actually not malformed, because before splitting it looks normal
(I mean, no ' or double quotes or other strange characters). The splitting
code is the problem, and I don't know how to fix it. I mean, if it would
be wrong, the smaller lists wouldn't work either, but it seems the
problems occur with big lists.

As I proved above, it has nothing to do with that. Unless you provide actual
data I can't say more. I can only guess that 350 bytes has something to do
with line-boundaries or similar stuff - you hit some sort of special case
you didn't think of or such a thing.

Do post the data, and I'm sure things will be sorted out soon.

Peter Otten · Sep 11, 2004

Luka said:
temp = [mail.read()]
enc_txt = "\n".join(temp)

These two lines can be simplified to

enc_txt = mail.read()

begin = enc_txt.find(", '[")+len(", '[")
ending = enc_txt.find("]', ")

A guess: you are not checking whether the ending "]', " is really found. It
may well be that it is not, in which case find() returns -1 and ending is
set to -1 -- which means enc_txt2 will contain all characters from 'begin'
until the end of the string, excluding only the last character.
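
A small illustration of the point above, written for Python 3 (the thread itself uses Python 2). The sample string and the extract() helper are made up for the example; only the two markers come from Luka's code. It shows how a missing end marker goes unnoticed, and one way to check for it.

```python
# Sample text in which the closing marker "]', " is missing (made-up data).
text = "header, '[1, 2, 3"

begin = text.find(", '[") + len(", '[")
ending = text.find("]', ")       # find() returns -1 when nothing is found

print(ending)                    # -1
print(repr(text[begin:ending]))  # the slice silently drops the last character


def extract(text, start_marker=", '[", end_marker="]', "):
    """Return the text between the markers, or raise if one is missing.

    Hypothetical helper, not part of the original program.
    """
    begin = text.find(start_marker)
    if begin == -1:
        raise ValueError("start marker not found")
    begin += len(start_marker)
    ending = text.find(end_marker, begin)
    if ending == -1:
        raise ValueError("end marker not found")
    return text[begin:ending]
```

With the check in place, a mail that lacks the closing marker raises an error instead of producing a silently truncated string.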

enc_txt2 = (enc_txt[begin:ending])
mail.close()

next two lines are superfluous:

lines = enc_txt2.splitlines()
enc_txt3 = ' '.join([line.strip() for line in lines])

split = re.split(",", enc_txt3)

better IMHO:

split = enc_txt2.split(",")

you need only one of the following lines:

enc = [int(elem) for elem in split]
enc = map(int, split)

If my guess doesn't pinpoint the problem, I suggest that you post the actual
code and data that reproduces the error.

Peter

Luka Milkovic · Sep 12, 2004

That certainly has _nothing_ to do with the size of 350 - this snippet
works perfect:

len(",".join([str(i) for i in xrange(20000)]).split(','))

The thing you said here led me to a tiny clue. The splitting code is
actually not the source of the problem: I made a list that looks just like
an ordinary e-mail, many bytes long, and it did its job great.
The clue I was thinking about is the e-mail format. Is there a possibility
that mails get specially formatted after a particular size? Because my
splitting code works great, the source of the problem is obviously the
e-mail: I checked the temporary file into which the e-mail is saved _before_
splitting and it too contains these strange quotes.

My mailing code looks like this:

import smtplib
import string

def prompt(prompt):
    return raw_input(prompt).strip()

posiljatelj = prompt("From: ")            # sender
primatelj = prompt("To: ").split()        # recipients
host = prompt("SMTP host: ")
subject = "OTP"
msgin = priv                              # message body; priv is not shown in this excerpt
x = "\n---Code block---\n"
msg = ("From: %s\r\nTo: %s\r\nSubject: %s\r\n\r\n"
       % (posiljatelj, string.join(primatelj, ", "), subject))
msg = msg + x + msgin + x
print "Vas e-mail: ", msg                 # "Your e-mail: "
print ""
# "The length of your mail is ... bytes"
print "Duljina vaseg maila je " + `len(msg)` + " bajtova"
server = smtplib.SMTP(host)
server.set_debuglevel(1)
server.sendmail(posiljatelj, primatelj, msg)
server.quit()

I don't think there is a problem in my actual SMTP code; it's more likely
that the problem lies inside the protocol itself or some special formatting
style, I don't know.

As I proved above, it has nothing to do with that. Unless you provide
actual data I can't say more. I can only guess that 350 bytes has
something to do with line-boundaries or similar stuff - you hit some
sort of special case you didn't think of or such a thing.

Do post the data, and I'm sure things will be sorted out soon.

What should I actually post?

Thank you very much Diez, I appreciate your help.

Diez B. Roggisch · Sep 12, 2004

What should I actually post?

The email text. Whatever the reason for the unexpected behaviour is, it's in
there.

Thank you very much Diez, I appreciate your help.

You're welcome.

Luka Milkovic · Sep 12, 2004

The email text. Whatever the reason for the unexpected behaviour is, it's in
there.

('+OK 4815 octets', ['Received: from galileo.resean
([email protected] [193.198.128.149])', '\tby
jagor.srce.hr (8.12.10/8.12.10) with ESMTP id i8BFvSRt009065', '\tfor
<[email protected]>; Sat, 11 Sep 2004 17:57:28 +0200 (CEST)',
'Date: Sat, 11 Sep 2004 17:57:28 +0200 (CEST)', 'Message-Id:
<[email protected]>', 'From:
(e-mail address removed)', 'To: (e-mail address removed)',
'Subject: OTP', 'X-Spam-Score: 5.544 (*****)
DATE_MISSING,MSGID_FROM_MTA_SHORT,NO_REAL_NAME', 'X-Scanned-By: MIMEDefang
2.42', 'X-Virus-Scanned: by amavisd-new at jagor.srce.hr',
'Content-Length: 4210', 'Status: ', '', '', '---Code block---', '[6964,
7086, 3211, 7522, 9472, 3265, 3610, 104, 9729, 6706, 8035, 5439, 7142,
360, 677, 1667, 1382, 9417, 4493, 8289, 9613, 3470, 889, 1021, 3381, 3480,
1385, 2027, 956, 9317, 6567, 5552, 1114, 3311, 4437, 631, 5881, 2101,
9948, 4529, 3088, 5548, 3728, 8727, 7787, 5754, 8315, 8250, 8308, 510,
8183, 4052, 9046, 8217, 5107, 8333, 7799, 4589, 209, 7465, 1010, 4459,
5984, 8272, 5311, 4458, 3565, 5747, 8460, 9845, 9305, 1662, 2650, 5290,
9725, 5743, 6679, 9896, 4776, 8586, 3075, 8824, 9369, 6957, 8564, 7165,
112, 9940, 6291, 1489, 3561, 1218, 3890, 9970, 9973, 7624, 7721, 8620,
456, 872, 4546, 926, 2687, 8884, 8598, 7544, 6857, 5363, 6686, 8579, 7937,
8290, 3578, 5411, 6375, 5596, 6860, 8392, 5300, 5927, 8211, 2232, 2194,
1388, 9047, 5384, 876, 4773, 7331, 3238, 5699, 7498, 2789, 8344, 5198,
1732, 3330, 6832, 908, 4210, 8943, 2390, 1655, 5324, 993, 6281, 2909,
2178, 9929, 40, 5060, 964, 4752, 8570, 7714, 607, 6450, 5793, 9292, 6428,
5410, 7567, 6040, 543, 3602, 8022, 4052, 7222, 6324, 6729, 1030, 299,
8641, 4312, 8614, 423, 6730', ', 6793, 3453, 9470, 9382, 2037, 4103,
6427, 5312, 1366, 6287, 2316, 5745, 6916, 1640, 2381, 7510, 1156, 1538,
3015, 1592, 4136, 2170, 6263, 3829, 6869, 8079, 9724, 1830, 3245, 4694,
782, 9703, 3615, 2907, 4435, 7329, 7511, 5418, 2913, 1567, 7865, 3729,
8289, 373, 5635, 8292, 9569, 4370, 8728, 3082, 7829, 4797, 9632, 8283,
2741, 7887, 6366, 9821, 1604, 1099, 3256, 2722, 8474, 6261, 8582, 6431,
1762, 8615, 9745, 599, 4078, 4779, 1469, 90, 5432, 5475, 9098, 5614, 184,
9515, 8909, 3868, 4880, 2408, 9665, 8552, 5444, 9209, 993, 9008, 1495,
1885, 3871, 4774, 8698, 5212, 1303, 6629, 6011, 4490, 9329, 1062, 4558,
4338, 2279, 8502, 473, 9650, 5787, 8329, 6816, 6858, 3868, 1854, 2991,
9958, 8931, 9276, 7837, 9372, 6732, 2402, 5453, 6012, 2958, 2593, 2258,
6599, 2127, 2214, 5839, 3947, 5270, 10093, 8043, 2905, 686, 6451, 312,
1682, 1947, 3447, 4083, 6838, 7896, 3054, 9913, 6716, 3831, 1861, 7286,
6863, 7754, 5534, 8451, 9536, 7945, 9747, 7075, 3808, 6180, 5387, 930,
9663, 7337, 3513, 9535, 4329, 6056, 2114, 8972, 8336, 9743, 5397, 3112,
8023, 3392, 1488, 1707, 8223, 9982, 4498, 1840, 962, 2471, 7919, 2731,
7935, 2826, 6904, 4150, 8780, 9697, 5955, 412, 1816, 7017, 5219, 1290,
7106, 6747, 1180, 1230, 2564, 1568, 373, 9301, 59, 9632, 4667, 7701, 9141,
6240, 3290, 7172, 4006, 8018, 5744, 1125, 4388, 7109, 7357, 5188, 841,
7950, 666, 6754, 4894, 7222, 9275, 7291, 3038, 6510, 8543, 7400, 2218,
2671, 1, 1753, 5620, 4833, 2920, 3754, 9364, 9724, 3445, 6378, 1986, 9350,
4887, 633, 6400, 4586, 1541, 5883, 2696, 306, 5971, 8164, 748, 2464, 550,
9843, 9373, 5004, 4295, 1055, 6916, 6386, 8480, 4480, 8744, 2586, ',
'6573, 869, 9277, 6960, 4871, 9340, 6119, 4271, 7572, 1230, 1213, 5534]',
'---Code block---'], 4815)

This is the original mail, sorry about the size. As you can see,
there are two problematic spots: 6730', ', and ','6573, at the end of the
mail.

I was wondering, is there any way to modify the splitting code I already
posted? What I want is for the code to parse the e-mail as usual and, when
it comes to these problematic spots, remove the unnecessary quotes and
continue parsing...

Is there anything that could be done?

Thank you once more

Luka

Diez B. Roggisch · Sep 12, 2004

I was wondering is there any way to modify my splitting code I already
posted? The thing I want to implement is that the code would parse e-mail
as usual and when it comes to these problematic spots, it removes
unnecessary quotes and continues parsing...

Is there anything that could be done?

Well, you could certainly code around these special cases - however, it
seems to me that whatever generates this mail is malfunctioning. Not on the
transport-layer, but from the thing that produces this

---Code block---

thingy.

What is that actually for? It looks as if you are trying to reinvent the
wheel and produce your own encoding scheme for binary data - instead of
doing this, I suggest you use one of the several available standards, like
uuencode or others. These are covered by standard APIs in Python as well as
in other languages. Better go for them.

Pierre Fortin · Sep 12, 2004

On Sun, 12 Sep 2004 17:32:15 +0200 Luka wrote:

This msg has already been processed by something that appears to generate
list/tuple segments... I would suspect that whatever modified the message
has a string size limitation... However, it looks like whatever
manhandled this msg just did what looks like a python print of a tuple...
If you really want to process this type of message instead of getting at
the real problem, then here's a clue...

Here, I reduced the contents to just the items...

('+OK',
['Received', # brackets, braces, parens are just text herein
'by',
'for',
'Date',
'Message-Id',
'From',
'To',
'Subject',
'X-Scanned-By: MIMEDefang 2.42',
'X-Virus-Scanned',
'Content-Length: 4210',
'Status: ',
'',
'',
'---Code block---',
'[6964, 7086, ..., 6730', # "[" is just text here
', 6793, ..., 5534]', # "]" ditto
'---Code block---'
],
4815
)

Further reducing the items shows the structure:

('s',
['s', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', 's', # headers
'', # header/body separator
'',
's', # ---Code block---
's', # 1st part
's', # 2nd part
's' # ---Code block---
],
4815 # reported msg size
)

which boils down to:

(s,[s, ..., s],i) # aka: tuple(string,list(strings),int)

So... looks like you just need to isolate the strings between the
"---Code block---" strings (could be more than 2 or just 1) and
concatenate them, then split the result...

Straight-line brute forcing it:

msg = .... # get the message as a tuple
sep = "---Code block---"
start = msg[1].index(sep)
data = msg[1][start+1:]
end = data.index(sep)
data = data[:end]
print "".join(data)[1:-1].split(", ")
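
For readers on Python 3, the same idea could be sketched roughly as below. extract_numbers() is a hypothetical name, and the sample tuple is a heavily shortened stand-in for the message Luka posted. Note that Python 3's poplib returns the lines as bytes, so real data would need decoding first; plain strings are used here to keep the sketch short.

```python
def extract_numbers(msg, sep="---Code block---"):
    """Join the lines between the two separator lines and parse the ints.

    msg is a (response, lines, octets) tuple shaped like the result of
    poplib's retr(); this helper is an illustration, not Pierre's code.
    """
    lines = msg[1]
    start = lines.index(sep)
    end = lines.index(sep, start + 1)
    joined = "".join(lines[start + 1:end])
    # int() tolerates the spaces that follow each comma.
    return [int(item) for item in joined.strip("[]").split(",")]


# Shortened sample with the list broken across three lines, as in the mail.
msg = ("+OK 4815 octets",
       ["Subject: OTP", "", "---Code block---",
        "[6964, 7086, 6730", ", 6793, 3453, ", "6573, 869, 5534]",
        "---Code block---"],
       4815)

print(extract_numbers(msg))
```

Because the pieces are concatenated before splitting, it no longer matters where the list was broken into separate lines.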

This is the original mail, sorry because of the size. As you can see,
there are two problematic spots: 6730', ', and ','6573, at the end of
the mail.

Pierre

Luka Milkovic · Sep 12, 2004

Well, you could certainly code around these special cases - however, it
seems to me that whatever generates this mail is malfunctioning. Not on the
transport-layer, but from the thing that produces this

---Code block---

thingy.

The thing that produces this Code block thingy is /dev/random output.
I'm actually building a primitive one-time pad.

Here is the code (not translated, but you'll understand it, I think):

import select

duljina_grupe = 4                       # digits per group
dev_random = open("/dev/random")
P = select.poll()
P.register(dev_random.fileno(), select.POLLIN)
grupe = []                              # generated groups

# gener (how many groups to generate) is not shown in this excerpt
while len(grupe) < gener:
    grupa = ""
    while len(grupa) < duljina_grupe:
        if P.poll(0.1):
            datum = ord(dev_random.read(1))
            if datum < 200:
                grupa += "%2.2d" % (datum % 100)
        else:
            # "Not enough entropy in /dev/random! Move the mouse"
            print "Nedovoljna kolicina entropije u /dev/random! Pomaknite misa"
            # "%d groups generated so far"
            print "Za sada generirano %d grupa" % len(grupe)
    grupe.append(int(grupa))

And overlapping it with the text, transferred in ASCII format:

import pickle

grupe = []
for char in otv:                        # otv is not shown in this excerpt
    grupe.append(ord(char))
f2 = open("OTpad.pad","r")
prvi = pickle.load(f2)
enc_txt = map(lambda x,y: (x or 0) ^ (y or 0),prvi,grupe)

This enc_txt is the thingy which is mailed... Do you see any mistakes?
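
For comparison, a rough Python 3 sketch of the same XOR step. In Python 2, map() with two sequences pads the shorter one with None, which is why the lambda uses "or 0"; in Python 3 the closest counterpart is itertools.zip_longest with fillvalue=0. The function name and the sample pad values are made up for the illustration.

```python
from itertools import zip_longest


def xor_with_pad(pad, values):
    """XOR two integer sequences element by element.

    The shorter sequence is padded with 0, mirroring the
    map(lambda x, y: (x or 0) ^ (y or 0), ...) idiom from Python 2.
    """
    return [p ^ v for p, v in zip_longest(pad, values, fillvalue=0)]


pad = [6964, 7086, 3211, 7522]           # made-up pad groups
plain = [ord(ch) for ch in "Hi!"]        # text as ASCII codes

cipher = xor_with_pad(pad, plain)        # what would be mailed
recovered = xor_with_pad(pad, cipher)    # XOR with the same pad undoes it

print("".join(chr(v) for v in recovered[:len(plain)]))
```

Since the pad here is longer than the text, the result has as many items as the pad, and the trailing items are just the unused pad values.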

What is that actually for? It looks as if you try to reinvent the wheel
and produce your own encoding scheme for binary data - instead of doing
this, I suggest you use one of the several available standards, like
uuencode or others. These are covered by standard apis in python as well
as in other languages. Better go for them.

Straight-line brute forcing it:

msg = .... # get the message as a tuple
sep = "---Code block---"
start = msg[1].index(sep)
data = msg[1][start+1:]
end = data.index(sep)
data = data[:end]
print "".join(data)[1:-1].split(", ")

Thanks Pierre, I'll try something like that later.

Luka

Diez B. Roggisch · Sep 13, 2004

This enc_txt is the thingy which is mailed... Do you see any mistakes?

Nope - but the result of that operation should be a stream of ints stored in
a list. That's OK - but how do you actually _create_ the mail text - do you
use the repr() of the list? That certainly is not a good idea, as that
representation is made for human readability.

Better to use e.g. module struct to create a string out of it, and uuencode
that.

The more interesting code is what you do with your list to actually create
the mail.

Jeff Epler · Sep 13, 2004

You cannot send data with arbitrary-length lines over SMTP without using
an encoding such as quoted-printable or base64.

RFC2821
section 2.3.7 "Limits MAY be imposed on line lengths by servers"
section 4.5.3:
The maximum total length of a text line including the <CRLF>
is 1000 characters (not counting the leading dot duplicated
for transparency). This number may be increased by the use
of SMTP Service Extensions.
(The relevant SMTP Service Extension in this case being RFC1652, 8BITMIME)

While this is not the limit you seem to be running into (you said
problems happened at 350 bytes) you should be aware of this (and lots of
other details about SMTP and other mail protocols) before you write
something to run "on top of" SMTP.
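
To make the line-length point concrete, here is a small Python 3 sketch with made-up numbers. It assumes the list is sent as its repr() on a single line, which exceeds the 1000-character limit, and shows that the standard base64 module breaks its output into short lines that stay well under it.

```python
import base64

numbers = list(range(1000, 1400))        # made-up data, 400 four-digit numbers
raw = repr(numbers).encode("ascii")      # the whole list as one long line
print(len(raw))                          # far more than 1000 characters

# encodebytes() inserts a newline after every 76 output characters.
encoded = base64.encodebytes(raw)
longest = max(len(line) for line in encoded.splitlines())
print(longest)                           # 76

# Decoding gives back exactly the original bytes.
assert base64.decodebytes(encoded) == raw
```

The receiving side only has to collect the encoded lines and pass them to base64.decodebytes(), which ignores the line breaks.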

If you want to send a Python data structure across the network, you
might use pickle and then the email module to create a properly
MIME-encoded message:
import pickle, email.Message
bytes = pickle.dumps(my_structure)
message = email.Message.Message()
message.set_type("application/x-luka-milkovic")
message.set_payload(bytes.encode("base64"))
# set other headers as needed
message = str(message)

If your structure is actually a sequence of fixed-width integers, then
you might be happy using struct.pack() instead of pickle:
import struct
l = len(my_structure)
# "H" is for numbers in range(65536), "!" uses network byte-order
bytes = struct.pack("!" + "H"*l, *my_structure)

In either case, you perform the reverse steps on the reassembled message
on the other end:
import email.Parser
decoded_message = email.Parser.Parser().parsestr(message)
decoded_message.get_type() # must be application/x-luka-milkovic
# or this is not a message from your
# program
bytes = decoded_message.get_payload().decode("base64")
# now pickle.loads or struct.unpack the bytes
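
The module names above are the Python 2 ones. A rough Python 3 sketch of the same round trip might look like this; build_message() and read_message() are hypothetical names, the Subject header is only an example, and the content type is the made-up one from the snippet above.

```python
import base64
import struct
from email import message_from_string
from email.message import Message

MIME_TYPE = "application/x-luka-milkovic"    # made-up type, as above


def build_message(numbers):
    """Pack the numbers as unsigned 16-bit ints and wrap them in a message."""
    payload = struct.pack("!%dH" % len(numbers), *numbers)
    msg = Message()
    msg["Subject"] = "OTP"
    msg["Content-Type"] = MIME_TYPE
    msg["Content-Transfer-Encoding"] = "base64"
    msg.set_payload(base64.encodebytes(payload).decode("ascii"))
    return msg.as_string()


def read_message(text):
    """Reverse build_message(): parse, check the type, decode, unpack."""
    msg = message_from_string(text)
    if msg.get_content_type() != MIME_TYPE:
        raise ValueError("not a message from this program")
    payload = msg.get_payload(decode=True)   # base64 is undone here
    return list(struct.unpack("!%dH" % (len(payload) // 2), payload))


numbers = [6964, 7086, 10093, 1]
print(read_message(build_message(numbers)))
```

Setting the Content-Transfer-Encoding header is what lets get_payload(decode=True) undo the base64 step on the receiving side.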

Jeff

Luka Milkovic · Sep 13, 2004

In either case, you perform the reverse steps on the reassembled message
on the other end:
import email.Parser
decoded_message = email.Parser.Parser().parsestr(message)
decoded_message.get_type() # must be application/x-luka-milkovic
# or this is not a message from your
# program
bytes = decoded_message.get_payload().decode("base64")
# now pickle.loads or struct.unpack the bytes

Jeff

Hi Jeff, thanks for the info about SMTP protocol, I knew I was doing
something wrong, and now I know what exactly went wrong. I was thinking
about encoding, but at the time of the development of the sending part, I
decided to postpone it, and that seems to be a big mistake.

But, you mentioned MIME types and the methods of creating MIME mails. I
was always afraid of MIME because I don't understand it well (though I've
read documentation). I did what you told me to do about sending e-mail,
and it works fine. But when it comes to decoding the mail, I have some
problems. I connect to my POP3 server and download my MIME e-mail (using
command pop.retr()) and save it to a file tempMail.dat. I don't know what
to do next, I tried with the parser but it doesn't work. My code looks
like the one you've given above.

print bytes command returns nothing, a blank line... I don't know what to
do, please help.
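
The parsing code is not shown here, so the cause of the blank output cannot be confirmed from the thread. One thing worth checking, offered as an assumption rather than a diagnosis: retr() returns a (response, lines, octets) tuple, and the list of lines has to be joined back into one message before the parser can find the headers and the body. A Python 3 sketch with a made-up sample:

```python
import struct
from email import message_from_bytes


def message_from_retr(retr_result):
    """Rebuild a message object from a retr()-style tuple.

    Hypothetical helper; in Python 3 the lines are bytes objects.
    """
    response, lines, octets = retr_result
    return message_from_bytes(b"\r\n".join(lines))


# Made-up sample shaped like the result of poplib's retr().
sample = (b"+OK 120 octets",
          [b"Subject: OTP",
           b"Content-Type: application/x-luka-milkovic",
           b"Content-Transfer-Encoding: base64",
           b"",
           b"GzQbrg=="],
          120)

msg = message_from_retr(sample)
payload = msg.get_payload(decode=True)       # raw bytes after base64 decoding
values = struct.unpack("!%dH" % (len(payload) // 2), payload)
print(msg["Subject"], values)
```

If the tuple is instead written to a file with str() or repr(), the parser sees one long line of Python syntax with no blank line separating headers from body, so the payload may well come out empty.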

Thank you very much.

Luka Milkovic · Sep 13, 2004

Thanks everybody, I finally got things working. I bypassed MIME because,
as I already said, I don't understand it well, and used the base64 module
for the encoding... Everything is working as it should, thank you very much.
