Extracting patterns after matching a regex



Mart. said:
I have been doing this to turn the email into a string
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...
'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
'Download ZIP file of packaged order:\r\n',

The mistake I see is trying to turn a list into a string, just so you
can try to parse it back again.  Just write a loop that iterates through
the list that readlines() returns.
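
A minimal sketch of such a loop, not taken from the original posts. It assumes the lines look like the sample quoted above; `find_field` is a hypothetical helper name, and the code is written so that it also runs on Python 3.

```python
def find_field(lines, name):
    """Return the value of the first 'NAME: value' line, or None."""
    prefix = name + ':'
    for line in lines:
        if line.startswith(prefix):
            # Drop the field name, then surrounding whitespace and '\r\n'.
            return line[len(prefix):].strip()
    return None

# Sample lines copied from the email quoted above.
lines = [
    'MEDIATYPE: FtpPull\r\n',
    'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
    'FTPDIR: /PullDir/0301872638CySfQB\r\n',
]

ftphost = find_field(lines, 'FTPHOST')
ftpdir = find_field(lines, 'FTPDIR')
url = 'ftp://' + ftphost + ftpdir
```

Because each line is handled on its own, there is no need for a regex or for escaping the line endings.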


No kidding.

Instead of this:
s = str(f.readlines())

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

I would have possibly done something like this (not tested):
lines = f.readlines()
header = {}
for row in lines:
    key, sep, value = row.partition(':')
    if sep:  # skip lines that have no 'KEY: value' shape
        header[key.strip().lower()] = value.strip()
url = 'ftp://' + header['ftphost'] + header['ftpdir']


Thus far I have
#!/usr/bin/env python
import sys
import re
import urllib
email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir
for i in xrange(len(m)):
   print i, ':', len(m)
   file1 = m[i][:-4]            # remove xml bit..
   file2 = m[i]

   urllib.urlretrieve(url, file1)
   urllib.urlretrieve(url, file2)
which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.

Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:

 >>> f = open(email, 'r')
 >>> lines = f.readlines()
 >>> lines
['TOTAL FILES: 2\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
11028908\n', '\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:

Using 'str' on that list then converts it to a string _representation_
of that list:

 >>> str(lines)
"['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
11028908\\n', '\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:

That just makes parsing a lot more difficult.
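
A small sketch of the difference, using two lines from the email quoted earlier in the thread. The variable names are made up for illustration, and the code is written to run on Python 3 as well.

```python
import re

lines = ['FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
         'FTPDIR: /PullDir/0301872638CySfQB\r\n']

as_repr = str(lines)      # text of the list literal, escapes spelled out
as_text = ''.join(lines)  # the actual file contents

# In the repr, each carriage return has become two characters:
# a backslash followed by the letter 'r'.
repr_has_cr = '\r' in as_repr             # no real carriage return left
repr_has_backslash_r = '\\r' in as_repr   # backslash + 'r' instead
text_has_cr = '\r' in as_text

# So the pattern has to change depending on which string is searched.
host_from_repr = re.search(r'FTPHOST: (.*?)\\r', as_repr).group(1)
host_from_text = re.search(r'FTPHOST: (.*?)\r', as_text).group(1)
```

Both searches find the same host name, but only the second pattern describes what is really in the file.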

It's much easier to just read the entire file as a single string and
then parse that:

 >>> f = open(email, 'r')
 >>> s = f.read()
 >>> s
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
 >>> import re
 >>> re.findall(r"FILENAME: (.+)", s)
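
For reference, a self-contained version of that call. The sample text is rebuilt from the fragments shown in this thread and is only an approximation of the real file; the code runs on Python 3 as well.

```python
import re

# Sample text rebuilt from the listing above (line endings are '\n' here).
s = ('TOTAL FILES: 2\n'
     '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n'
     '\t\tFILESIZE: 11028908\n'
     '\n'
     '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n'
     '\t\tFILESIZE: 18975\n')

# '.' does not match '\n', so each capture stops at the end of its line.
names = re.findall(r"FILENAME: (.+)", s)
```

With this sample, `names` holds the two file names, the `.hdf` file and its `.hdf.xml` companion, with no trailing newline.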

Ok I see what you mean...my mistake. So I have adjusted it as you suggested:

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = f.read()

# match the modis files...
m = re.findall(r"FILENAME: (.+)", s)

# get the ftp locations?
ftphost = re.search(r"FTPHOST: (.+)", s).group(1)
ftpdir = re.search(r"FTPDIR: (.+)", s).group(1)

This doesn't exclude the \r I seem to have acquired at the end of the
line, i.e.

In [45]: ftphost
Out[45]: 'e4ftl01u.ecs.nasa.gov\r'

and if I put the \\r back in

AttributeError: 'NoneType' object has no attribute 'group'
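
Both results follow from how the pattern is written. In a raw string, \r is handed to the regex engine, which reads it as a carriage return; \\r asks for a literal backslash followed by the letter r, which the file does not contain, so re.search returns None. A short sketch with a two-line sample (the last pattern is an alternative suggested here, not something from the thread):

```python
import re

s = 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\nFTPDIR: /PullDir/0301872638CySfQB\r\n'

# '.' matches '\r' (only '\n' is excluded), so '\r' ends up in the capture.
greedy = re.search(r"FTPHOST: (.+)", s).group(1)

# r"\\r" means a literal backslash then 'r'; no such text exists in s.
no_match = re.search(r"FTPHOST: (.+)\\r", s)

# r"\r" is read by the regex engine as a carriage return.
with_cr = re.search(r"FTPHOST: (.+)\r", s).group(1)

# Alternative (an assumption, not from the thread): exclude both
# line-ending characters so files with '\n'-only endings also work.
portable = re.search(r"FTPHOST: ([^\r\n]+)", s).group(1)
```

Calling `.group(1)` on `no_match` is what raises the AttributeError, since `no_match` is None.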


Excellent, thanks. Sorry, I thought I had to escape it to access it. If
it helps anyone, the script is as follows. Many thanks all.
#!/usr/bin/env python
import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = f.read()

# match the modis files...
m = re.findall(r"FILENAME: (.+)\r", s)

# get the ftp locations?
ftphost = re.search(r"FTPHOST: (.+)\r", s).group(1)
ftpdir = re.search(r"FTPDIR: (.+)\r", s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):
    print i, ':', len(m) # counter
    modis_file = str(m[i])
    urllib.urlretrieve(url, modis_file)
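
One possible variation, offered as a sketch rather than a tested replacement: it builds a separate URL for each file instead of passing the directory URL to every download. `build_download_urls` is a hypothetical helper, and the idea that each file sits directly inside the FTPDIR directory is an assumption. No network access is needed to try it, and it runs on Python 3 as well.

```python
import re

def build_download_urls(text):
    """Return (url, local_name) pairs for every FILENAME entry in text."""
    # [^\r\n]+ keeps '\r' and '\n' out of the captured values.
    ftphost = re.search(r"FTPHOST: ([^\r\n]+)", text).group(1)
    ftpdir = re.search(r"FTPDIR: ([^\r\n]+)", text).group(1)
    names = re.findall(r"FILENAME: ([^\r\n]+)", text)
    base = 'ftp://' + ftphost + ftpdir
    # Assumption: each file sits directly inside the pull directory.
    return [(base + '/' + name, name) for name in names]

# Sample text assembled from fragments quoted in this thread.
sample = ('FTPHOST: e4ftl01u.ecs.nasa.gov\r\n'
          'FTPDIR: /PullDir/0301872638CySfQB\r\n'
          '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n'
          '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n')

pairs = build_download_urls(sample)
# Each pair could then be handed to urllib.urlretrieve(url, local_name)
# on Python 2, or urllib.request.urlretrieve on Python 3.
```

Keeping the parsing in a function that takes a string makes it easy to check against a saved email before any download is attempted.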

