Extracting patterns after matching a regex

Martin · Sep 8, 2009

Hi,

I need to extract a string after matching a regular expression. For
example, I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match "FTPHOST" I would like to extract
"e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin

MRAB · Sep 8, 2009

m = re.search(r"FTPHOST: (.*)", s)
print m.group(1)

pdpi · Sep 8, 2009

What you're doing is telling python "look for all matches of
'FTPHOST'". That doesn't really help you much, because you pretty much
expect FTPHOST to be there anyway, so finding it means squat. What you
_really_ want to tell it is "Look for things shaped like 'FTPHOST:
<ftpaddress>', and tell me what <ftpaddress> actually is". Look here:
http://docs.python.org/howto/regex.html#grouping. That'll explain how
to accomplish what you're trying to do.
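To make the difference concrete, here is a minimal sketch using the string from the original post. The `\S+` inside the group (one or more non-whitespace characters) is an assumption about what a host name looks like, not something taken from the thread.

```python
import re

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

# Without a group, findall() only reports the text that matched the pattern.
print(re.findall(r"FTPHOST", s))

# With a parenthesized group, the match object remembers what followed the label.
m = re.search(r"FTPHOST: (\S+)", s)
host = m.group(1) if m else None  # None when the label is absent
print(host)
```

Checking `m` before calling `group()` avoids an AttributeError when the label is missing from the input.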

Andreas Tawn · Sep 8, 2009

No need for regex.

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    host = s[9:]  # everything after "FTPHOST: " (9 characters)

Cheers,

Drea

Mark Tolonen · Sep 8, 2009

In regular expressions, you match the entire string you are interested in,
and parenthesize the parts that you want to parse out of that string. The
group() method is used to get the whole string with group(0), and each of
the parenthesized parts with group(n). An example:

>>> s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'
>>> re.search(r'FTPHOST: (.*)',s).group(1)
'e4ftl01u.ecs.nasa.gov'

-Mark
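As an illustration of the group numbering described above, here is a small sketch with two parenthesized parts. The pattern `(\w+): (.*)` and the variable names are chosen for illustration only; the input line is one of the fields that shows up later in the thread.

```python
import re

line = "FTPDIR: /PullDir/0301872638CySfQB"

# Two parenthesized parts: the label before the colon and the value after it.
m = re.search(r"(\w+): (.*)", line)

whole = m.group(0)  # the entire matched text
label = m.group(1)  # first parenthesized part
value = m.group(2)  # second parenthesized part
both = m.groups()   # all parenthesized parts as a tuple
```

Groups are numbered from 1 by the position of their opening parenthesis, and `groups()` returns them all at once.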

Mart. · Sep 8, 2009

m = re.search(r"FTPHOST: (.*)", s)
print m.group(1)

So the .* means to match everything after the regex? That doesn't help
in this case, as the string is placed amongst others, for example:

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',

Mart. · Sep 8, 2009

In regular expressions, you match the entire string you are interested in,
and parenthesize the parts that you want to parse out of that string. The
group() method is used to get the whole string with group(0), and each of
the parenthesized parts with group(n). An example:

>>> s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'
>>> re.search(r'FTPHOST: (.*)',s).group(1)
'e4ftl01u.ecs.nasa.gov'

-Mark

I see what you mean regarding the groups. Because my string is nested
in amongst others e.g.

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',

I get the information that follows as well. So is the only way to then
parse the new string again? I am trying to construct something that is
fairly robust, so I am not sure that just printing what comes before
the \r is the best solution.

Thanks

Terry Reedy · Sep 8, 2009

Whether or not you need re is an issue to be determined.

Just split the string on ': ' and take the second part.
Or find the position of the space and slice the remainder.
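Both suggestions can be sketched in a few lines on the single-line example. The variable names are illustrative, and the `1` passed to `split` is an addition that limits the split to the first separator.

```python
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

# Option 1: split on the ': ' separator and take the second part.
host_by_split = s.split(": ", 1)[1]

# Option 2: find the position of the first space and slice the remainder.
space_at = s.find(" ")
host_by_slice = s[space_at + 1:]
```

Neither option depends on the length of the label, which is the drawback of a fixed slice such as s[9:].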

> so the .* means to match everything after the regex? That doesn't help
> in this case

It helps in the case you presented.

> as the string is placed amongst others for example.
>
> MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> 'Ftp Pull Download Links: \r\n',

What you show above is a tuple of strings. Scan the members looking for
one where s.startswith('FTPHOST:') is true and apply the previous answer.
Or, if the above is actually meant to be one string (with quotes omitted),
split on ',' and apply the previous answer.

tjr
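A sketch of the scan over a tuple of strings, assuming the data really is a sequence of lines like the ones shown in the thread. The name `lines` and the use of `strip()` to drop the trailing \r\n are illustrative choices.

```python
lines = (
    "MEDIATYPE: FtpPull\r\n",
    "MEDIAFORMAT: FILEFORMAT\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
    "Ftp Pull Download Links: \r\n",
)

host = None
for line in lines:
    if line.startswith("FTPHOST:"):
        # Keep what follows the first colon; strip() drops the space and \r\n.
        host = line.split(":", 1)[1].strip()
        break
```

If no member starts with the label, `host` stays None, which the caller can test for.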

Mart. · Sep 8, 2009

Sorry perhaps I didn't make it clear enough, so apologies. I only
presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
thought this easily encompassed the problem. The solution presented
works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated as this also pulls out other information,
for example it prints

e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

etc. So I need to find a way to stop it before the \r.

Slicing the string wouldn't work in this scenario, as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

Many thanks

Andreas Tawn · Sep 8, 2009

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
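One caveat with split(":")[1], sketched below on two lines taken from the email shown in this thread: it works for the host, but a value that itself contains colons gets cut short. Passing 1 as the maximum number of splits is one possible way around that.

```python
host_line = "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
time_line = "FINISHED: 09/07/2009 08:42:31\r\n"

# For a value without colons, the plain split is enough.
assert host_line.split(":")[1].strip() == "e4ftl01u.ecs.nasa.gov"

# For a value that contains colons, split(":")[1] keeps only the first piece.
truncated = time_line.split(":")[1].strip()

# Limiting the split to the first colon keeps the whole value.
complete = time_line.split(":", 1)[1].strip()
```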

nn · Sep 8, 2009

It is not clear from your post what the input is really like. But just
guessing this might work:
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n',
'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n'

'e4ftl01u.ecs.nasa.gov'
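The call that goes with this guess is not shown above. A sketch consistent with the pattern that appears later in the thread, assuming the input is the text representation of a list of lines (so each line ending shows up as a backslash followed by a letter rather than as a control character):

```python
import re

lines = [
    "MEDIATYPE: FtpPull\r\n",
    "MEDIAFORMAT: FILEFORMAT\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
    "Ftp Pull Download Links: \r\n",
]
s = str(lines)  # text representation: contains a literal backslash then 'r'

# Greedy: .* runs on through the rest of the text.
greedy = re.search(r"FTPHOST: (.*)", s).group(1)

# Non-greedy: .*? stops at the first backslash-r that follows.
lazy = re.search(r"FTPHOST: (.*?)\\r", s).group(1)
```

The `?` after `.*` makes the match as short as possible, and `\\r` in the pattern matches the two characters backslash and r.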

Mart. · Sep 8, 2009

It is an email which contains information before and after the main
section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
'Down load ZIP file of packaged order:\r\n',

So not sure splitting it like you suggested works in this case.

Thanks

Mart. · Sep 8, 2009

Hi,

That does work. So the \ escapes the \r, does this tell it to stop
when it reaches the "\r"?

Thanks

pdpi · Sep 8, 2009

Except, I'm assuming, the OP's getting the data from a (windows-
formatted) file, so \r\n shouldn't be escaped in the regex:
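A minimal sketch of that difference, assuming `raw` holds text as it would come from a file whose lines end in a real carriage return and line feed. The variable names and the third, character-class pattern are illustrative additions.

```python
import re

# Assumed input: real CR and LF characters, not the two-character text "\r".
raw = "FTPHOST: e4ftl01u.ecs.nasa.gov\r\nFTPDIR: /PullDir/0301872638CySfQB\r\n"

# In a regex, \r means a carriage-return character ...
unescaped = re.search(r"FTPHOST: (.*?)\r", raw)

# ... while \\r means a backslash followed by the letter r, which raw lacks.
escaped = re.search(r"FTPHOST: (.*?)\\r", raw)

# Stopping at either line-ending character works with or without the \r.
portable = re.search(r"FTPHOST: ([^\r\n]*)", raw)
```

Depending on the Python version and how the file is opened, the \r may already have been translated away by the time the text is read, which is why the third pattern does not rely on it.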

MRAB · Sep 8, 2009

Mart. said:
I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire
file as a string:

f = open(email, 'r')
s = f.read()
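Put together with the earlier regex, reading the file in one go might look like the sketch below. The temporary file and its abridged contents are stand-ins so that the example runs on its own; in the real script the path would come from sys.argv[1].

```python
import os
import re
import tempfile

# Stand-in for the email file (contents abridged from the thread).
sample = (
    "MEDIATYPE: FtpPull\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "r") as f:  # 'with' closes the file automatically
        s = f.read()            # one string with real newlines, no list syntax
finally:
    os.remove(path)

# '.' does not match a newline, so each group stops at the end of its line.
ftphost = re.search(r"FTPHOST: (.*)", s).group(1)
ftpdir = re.search(r"FTPDIR: (.*)", s).group(1)
```

If a stray \r survives at the end of a line, calling strip() on the captured group removes it.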

Mart. · Sep 8, 2009

Within the file is a list of files, e.g.

TOTAL FILES: 2
FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
FILESIZE: 11028908

FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
FILESIZE: 18975

and what I want to do is get the ftp address from the file and collect
these files to pull down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):
    print i, ':', len(m)
    file1 = m[i][:-4] # remove xml bit.
    file2 = m[i]

    # fetch each file by name from the FTP directory
    urllib.urlretrieve(url + '/' + file1, file1)
    urllib.urlretrieve(url + '/' + file2, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.

Thanks.
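One possible tightening of the file-name match is sketched below. It spells out the shape seen in the two sample names with explicit character classes and counts instead of runs of dots. The field widths are read off those two examples only, and joining directory and file name with "/" is an assumption about the server layout. Nothing is downloaded here.

```python
import re

s = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
    "TOTAL FILES: 2\n"
    "FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"
    "FILESIZE: 11028908\n"
    "\n"
    "FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n"
    "FILESIZE: 18975\n"
)

# Widths taken from the sample names: 4 characters after MOD, A plus 7 digits,
# hNNvNN, a 3-digit field, a 13-digit field, .hdf and an optional .xml.
name_re = re.compile(
    r"MOD\w{4}\.A\d{7}\.h\d{2}v\d{2}\.\d{3}\.\d{13}\.hdf(?:\.xml)?"
)

ftphost = re.search(r"FTPHOST: (\S+)", s).group(1)
ftpdir = re.search(r"FTPDIR: (\S+)", s).group(1)
base = "ftp://" + ftphost + ftpdir

names = name_re.findall(s)
urls = [base + "/" + name for name in names]  # assumed layout: dir/name
```

Because the optional group is non-capturing, findall() returns the full names, so both the .hdf and the .hdf.xml entries come back without slicing off the suffix.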

Dave Angel · Sep 8, 2009

Mart. said:
<snip>
I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

so FTPHOST isn't the first element, it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
load ZIP file of packaged order:\r\n',
<snip>

The mistake I see is trying to turn a list into a string, just so you
can try to parse it back again. Just write a loop that iterates through
the list that readlines() returns.
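Such a loop could look like the sketch below. io.StringIO stands in for the opened email file so the example is self-contained, and collecting the fields into a dict named `fields` is an illustrative choice rather than something proposed in the thread.

```python
import io

# io.StringIO stands in for the opened email file in this sketch.
f = io.StringIO(
    "FINISHED: 09/07/2009 08:42:31\n"
    "\n"
    "MEDIATYPE: FtpPull\n"
    "MEDIAFORMAT: FILEFORMAT\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\n"
    "FTPDIR: /PullDir/0301872638CySfQB\n"
)

fields = {}
for line in f.readlines():
    # partition() splits at the first colon only, so values may contain colons.
    key, sep, value = line.partition(":")
    if sep:  # skip lines that have no colon at all
        fields[key.strip()] = value.strip()
```

After the loop, fields["FTPHOST"] and fields["FTPDIR"] hold the two values needed to build the URL.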

DaveA

MRAB · Sep 8, 2009

Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:
['TOTAL FILES: 2\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
11028908\n', '\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
18975\n']

Using 'str' on that list then converts it to a string _representation_
of that list:
"['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
11028908\\n', '\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
18975\\n']"

That just makes parsing a lot more difficult.

It's much easier to just read the entire file as a single string and
then parse that:
'TOTAL FILES: 2\n\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
11028908\n\n\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'

['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
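The expression that produced the list above is not shown. One pattern that gives the same result on the sample text is sketched here; the `\S+` group and the variable names are assumptions, and the text is built inline instead of being read from the file.

```python
import re

s = (
    "TOTAL FILES: 2\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"
    "\t\tFILESIZE: 11028908\n"
    "\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n"
    "\t\tFILESIZE: 18975\n"
)

# One group per FILENAME line; \S+ stops at the newline that ends the line.
filenames = re.findall(r"FILENAME: (\S+)", s)
```

With a single group in the pattern, findall() returns just the captured names, one per FILENAME line.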

nn · Sep 8, 2009

Hi,
I need to extract a string after a matching a regular expression. For
example I have the string...
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
and once I match "FTPHOST" I would like to extract
"e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
problem, I had been trying to match the string using something like
this:
m = re.findall(r"FTPHOST", s)
But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!
Thanks in advance for the help,
Martin
No need for regex.
s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    return s[9:]
Cheers,
Drea
Sorry, perhaps I didn't make it clear enough, so apologies. I only
presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
thought this easily encompassed the problem. The solution presented
works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated, as this also pulls out other information.
For example, it prints
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
etc. So I need to find a way to stop it before the \r.
Slicing the string wouldn't work in this scenario, as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.
Many thanks

It is not clear from your post what the input is really like. But just
guessing this might work:

'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n',
'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n'

'e4ftl01u.ecs.nasa.gov'

Except, I'm assuming, the OP's getting the data from a
(Windows-formatted) file, so \r\n shouldn't be escaped in the regex:

I am just playing the guessing game like everybody else here. Since
the OP didn't use re.DOTALL and was getting more than one line for .*
I assumed that the \n was quite literally '\' and 'n'.
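
The two guesses can be told apart with a small experiment. This is a sketch under assumptions: `lines` is a made-up two-item stand-in for what `readlines()` would return, and the variable names are illustrative. In a raw-string pattern, `\\r` matches a backslash followed by the letter r, while `\r` matches an actual carriage return.

```python
import re

# Stand-in for f.readlines() on a Windows-formatted file.
lines = [
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
]

as_repr = str(lines)      # list representation: holds backslash + 'r' as two characters
as_text = "".join(lines)  # plain text: holds real carriage returns

literal_pattern = r"FTPHOST: (.*?)\\r"  # stop at a literal backslash-r
real_pattern = r"FTPHOST: (.*?)\r"      # stop at a real carriage return

host_from_repr = re.search(literal_pattern, as_repr).group(1)
host_from_text = re.search(real_pattern, as_text).group(1)

# Each pattern finds nothing in the other form of the data.
cross_repr = re.search(real_pattern, as_repr)
cross_text = re.search(literal_pattern, as_text)

print(host_from_repr)
print(host_from_text)
print(cross_repr, cross_text)
```

So which pattern is right depends on whether the string being searched came from `str(f.readlines())` or from the file contents directly.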

nn · Sep 8, 2009

Hi,

That does work. So the \ escapes the \r, does this tell it to stop
when it reaches the "\r"?

Thanks

Indeed.
