returning regex matches as lists

Jonathan Lukens

I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

Thank you,
Jonathan
 
John Machin

I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

It would help if you explained what you want the contents of the list
to be, why you want a list as opposed to a tuple or a generator or
whatever ... we can't be expected to imagine why getting groups as
tuples is "sort of a problem".

Use a concrete example, e.g.
>>> import re
>>> regex = re.compile(r'(\w+)\s+(\d+)')
>>> text = 'python 1 junk xyzzy 42 java 666'
>>> r = regex.findall(text)
>>> r
[('python', '1'), ('xyzzy', '42'), ('java', '666')]

What would you like to see instead?
 
Jonathan Lukens

What would you like to see instead?

I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:
import re
corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
terms = corporate_names.findall(sourcetext)

Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:
[u'string one', u'string two', u'string three']

...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion from
someone on the list, I just used list() on all the tuples like so:
detupled_terms = [list(term_tuple) for term_tuple in terms]
delisted_terms = [''.join(term_list) for term_list in detupled_terms]

which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

I appreciate the help.

Jonathan
 
Gabriel Genellina

On Fri, 15 Feb 2008 17:07:21 -0200, Jonathan Lukens wrote:
I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

Do you want something like this?

py> re.findall(r"([a-z]+)([0-9]+)", "foo bar3 w000 no abc123")
[('bar', '3'), ('w', '000'), ('abc', '123')]
py> re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> groups = re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
py> groups
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> [group[0] for group in groups]
['bar3', 'w000', 'abc123']
 
Gabriel Genellina

On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens wrote:
What would you like to see instead?

I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:
import re
corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([«"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([»"])')
terms = corporate_names.findall(sourcetext)

Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:
[u'string one', u'string two', u'string three']

...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion from
someone on the list, I just used list() on all the tuples like so:

The group() method of match objects does what you want:

terms = [match.group() for match in corporate_names.finditer(sourcetext)]

See http://docs.python.org/lib/match-objects.html
detupled_terms = [list(term_tuple) for term_tuple in terms]
delisted_terms = [''.join(term_list) for term_list in detupled_terms]

which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

That ''.join(...) works equally well on tuples; you don't have to convert
tuples to lists first:

delisted_terms = [''.join(term) for term in terms]
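
A caveat when choosing between the two approaches above: when a group is repeated with `*`, findall() keeps only the last repetition of that group, so joining the group tuples does not always rebuild the whole match, whereas match.group() always returns it. The sketch below uses a simplified ASCII stand-in for the pattern (the pattern and sample text are made up for illustration, not taken from the thread) and is written for Python 3:

```python
import re

# Hypothetical, simplified stand-in for the corporate-name pattern:
# a word, whitespace, a quoted name whose extra parts are hyphenated.
# The third group is repeated with *, like the one in the thread.
pattern = re.compile(r'([a-z]+\s+)("[a-z]+)(-[a-z]+)*(")')
text = 'firm "alpha-beta-gamma" and bank "delta"'

# Whole matches, straight from the match objects.
whole = [m.group() for m in pattern.finditer(text)]

# Joining the findall() tuples: the repeated group only remembers its
# last repetition ('-gamma'), so '-beta' is lost in the first item.
joined = [''.join(groups) for groups in pattern.findall(text)]

print(whole)
print(joined)
```

The two lists agree for "delta" but differ for the three-part name, which is why finditer() with group() looks like the safer choice for a pattern of this shape.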
 
Jonathan Lukens

On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens wrote:


I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:
import re
corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([«"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([»"])')
terms = corporate_names.findall(sourcetext)
Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:
[u'string one', u'string two', u'string three']
...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion from
someone on the list, I just used list() on all the tuples like so:

The group() method of match objects does what you want:

terms = [match.group() for match in corporate_names.finditer(sourcetext)]

See http://docs.python.org/lib/match-objects.html
detupled_terms = [list(term_tuple) for term_tuple in terms]
delisted_terms = [''.join(term_list) for term_list in detupled_terms]
which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

That ''.join(...) works equally well on tuples; you don't have to convert
tuples to lists first:

delisted_terms = [''.join(term) for term in terms]

Thanks Gabriel,

That is just what I was looking for.

Jonathan
 
John Machin

What would you like to see instead?

I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:
import re
corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
terms = corporate_names.findall(sourcetext)

Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:

[u'string one', u'string two', u'string three']

What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?

Other comments:
(1) raw string for improved legibility
ru'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'
(2) consider not including space at the end of a group
ru'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?
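
As a rough illustration of the points above (using the sample text from earlier in the thread rather than the real pattern, and written for Python 3): if only the whole match is wanted, non-capturing groups `(?:...)` let findall() return plain strings, and a character repeated inside [] changes nothing.

```python
import re

text = 'foo bar3 w000 no abc123'

# Capturing groups: findall() returns one tuple of groups per match.
with_groups = re.findall(r'([a-z]+)([0-9]+)', text)

# Non-capturing groups: findall() returns each whole match as a string.
without_groups = re.findall(r'(?:[a-z]+)(?:[0-9]+)', text)

# A character class is a set, so listing '<' twice is redundant.
sample = 'a<b"c'
same_set = re.findall(r'[<<"]', sample) == re.findall(r'[<"]', sample)

print(with_groups)
print(without_groups)
print(same_set)
```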
...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion from
someone on the list, I just used list() on all the tuples like so:
detupled_terms = [list(term_tuple) for term_tuple in terms]
delisted_terms = [''.join(term_list) for term_list in detupled_terms]

which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness
axis first.

Cheers,
John
 
Jonathan Lukens

John,
(1) raw string for improved legibility
ru'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'

This actually escaped my notice after I had posted -- the letters with
diacritics are incorrectly decoded Cyrillic letters -- I suppose I
could use the Unicode escape sequences (the sets [á-ñ] and [Á-Ñá-ñ] are
the Cyrillic equivalents of [a-z] and [A-Za-z]) but then suddenly the
legibility goes out the window again.
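
One possible way to keep escape sequences readable is to give the ranges names and assemble the pattern in verbose mode. This is only a sketch for Python 3.6+, under stated assumptions: the constant names and the sample text are invented for illustration, the ranges U+0410 to U+042F and U+0430 to U+044F cover the basic capital and small Cyrillic letters (not Ё/ё), and the groups are made non-capturing so that findall() returns whole matches.

```python
import re

# Assumed ranges for the basic Cyrillic alphabet (hypothetical names).
UPPER = '\u0410-\u042f'   # capital letters
LOWER = '\u0430-\u044f'   # small letters
OPEN_QUOTE = '\u00ab"'    # left angled quotation mark or a plain quote
CLOSE_QUOTE = '\u00bb"'   # right angled quotation mark or a plain quote

corporate_names = re.compile(
    rf'''
    \b[{UPPER}]{{2,}}\s+              # two or more capitals, then space
    [{OPEN_QUOTE}][{LOWER}{UPPER}]+   # opening quote and first word
    (?:\s*-?[{LOWER}{UPPER}]+)*       # further words, maybe hyphenated
    [{CLOSE_QUOTE}]                   # closing quote
    ''',
    re.VERBOSE,
)

# Made-up sample text, not from the original application.
sourcetext = 'Компания ООО «Ромашка» и ЗАО "Газ-Нефть" работают.'
terms = corporate_names.findall(sourcetext)
print(terms)
```

In Python 3, str patterns are Unicode-aware by default, so the (?u) flag from the original pattern is not needed here.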
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?

These were angled quotation marks in the original Unicode. Sorry
again. The regex matches everything it is supposed to. The extra
parentheses were because I had somehow missed the .group method and it
had only been returning what was only in the one needed set of
parentheses.
I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way".

More carefully stated: "I am self-taught, have no real training or
experience as a programmer, and would be interested in seeing how a
programmer with training and experience would go about this."

Thank you,
Jonathan
 
