Regular expressions, help?

Sania

Hi,
So I am trying to get the number of casualties from a text. The number
I need comes after 'death toll', as you can see in the variable called
text. Here is my code.
I'm pretty sure my regex is correct; I think it's the group part
that's the problem.
I am using NLTK with Python. group() grabs the string in parentheses,
I store it in deadnum, and I append deadnum to a list.

import re

deaths = []
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")
# This is the pattern in question; it captures only one digit.
dead = re.match(r".*death toll.*(\d[,\d\.]*)", text)
deadnum = dead.group(1)
deaths.append(deadnum)
print(deaths)

Any help would be appreciated,
Thank you,
Sania
 
Jussi Piitulainen

It's the regexp. The .* after "death toll" eats the input as far as it
can without making the whole match fail. The group matches only the
last digit in the text.

You could allow only non-digits before the number. Or you could look
up the variant of * that only matches as much as it must.
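
A minimal sketch of both points, assuming Python 3 and the sample text from the question. The extra capturing group around the second .* is not in the original pattern; it is added here only to show what that part consumes.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Greedy: the second .* runs as far as it can, leaving one digit for the group.
greedy = re.match(r".*death toll(.*)(\d[,\d.]*)", text)
print(repr(greedy.group(1)))  # ' at 637 and those missing at 65'
print(greedy.group(2))        # 3

# Lazy: .*? stops as soon as the rest of the pattern can match.
lazy = re.match(r".*death toll.*?(\d[,\d.]*)", text)
print(lazy.group(1))          # 637
```

The lazy form still accepts any characters before the number; it simply prefers the shortest stretch that lets the group match.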
 
Sania

Hey Thanks,
So now my regex is

dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)

But I only find 7, not 637. How is it that the group is only matching
the last digit? The whole thing is in parentheses, not just the last
part?
 
azrazer

On 19/04/2012 14:02, Sania wrote:
So now my regex is

dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)

Hi,
But there, your regex matches:
<something>death toll<anything of length <= 20>, followed by what
you capture (which is made up of at least one digit).
There are at least two issues here:
- the number of characters between death toll and the figure may be > 20
- your {0,20} is greedy => .{0,20} matches as many "." as it can
while one digit is still matched by (\d[,\d\.]*), since your group
captures a digit followed (or not) by digits, commas, or dots
=====> so " at 63" is sucked up by .{0,20} and (\d[,\d\.]*) matches
the remaining digit "7"

a solution would be to follow what Jussi suggested...
=> dead=re.match(r".*death toll\D*(\d*)", text)
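
To make the split visible, here is a small sketch, assuming Python 3. The capturing group around .{0,20} is added purely for illustration and is not part of Sania's pattern; the second pattern is the \D* suggestion above.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Extra group around .{0,20} (illustration only) shows what it swallows.
m = re.match(r".*death toll(.{0,20})(\d[,\d.]*)", text)
print(repr(m.group(1)))  # ' at 63'
print(m.group(2))        # 7

# \D cannot match a digit, so it has to stop right before the number.
fixed = re.match(r".*death toll\D*(\d*)", text)
print(fixed.group(1))    # 637
```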

Sania said:
But I only find 7, not 637. How is it that the group is only matching
the last digit?

=> the greed of .{0,20}

Sania said:
The whole thing is in parentheses, not just the last part?

Yeah, but only one digit remains by the time your group matches...

Good luck understanding regexes, they're a powerful tool! :)

best,
azra.
 
Jussi Piitulainen

It's still consuming the digits among the text that comes _before_ the
parenthesised group: the .{0,20} matches as _much_ as it _can_ without
making the whole regex fail, and the . in it matches also digits.

Try \D{0,20} to limit its matching ability to non-digits.

Try .{0,20}? to limit it to matching as _little_ as it can.

(The variant of * I referred to is *?; {} and {}? are similar.)
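
One practical consequence of a bounded gap such as \D{0,20}: when the number is further away, or missing, there is no match at all, and calling .group(1) on the result raises an AttributeError. A hedged sketch, assuming Python 3; find_death_toll is a hypothetical helper name, and re.search is swapped in here for re.match with a leading .* purely for brevity.

```python
import re

def find_death_toll(text):
    """Return the first number after 'death toll' as a string, or None."""
    m = re.search(r"death toll\D{0,20}(\d+)", text)
    return m.group(1) if m else None

print(find_death_toll("the death toll at 637 and those missing at 653"))     # 637
print(find_death_toll("the death toll remains unknown"))                     # None
# More than 20 non-digits separate the phrase from the number: no match.
print(find_death_toll("death toll, officials said late on Monday, is 637"))  # None
```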

The simplicity of regexen is deceptive. Be careful. Be surprised.
See <http://docs.python.org/library/re.html>. Keep them simple.
Consider also other means instead or in addition.
 
Jon Clements

Or just don't rely fully on a regex. For the sake of time, and of the little sanity I believe I have left, I would just do something like:

death_toll = re.search(r'death toll.*?\d+', text).group().rsplit(' ', 1)[1]

hth,

Jon.
 
Sania

Thank you all so much!

I ended up using Jussi's advice: \D{0,20}
Azrazer, what you suggested works, but I need to make sure that it
catches numbers like 6,370 as well as 637. I tried tweaking the
regex from the one in your reply, but it didn't work
(it probably would have if I were more adept). But thanks!
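
For the 6,370 case, one possible sketch, assuming Python 3. parse_toll is a hypothetical helper name, and the character class is narrowed to digits and commas (dropping the dot) on the assumption that the counts are whole numbers, so a sentence-ending full stop is not captured.

```python
import re

def parse_toll(text):
    """Hypothetical helper: number after 'death toll' as an int, or None."""
    m = re.search(r"death toll\D{0,20}(\d[\d,]*)", text)
    if m is None:
        return None
    # Dropping the commas also discards a trailing comma picked up from the prose.
    return int(m.group(1).replace(",", ""))

print(parse_toll("put the death toll at 637 and those missing at 653"))  # 637
print(parse_toll("put the death toll at 6,370, officials said"))         # 6370
```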

Jon- I kind of see what you are doing. In the regex you say that after
death toll there can be 0 or more characters followed by 1 or more
digits (although I would need to add a comma within digit so it
catches 6,370). I can also see that you are splitting each string but
I don't understand the 1 in rsplit(' ', 1)[1]. I am not really
familiar with the syntax I guess.
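
On the rsplit question: the second argument is maxsplit, the maximum number of splits to perform, counted from the right. A short sketch, assuming Python 3; the string below is only a stand-in for the kind of text .group() returns.

```python
matched = "death toll at 637"

# maxsplit=1: split once, starting from the right, at the last space.
parts = matched.rsplit(" ", 1)
print(parts)     # ['death toll at', '637']
print(parts[1])  # 637

# Without the limit, every space is a split point.
print(matched.rsplit(" "))  # ['death', 'toll', 'at', '637']
```

So rsplit(' ', 1) yields a two-item list, and [1] picks the second item, the last word.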

Thanks again!
 
Andy

If you plan on doing more work with regular expressions in the future, and you have access to a Windows machine, you may want to consider picking up a copy of RegexBuddy. I don't have any affiliation with the makers, but I have been using the software for a few years and it has saved me a lot of frustration.

Thanks,
-Andy-
 
