regexp

  • Thread starter Jayme Assuncao Casimiro
Jayme Assuncao Casimiro

I have this piece of html text from Amazon.com

<dt><b><a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies</a></b>
~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
Ron Gielgun / Hardcover / Published 1998
<br>
Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
(30%)</font></NOBR>
<br>
<a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804"><i>Read
more about this title...</i></a>
<p>

And I would like to use only one regexp to extract the title, the price,
and the discount in percent.

For the example above, it would be:
title = 1 Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies
Price = $13.97
Discount = 30%

I have used:
($title) = $_ =~ m{<a.*?>(.*?)</a>};
($price) = $_ =~ m{.*Our Price:\s(\$?[\d\,.]+)};
($descount) = $_ =~ m{.*You Save:.*?[\d\,.]+.*?([\d\,.]+)};

But I would like to use only one regexp.

Thanks
+---------------------------------------------+
| Jayme Assuncao Casimiro |
| Graduate in Computer Science |
| Master's student in Computing |
| Universidade Federal de Minas Gerais - UFMG |
+---------------------------------------------+
 
Gunnar Hjalmarsson

Jayme said:
I have used:
($title) = $_ =~ m{<a.*?>(.*?)</a>};
($price) = $_ =~ m{.*Our Price:\s(\$?[\d\,.]+)};
($descount) = $_ =~ m{.*You Save:.*?[\d\,.]+.*?([\d\,.]+)};

But I would like to use only one regexp.

So, what stops you?

($title, $price, $discount) = m{...};
------------------------------------^^^
(to be filled in with the regex)
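
A minimal sketch of the mechanism, assuming a simplified one-line input rather than the full Amazon fragment: in list context a match returns its capture groups in order, so a single pattern can fill several variables at once. The sample string and the helper name price_and_discount are illustrative assumptions, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (price, discount), or an empty list on failure.
sub price_and_discount {
    my ($text) = @_;
    # Called in list context, the match returns its captured groups in order.
    return $text =~ m{Our\s+Price:\s+(\$?[\d,.]+).*?You\s+Save:.*?\(([\d.]+%)\)}s;
}

my ($price, $discount) =
    price_and_discount('Our Price: $13.97 ~ You Save: $5.98 (30%)');
print "price: $price, discount: $discount\n";
```

The /s modifier lets .*? cross line breaks, which matters once the real fragment, with the discount on its own line, is used as input.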
 
David K. Wall

Jayme Assuncao Casimiro said:
I have this piece of html text from Amazon.com
[snip HTML]

And I would like to use only one regexp to extract the title, the price,
and the discount in percent.

Don't do that. Use one of the modules designed for parsing HTML. Using REs
to parse HTML is painful and produces easily-broken code.
 
Gunnar Hjalmarsson

David said:
Jayme Assuncao Casimiro said:
I have this piece of html text from Amazon.com

[snip HTML]

And I would like to use only one regexp to extract the title, the
price, and the discount in percent.

Don't do that. Use one of the modules designed for parsing HTML.
Using REs to parse HTML is painful and produces easily-broken code.

For extracting the first link and two other parts that are not
identified by any HTML markup? Please, David, there are more
colours in this world than black and white. ;-)

perlfaq9 is less rigid:

http://www.perldoc.com/perl5.8.0/pod/perlfaq9.html#How-do-I-remove-HTML-from-a-string-

http://www.perldoc.com/perl5.8.0/pod/perlfaq9.html#How-do-I-extract-URLs-
 
David K. Wall

Gunnar Hjalmarsson said:
David said:
Jayme Assuncao Casimiro said:
I have this piece of html text from Amazon.com

[snip HTML]

And I would like to use only one regexp to extract the title, the
price, and the discount in percent.

Don't do that. Use one of the modules designed for parsing HTML.
Using REs to parse HTML is painful and produces easily-broken code.

For extracting the first link and two other parts that are not
identified by any HTML markup? Please, David, there are more
colours in this world than black and white. ;-)

Yeah, you're right. <insert standard excuses>. Thanks for the reality
check.
 
David K. Wall

Jayme Assuncao Casimiro said:
I have this piece of html text from Amazon.com

<dt><b><a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies</a></b>
~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
Ron Gielgun / Hardcover / Published 1998
<br>
Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
(30%)</font></NOBR>
<br>
<a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804"><i>Read
more about this title...</i></a>
<p>

And I would like to use only one regexp to extract the title, the price,
and the discount in percent.

I still think you should use one of the HTML parsing modules to pull data
out of this otherwise unremarkable piece of HTML, but below is one regex
that captures all three things. It is ugly and fragile.

# $html is assumed to hold the fragment quoted above
my ($price, $title, $discount);
if ($html =~ m{
        <dt>\s*
        <b>\s*
        <a\s+href\s*=\s*"\S+">
        ([^<]+)                 # title
        </a>\s*
        </b>
        .*?
        Our\s+Price:\s+
        (\S+)                   # price
        .*?
        You\s+Save:\s+\S+\s+
        \((\S+)\)               # discount
    }xs )
{
    ($title, $price, $discount) = ($1, $2, $3);
    $title =~ s/\s+/ /g;        # collapse line breaks inside the title

    print "title: $title\n\n";
    print "price: $price\n\n";
    print "discount: $discount\n";
}
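
To try a pattern like this without fetching a page, here is a self-contained sketch that stores the first part of the posted fragment in a heredoc and wraps a lightly adapted copy of the regex in a sub. The helper name extract_book is hypothetical, and the character classes [^"]+ and [^)]+ are substitutions made for this illustration, not part of the post above.

```perl
use strict;
use warnings;

# Start of the fragment from the original post, in a non-interpolating heredoc.
my $html = <<'END_HTML';
<dt><b><a
href="/exec/obidos/ASIN/0965761762/qid=917872216/sr=1-1/002-1496444-0064804">1
Business, 2 Approaches : How to Succeed in Internet Business by Employing
Real-World Strategies</a></b>
~ <NOBR><font color=#990033>Usually ships in 2-3 days</font></NOBR><dd>
Ron Gielgun / Hardcover / Published 1998
<br>
Our Price: $13.97 ~ <NOBR><font color =#990033>You Save: $5.98
(30%)</font></NOBR>
<br>
END_HTML

# Hypothetical helper: returns (title, price, discount), or an empty list.
sub extract_book {
    my ($text) = @_;
    my ($title, $price, $discount) = $text =~ m{
        <dt>\s*<b>\s*
        <a\s+href\s*=\s*"[^"]+">
        ([^<]+)                      # title
        </a>\s*</b>
        .*? Our\s+Price:\s+ (\S+)    # price
        .*? You\s+Save:\s+\S+\s+
        \( ([^)]+) \)                # discount
    }xs or return;
    $title =~ s/\s+/ /g;             # collapse line breaks inside the title
    return ($title, $price, $discount);
}

if (my ($title, $price, $discount) = extract_book($html)) {
    print "title: $title\nprice: $price\ndiscount: $discount\n";
}
```

Returning an empty list on failure lets the caller test the match and unpack the fields in one step. The pattern still depends on the exact tag order and wording of the page, so it remains as fragile as described above.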
 
