regexp problem

Joao Silva

How can I extract the number from this:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match it in many ways, but I still
have problems with escape characters.
 
Mike Cargal

Of course that depends upon how general this needs to be. If it will
always be the first part of the first parameter to a call to
Math.ceil and negated, then:

======================================================================
text = <<EOS
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

m = text.match(/Math\.ceil\(\-(\d+)/)
puts m[1] if m
======================================================================


Of course, it seems suspicious that you don't want to pick up the
minus, and this takes a lot of consistency for granted. For a
good answer, you'll need to specify which conditions will always be the
same.
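
To make the point about the minus sign concrete, here is a minimal sketch, assuming the script line from the question. It captures the sign optionally and converts the match to an Integer, so you can decide afterwards whether you want the signed value or only the magnitude. The variable names are purely illustrative.

```ruby
# Sample script line from the question (single-quoted so the "" stays literal).
text = 'document.write(setzeTT(""+Math.ceil(-123313/1000)));'

# Capture an optional minus sign together with the digits.
m = text.match(/Math\.ceil\((-?\d+)/)

if m
  signed    = m[1].to_i   # keeps the sign: -123313
  magnitude = signed.abs  # drops the sign: 123313
  puts signed
  puts magnitude
end
```

Capturing the sign costs nothing and leaves the choice to the caller; dropping it inside the regex, as in the first version, bakes in the assumption that the value is always negative.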
 
Joao Silva

Mike said:
m = text.match(/Math\.ceil\(\-(\d+)/)

I cannot use that regexp on its own. I need a regexp for this whole phrase
(<td>Traffic left:</td>.....), because the document is full of strings like
this.
 
Mike Cargal

If you're only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

One of the things you want to do with a REGEX is to avoid any more
detail than is necessary to find what you're looking for. The REGEX
does not need to "match" the whole string.
 
7stud --

Mike said:
If you're only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

The problem is that your regex will also retrieve 9999999 in this html:

<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>

and the OP is trying to tell you that he doesn't want that number.

Parsing HTML with regexes is a bad strategy.
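
A small sketch of the problem described above, assuming both cells appear in the same document. The bare pattern picks up every Math.ceil call regardless of its label. The `labelled` pattern is only one possible way to tie the number to its label with a regex; it is an illustration, not a recommendation, and it stays fragile if the markup changes.

```ruby
html = <<EOS
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

# The bare pattern matches every Math.ceil call, whatever the label says.
all_numbers = html.scan(/Math\.ceil\(-(\d+)/).flatten

# Match the exact label cell first, then skip lazily (with /m so that "."
# also crosses newlines) to the nearest Math.ceil call after it.
labelled = %r{<td>\s*Traffic left:\s*</td>.*?Math\.ceil\(-(\d+)}m
traffic_left = html[labelled, 1]

p all_numbers   # both numbers, in document order
p traffic_left  # only the number following "Traffic left:"
```

Note the remaining weakness: if the "Traffic left:" cell were ever not followed by a Math.ceil call, the lazy skip would happily run on into a later row.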
 
William James

Joao said:
How can I extract the number from this:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match it in many ways, but I
still have problems with escape characters.


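# collect the trimmed contents of every <td> cell, in document order
list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten

# walk over adjacent pairs of cells: a label followed by its value
list.each_cons(2){|a,b|
  if "Traffic left:" == a and b =~ /Math\.ceil\((-?\d+)/
    p $1
  end
}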


__END__

<td>NOT TRAFFIC LEFT:</td><td
align=right><b>
<script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
</script>
MB</b></td>

<td> Traffic left:
</td><td
align=right><b><script>
document.write(setzeTT(""+Math.ceil(-123313/1000)));
</script>
MB</b></td>
 
Rick DeNatale

[Note: parts of this message were removed to make it a legal post.]

As 7Stud pointed out, a toolbox with only regular expressions inside is
often a poor choice for dealing with xml/html.

Here's a rather verbose and commented program using a combination of hpricot
and a regular expression to do something like what I think you are looking
for:

require 'rubygems'
require 'hpricot'

def get_traffic_left_numbers(string)
  doc = Hpricot(string)
  results = []
  # iterate over all of the td elements in the document
  doc.search("td").each do |td1|
    # check to see if the td contents is "Traffic left:"
    if td1.inner_text == "Traffic left:"
      # if yes, get the next sibling
      td2 = td1.next_sibling
      # and then for each script tag inside
      td2.search("script") do |script|
        # get the script_tag text
        script_text = script.inner_text
        # Use a regexp to capture the number
        number = /Math\.ceil\(-?(\d+)/.match(script_text)
        # add the number we found, if any, to the results array
        results << number[1] if number
      end
    end
  end
  results
end

p get_traffic_left_numbers('<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>')

When run this outputs:

["123313"]

In other words it produces an array of strings representing the target
numbers in a script tag within a td tag which follows another td tag whose
inner text is "Traffic left:"

HTH
 
Igor Pirnovar

Rick said:
On Tue, Feb 10, 2009 at 3:19 AM, William James wrote:

As 7Stud pointed out, a toolbox with only regular expressions
inside is often a poor choice for dealing with xml/html

Here's a rather verbose and commented program using a
combination of hpricot and a regular expression to do
something like what I think you are looking for:

require 'rubygems'
require 'hpricot'
. . .

When run this outputs: ["123313"]

In other words it produces an array of strings representing
the target numbers in a script tag within a td tag which
follows another td tag whose inner text is "Traffic left:"

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing, much shorter, easier to understand and
requires virtually no additional learning effort. It nullifies or
"flattens" the comment started out by 7Stud that you also elevated to an
undeserving height.
 
Rick DeNatale

Igor said:
Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing,

Subjective.

Igor said:
much shorter,

Certainly, particularly with my pedagogical comments.

Igor said:
easier to understand and

I'd be quite willing to argue that.

Igor said:
requires virtually no additional learning effort.

Yes, we wouldn't want to expend any unnecessary effort on learning, would we?

And by the way, to get that to work (in Ruby 1.8) a nuby rubyist would have
to learn that you'd need to include 'enumerable' to get the cons method.

Igor said:
It nullifies or
"flattens" the comment started out by 7Stud that you also elevated to an
undeserving height.

You can treat regular expressions as a Maslovian hammer, but I've had enough
experiences with xml to realize that that hammer is often a very poor tool
for parsing html. I'd rather expend my learning budget in learning how to
apply a tool like Hpricot than to debug my own low-level attempts.

But, as they say, to each his own.
 
Igor Pirnovar

Rick said:
certainly, particularly with my pedagogical comments,

and much nicer as well as more elegant, I should add. But more
importantly William's solution is inherently packed with its own
semantics that needs no pedagogue to explain its purpose or meaning!
True, beauty is in the eyes of the beholder, but if you think of all
those engineering accomplishments that defy ageing you will certainly
notice none of them need any pedagogic, aesthetic or any other comments.
Rick said:
Yes, we wouldn't want to expend any unnecessary effort on learning
would we.

No, we most certainly would not, especially when there's absolutely no
need for it! This is why Java is such a drag. The large number of
classes that appear to be relevant to the Java environment itself has
been growing prolifically, to the point that programmers are suffocated
in "alpha.beta.gamma..." notations, never mind the unnecessary clutter
they have to memorize in order to be able to assign semantic value to
each token. You may as well write tons of pedagogic comments for every
line. In the end you do not see the trees because of the forest.
Besides, since when is a long learning curve an appreciable attribute?

Rick said:
... work (in Ruby 1.8) a nuby rubyist would have to learn that
you'd need to include 'enumerable' to get the cons method.

What can I say, any language is a constantly evolving thing, but at least
in the case of Ruby, "enumerable" represents a shift towards better
quality, which for the user means less unnecessary overhead and a smaller
learning curve. I seriously doubt that nowadays any astute Ruby newbie
seeks to learn Ruby 1.8 while ignoring Ruby 1.9. I'd much rather say it's just
the opposite, precisely because one would try to avoid learning too much
clutter.

Rick said:
I've had enough experiences with xml to realize that that
hammer is often a very poor tool for parsing html. I'd rather
expend my learning budget in learning how to apply a tool like
Hpricot than to debug my own low-level attempts.

Precisely, if your life revolves around xml and html, Hpricot may be the
better way. However, for an occasional brush with a Markup Language my
old Perl book and core Ruby should do just fine.

Cheers,
igor :)
 
William James

Rick said:
And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
have to learn that you'd need to include 'enumerable' to get the cons
method.

I didn't need to, and I'm using

ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]
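
For readers who have not met each_cons, here is a short sketch of what it yields, using a made-up list of cell strings in place of the real scan output. On 1.8.7 and later this runs as written; as far as I know, earlier 1.8 releases needed `require 'enumerator'` first.

```ruby
# Hypothetical cell contents, alternating label and script text.
cells = ["NOT TRAFFIC LEFT:", "Math.ceil(-9999999/1000)",
         "Traffic left:",     "Math.ceil(-123313/1000)"]

# each_cons(2) slides a window of two over the list:
# (cells[0], cells[1]), (cells[1], cells[2]), (cells[2], cells[3])
pairs = cells.each_cons(2).to_a

# Keep the number only when the first element of the pair is the exact label.
found = []
cells.each_cons(2) do |label, value|
  found << $1 if label == "Traffic left:" && value =~ /Math\.ceil\((-?\d+)/
end

p pairs.length
p found
```

The sliding window is what lets the label test and the number test apply to neighbouring cells without any index arithmetic.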
 
Rick DeNatale

Yes, I guess I should have said Ruby < 1.8.7.

But personally, I don't use or recommend 1.8.7, since it's really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.
 
Pit Capitain

2009/2/11 Rick DeNatale said:
But personally, I don't use or recommend 1.8.7, since it's really neither
fish nor fowl. The backporting of some things from 1.9 feels like it has
caused more problems than it is worth.

Which problems? As I've written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I'm interested in why you seem to
have had a different experience.

Regards,
Pit
 
Mark Thomas

As long as we're being pedagogical, I prefer XPath to all the previous
posted solutions.

* More accommodating to minor changes in the HTML
* Very short (one-liner) and easy to read (IMHO)

require 'nokogiri'
doc = Nokogiri::HTML(html)

puts doc.xpath('//td[contains(.,"Traffic left")]/following-sibling::td//script').to_s.scan(/Math\.ceil\(-(\d+)/)
 
w_a_x_man

What if the cell contains "No Traffic left"?
 
Mark Thomas

w_a_x_man said:
What if the cell contains "No Traffic left"?

Then you can use the XPath function starts-with() instead of contains().
 
Rick DeNatale

Pit said:
Which problems? As I've written in ruby-core, all (but one) of my
1.8.6 code works flawlessly with 1.8.7. I'm interested in why you seem to
have had a different experience.

I'm not alone. I'll refer you to the thread which Gregory Brown just opened
to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.
 
Pit Capitain

2009/2/11 Rick DeNatale said:
I'm not alone. I'll refer you to the thread which Gregory Brown just opened
to discuss the problems caused by having 1.8.7 be incompatible with 1.8.6.

So can anyone show me some 1.8.6 code that doesn't work in 1.8.7? In
the thread you mention there have been no examples yet.

Regards,
Pit
 
w_a_x_man

Mark said:
puts doc.xpath('//

// is quite cryptic.

Mark said:
td[contains(.,

..?

Mark said:
"Traffic left")]/following-sibling::td//script'

script?

Mark said:
).to_s.scan(/Math.ceil.-(\d*)/)

I'd rather use Ruby.
 
Mark Thomas

w_a_x_man said:
I'd rather use Ruby.

Would you use Ruby string functions instead of the regular expression?
You could, but you probably wouldn't want to. XPath is like regular
expressions for XML and HTML. It has a particular syntax but once you
learn it, it's very powerful.

w_a_x_man said:
// is quite cryptic.

It's the "any descendant" shorthand in XPath. So '//td' just means the td
can be anywhere in the tree, as opposed to '/td' which would be at the root.
It's no more cryptic than the .* wildcard in regexps.

td[contains(.,"Traffic left")]

The square brackets constrain the td with an expression that tests whether
the text of the current td node (that's what the . means) contains the
string "Traffic left". So this phrase says select the <td> tag(s) which
contain the string.

following-sibling::td//script

This says find the <script> tags anywhere under the <td> tags that follow
the matched one as siblings (in document order).

XPath isn't hard to learn. And it's well worth the investment.
 
