Finding a sentence (more than one word & punctuation (, . ;)) in a string?

Kev Jackson · Jan 11, 2006

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

Erik Veenstra · Jan 11, 2006

Kev Jackson wrote:
given this string

" <td valign=\"top\"> message</td> <td valign=\"top\"> the
message to echo.</td> <td valign=\"top\" align=\"center\">
Yes, unless data is included in a character section within
this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is
included in a character section within this element."]

?

s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

gegroet,
Erik V. - http://www.erikveen.dds.nl/
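A minimal sketch of the split-and-reject approach above, run against the sample string from the question. The variable names `s` and `cells` and the final whitespace clean-up step are assumptions added for illustration; they are not part of Erik's one-liner.

```ruby
# The sample string from the question, with the line breaks inside cells kept.
s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

# Split on every tag (plus the whitespace around it), then drop the empty
# strings that appear wherever two tags are adjacent.
cells = s.split(/\s*<[^<>]*>\s*/).reject { |x| x.empty? }

# Line breaks inside a cell survive the split, so squeeze each run of
# whitespace down to a single space.
cells = cells.map { |x| x.gsub(/\s+/, " ") }

p cells
```

The split leaves empty strings at the start and between neighbouring tags, which is why the reject step is needed.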

Robert Klemme · Jan 11, 2006

Kev said:
given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data
is included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

If you really want sentences, this will work:

s.scan /\w+(?:[\s,]+\w+)*[.;?!]/

=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]

s.scan /\w+(?:,?\s+\w+)*[.;?!]/

=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]
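For comparison, here is a small sketch that runs the first pattern above on the sample string. The names `s` and `sentences` are illustrative. Note that "message" is not returned, because the pattern requires closing punctuation, and that the scan also walks over tag names and attribute values; it only works here because none of them happens to be followed by one of `. ; ? !`.

```ruby
s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

# One or more words separated by whitespace and/or commas, ending in
# sentence punctuation.
sentences = s.scan(/\w+(?:[\s,]+\w+)*[.;?!]/)

p sentences
```

On input that contains URLs or file names inside attributes, fragments such as a host name followed by a dot would probably match as well, so this is best treated as a quick filter rather than a parser.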

Kind regards

robert

Mark Woodward · Jan 11, 2006

Hi all,

Erik Veenstra wrote:
....

s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

gegroet,
Erik V. - http://www.erikveen.dds.nl/

As a newbie I thought I'd have a go at this.
What I was trying to do was take Erik's code above, get the text between
tags into an array and then print it out as:
[message, the message to echo, Yes, unless data is included...]

I can do it by the look of things but if there are any suggestions how
to improve this I'd appreciate it. Ie is the {} the most efficient way
to fill the array? Is there a better way to print it out?

# --------------------------------
foo = " <td valign=\"top\">message</td> <td valign=\"top\">the
message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless
data is included in a character section within this element.</td> </tr> "

# I want to fill an array so I can display in the format
# [message, the message to echo, Yes, unless...]
a = Array.new

# I think I understand this.
# /\s*<[^<>]*>\s*/ = find all tags
# \s* find 0 or more spaces
# <[^<>]*> find anything between and including <>
# \s* as above
# and reject them (.reject)
# whats left (text between tags) use as x in the block |x|

# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

# Trying to find the best way to print this???
# nothing like what I want
# puts "--- print a ---"
print a

# extra space after last item
# puts "\n\n--- print \"[\" a.each{|x| print x + \", \" print \"]\" ---"
print "[ "
a.each{|x| print x + ", "}
print "]"

# close but must know array size
# puts "\n\n print \"[\" + a[0] + \", \" + a[1] + \", \" + a[2] + \"]\""
print "[" + a[0] + ", " + a[1] + ", " + a[2] + "]\n"

# probably the most 'right' output wise
puts "\n\n--- for i in 0...a.length-1 ---"
print "[ "
for i in 0...a.length-1
print a[i] + ", "
end
print a[a.length-1]
print "]"
# --------------------------------

thanks,

Mark
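One possible shortcut for the printing question, sketched with a hard-coded array standing in for the result of the split: `Array#join` places the separator only between elements, so there is no trailing comma to deal with and the array size does not need to be known. The sample contents of `a` and the name `line` are assumptions for illustration.

```ruby
# Stand-in for the array filled from the split above.
a = ["message", "the message to echo.", "Yes, unless data is included..."]

# join inserts ", " between elements only, never after the last one.
line = "[" + a.join(", ") + "]"

puts line
```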

Mark Woodward · Jan 11, 2006

Mark Woodward wrote:
....

# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

Hmm, here's the first improvement? Seems I can use a << x to append to
an array:

# x seemed to include ""??? so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}

Xavier Noria · Jan 11, 2006

Kev Jackson wrote:
given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the
message to echo.</td> <td valign=\"top\" align=\"center\">Yes,
unless data is included in a character section within this
element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included
in a character section within this element."]

There have been several simple approaches proposed in this thread
that may work for what you want. Just in case, if you needed
something more robust you could have a glance at existing Perl
modules that solve this problem like Lingua::EN::Sentence.

-- fxn

Ross Bamford · Jan 11, 2006

Mark Woodward wrote:
...

# x seemed to include empty strings so only add x to the array if not
""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

Hmm, here's the first improvement? Seems I can use a << x to append to
an array:

# x seemed to include ""??? so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}

I'm not sure what you're trying to do here, but I think split returns an
array already, operated on by reject in this case, which returns the new
array. So with Erik's code:

a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
p a
# => ["message", "the message to echo.", ... etc ... ]

I guess an alternative similar to your approach above might be:

b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty? then ary
else ary << x end }
p b
# => ["message", "the message to echo.", ... etc ... ]

Note the 'p' method, which prints out using 'inspect'. Alternatively, you
could have done:

puts b.inspect
print "#{b.inspect}\n"

and so on. Another nitpick about your example, is that in most Ruby I've
seen people tend to prefer using unless rather than !negating the
condition to if. So where you have:

if x != ""

I'd tend to use:

unless x == ""

or (more likely):

unless x.empty?
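Pulling the points above together, a short sketch: `reject` already returns the filtered array, the three printing calls give the same output, and `unless x.empty?` reads naturally when a block does need to collect items itself. The shortened `foo` string and the names `a` and `b` are illustrative only.

```ruby
foo = "<td>message</td> <td></td> <td>the message to echo.</td>"

# reject returns a new array, so no separate accumulator is needed.
b = foo.split(/\s*<[^<>]*>\s*/).reject { |x| x.empty? }

p b                      # prints b.inspect followed by a newline
puts b.inspect           # same output
print "#{b.inspect}\n"   # same output; interpolation needs the leading #

# If items really must be collected by hand, each (not reject) is the
# iterator to use, with the condition written as unless.
a = []
foo.split(/\s*<[^<>]*>\s*/).each { |x| a << x unless x.empty? }
```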

Cheers,

Mark Woodward · Jan 11, 2006

Hi Ross,

I'm not sure what you're trying to do here,

makes 2 of us ;-)

but I think split returns

an array already, operated on by reject in this case, which returns the
new array. So with Erik's code:

a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
p a
# => ["message", "the message to echo.", ... etc ... ]

Exactly what I was trying to do. I thought it had to be an array but
couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my
gibberish :-(. We all have to start somewhere I guess!

I guess an alternative similar to your approach above might be:

b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty?
then ary else ary << x end }
p b
# => ["message", "the message to echo.", ... etc ... ]

Note the 'p' method, which prints out using 'inspect'. Alternatively,
you could have done:

puts b.inspect
print "#{b.inspect}\n"

steady on! ;-)

and so on. Another nitpick about your example, is that in most Ruby
I've seen people tend to prefer using unless rather than !negating the
condition to if. So where you have:

if x != ""

I'd tend to use:

unless x == ""

or (more likely):

unless x.empty?

Nitpick away! I appreciate it. It's been a good little exercise re p,
puts, print and chaining methods etc. I've been reading the pickaxe
book, but reading's not good enough. I need to write some code. If I can
make a fool of myself here but learn something at the same time then
that's great!

Cheers,

thanks,

Ross Bamford · Jan 11, 2006

Exactly what I was trying to do. I thought it had to be an array but
couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my
gibberish :-(. We all have to start somewhere I guess!

Absolutely. My early Ruby was probably some of the least Rubyish Ruby
around

Check out the 'show_array' nonsense here at
http://roscopeco.co.uk/code/noob/basic-syn2.rb - ouch. (I later refactored
it a bit to http://roscopeco.co.uk/code/noob/arrays.html).

Nitpick away! I appreciate it. It's been a good little exercise re p,
puts, print and chaining methods etc. I've been reading the pickaxe
book, but reading's not good enough. I need to write some code. If I can
make a fool of myself here but learn something at the same time then
that's great!

Heh, I definitely know what you mean there - I have to do stuff to learn
too. That said, though, I just got my paper pickaxe (finally, this
morning!) and it's much better having something solid to refer to without
having to switch to the browser and all that, so I can at least check I'm
making sense

Cheers,

Gene Tani · Jan 11, 2006

Kev said:
given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

Mark Woodward · Jan 11, 2006

Ross said:
Heh, I definitely know what you mean there - I have to do stuff to
learn too. That said, though, I just got my paper pickaxe (finally,
this morning!) and it's much better having something solid to refer to
without having to switch to the browser and all that, so I can at least
check I'm making sense

Yeah, I've been using the PDF version of Pickaxe(vers 2) but will order
the felled trees version I think. Also 'The Ruby Way' version 2 when it
is published. What ever it takes ;-)

thanks again,

Kev Jackson · Jan 12, 2006

Gene said:
Kev Jackson wrote:

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

That is indeed what the problem domain is (did the <td> give it away!).

Basically I have a whole lot of html files and I need to re-write them
as xml (sort of docbook-ish, but not quite). I'm using builder
(fantastic bit of kit by the way), but my original files sometimes
contain things like

"<td valign=\"top\">append</td>
<td valign=\"top\">Append to an existing file (or
<a
href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String,
boolean)\" target=\"_blank\">
open a new file / overwrite an existing file</a>)?
</td>
<td valign=\"top\" align=\"center\">No - default is false.</td>"

And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My thoughts were to try a
simple conversion and then fix things manually afterwards (ie get 95% of
the conversion done through a script and then apply some elbow grease to
finish off the parts that take too much time to work out)
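In the spirit of "get 95% done by script", here is a regexp-only sketch that copes with the nested link in the sample above. It grabs everything between each opening and closing td tag, then deletes any tags left inside. The names `html` and `cells` are illustrative, and the href is shortened to the part of the URL before the `#` so it fits on one line; that shortening is an assumption made for this example.

```ruby
html = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
  <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html" target="_blank">
  open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
HTML

# Non-greedy capture of each cell body; /m lets . match line breaks.
cells = html.scan(/<td\b[^>]*>(.*?)<\/td>/m).flatten.map do |inner|
  text = inner.gsub(/<[^<>]*>/, "")   # delete nested tags such as <a ...>
  text.gsub(/\s+/, " ").strip         # collapse line breaks and spaces
end

p cells
```

Deleting nested tags outright can glue two words together when a tag is the only thing separating them, and nested tables would defeat the non-greedy match, so the leftover cases would still need manual attention.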

I'm now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I'd love to have an automated solution (there are 1000+
html docs I need to convert).

I must admit to beginning to loathe HTML's lack of structural information
- if this was a docbook file I'd have very few problems converting it (I
could choose many options), but html is so limited in its ability to
express what meaning some section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Quiz

Kev

Adam Sanderson · Jan 12, 2006

A quick scan says that you've got legit xml there, why not use REXML?
It's included in the ruby standard libs. Or any of the above html/xml
parsing libraries with xpath to pluck your values out.

REXML Docs:
http://ruby-doc.org/stdlib/

REXML Homepage:
http://www.germane-software.com/software/rexml

.adam
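A rough sketch of the REXML route, under two assumptions that do not hold for the raw sample: the cells are wrapped in a single root element (added by hand here as a tr element), and the markup is well-formed XML. The helper `all_text` and the variable names are hypothetical, and the href is shortened to fit on one line. On recent Ruby versions REXML ships as a bundled gem rather than as part of the standard library proper.

```ruby
require "rexml/document"

# Assumption: a <tr> wrapper is added so the fragment has exactly one root.
xml = <<~XML
  <tr>
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
  <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html" target="_blank">
  open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
  </tr>
XML

# Hypothetical helper: concatenate the text of a node and its descendants.
def all_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then all_text(child)
    else ""
    end
  end.join
end

doc = REXML::Document.new(xml)

# Take each td directly under the root and normalise its whitespace.
cells = doc.root.elements.to_a("td").map do |td|
  all_text(td).gsub(/\s+/, " ").strip
end

p cells
```

Real-world HTML is often not well-formed (unclosed tags, unquoted attributes), in which case REXML raises a parse error and one of the HTML-specific libraries mentioned earlier in the thread is likely the better fit.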
