strings and regex's

jdm

I'm parsing some html and have a table-driven state machine that configures
itself by reading tuples (one per line) of "present state", "pattern to match",
and "next state" with this code:

lineArray=Array.new()
stateTable=Hash.new()

File.open("stateTable.txt") { |file|
  file.each { |line|
    lineArray=line.scan(/[^\s]+/)
    stateTable[lineArray[0]]=lineArray[1..-1]
  }
}

I've printed this hash out in several different ways with the same results: the
key-value pairs look as expected (no extraneous spaces, newlines, etc.). Once
the hash is set up, it drives a state machine with this code:

 1  state="html"
 2  while input=gets()                  # text lines are the s.m.'s "clock"
 3    if input.chomp().length>0         # skip blank lines
 4      if stateTable.has_key?(state)   # is current state defined by a tuple?
                                        # for now all states are defined
 5        if input=~Regexp.new(stateTable[state][0])   # change state if match
 6          state=stateTable[state][3]
 7        elsif                         # else complain
 8          print("\nline #{$NR}: no match on #{stateTable[state][0]}\n")
 9          exit
10        end

11      end   # if state in stateTable
12    end     # if input.chomp()
13  end       # while

I have confirmed multiple times, and in multiple ways, that stateTable["html"][0]
contains "<html>", yet the if on line 5 is never successful even though the first
non-blank line in the input is "<html>". I tried doing it manually by inserting
the following between lines 3 and 4:

if input=~/<html>/ ...

and this worked (moving the state machine to the next state which is "title")
but the problem repeated itself all over again in that state too. So I have no
problem pattern matching with regex literals but can't pattern match with
regex's derived from ostensibly identical strings read from a file.

For the conditional on line 5 I have also tried:
Regexp.new(Regexp.escape(stateTable[state][0]))
and
Regexp.new(stateTable[state][0].to_s)
and
Regexp.new(stateTable[state][0]).match(input) # returned nil
to no avail.

For line 5 I initially had:
if input=~stateTable[state][0]
This didn't work either and generated the following warning:
warning: string=~string will be obsolete; use explicit regexp

I'm using version 1.8.1 (2003-12-25) on Windows (i386-mswin32).

The point of this post is not to get better ways to parse html (but feel free to
suggest them anyway :) - the point is to find out why I can't read a string from
a file and then use it (as expected) as a regex in a match operator expression.

I humbly await searing insight and enlightenment from the collective (to which
resistance is futile in any case).
 
David A. Black

Hi --

> 1 state="html"
> 2 while input=gets() # text lines are the s.m.'s "clock"
> 3 if input.chomp().length>0 # skip blank lines

Is it possible that you need chomp! instead of chomp? (chomp returns a
new string and leaves input itself unchanged.) I'm not sure what the
regex looks like that you're testing it against later, but if it
doesn't allow a terminal \n then that may be the problem.
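
To make the concern concrete, here is a minimal sketch. The anchored pattern is purely an assumption for illustration, since the contents of stateTable.txt were not posted. It suggests that an unanchored regex built from a string still matches a line ending in "\n", while a pattern anchored with \z does not until the newline is removed:

```ruby
# Assumption: the patterns below are hypothetical; the real table was not shown.
line = "<html>\n".dup                  # what gets returns, newline included

unanchored = Regexp.new("<html>")
anchored   = Regexp.new('\A<html>\z')  # hypothetical strict pattern

unanchored_hit = (line =~ unanchored)      # trailing newline is harmless here
anchored_hit   = (line =~ anchored)        # \z rejects the trailing "\n"
chomped_hit    = (line.chomp =~ anchored)  # matches once the newline is gone

copy = line.chomp                        # chomp returns a new string
still_has_newline = line.end_with?("\n") # line itself is unchanged
line.chomp!                              # chomp! modifies line in place

p [unanchored_hit, anchored_hit, chomped_hit, still_has_newline, line]
```

If the table only holds unanchored patterns such as <html>, the newline alone would not explain the failed match.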


David
 
David A. Black

Hi --

I think my chomp theory is probably wrong.

Can you share a line or two of stateTable.txt?


David
 
Brian Schröder

I can't see the error you describe, but I'll just clean up and debug
your code a bit, maybe that will help you see the problem. I'm typing
this directly into the mail, so beware of any spelling bugs I'll
introduce.

There is a bug in line 7. The elsif there has no condition of its own,
so Ruby takes the print call on the following line as the condition.
That print runs whenever the match on line 5 fails, and it returns nil.
Therefore you get the error message but never enter the error branch
(the exit).
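
A minimal sketch of that elsif behaviour, using a hypothetical complain helper that stands in for print (it records a message and returns nil, as print does):

```ruby
# Hypothetical stand-in for print: records the message and returns nil.
def complain(log, message)
  log << message
  nil
end

def step(matched)
  log = []
  reached_body = false
  if matched
    # the state change would go here
  elsif                        # no condition on this line ...
    complain(log, "no match")  # ... so this call becomes the condition
    reached_body = true        # skipped, because the condition was nil
  end
  [log, reached_body]
end

p step(false)
p step(true)
```

When matched is false, the message is recorded but reached_body stays false, which mirrors the message being printed without exit ever running. Replacing elsif with else makes both lines part of the branch body.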

Additionally, you allow only one transition out of each state. I don't
think that was your intention. I changed the code to allow multiple
transitions per state.

A state machine is a nice thing, but it won't help you very much with
html, because it can't count, so it can't match opening and closing
tags. And if the input you are processing is normal html, you are not
feeding the s.m. tokens but lines, each of which can contain several
tokens.

It would be more efficient to create each regexp only once, when the
table is loaded, and not on every match attempt, i.e.:

state_table = Hash.new { |h, k| h[k] = [] }
StateChange = Struct.new(:pattern, :next_state)

File.open("stateTable.txt") do |file|
  file.each do |line|
    present_state, pattern, next_state = *line.scan(/[^\s]+/)
    state_table[present_state] <<
      StateChange.new(Regexp.new(pattern), next_state)
  end
end

state = "html"
while input = gets    # text lines are the s.m.'s "clock"
  input.strip!
  next if input.empty?    # skip blank lines

  raise "State undefined" unless state_table.has_key?(state)

  # collect every transition out of the current state that matches
  next_states = state_table[state].select { |state_change|
    state_change.pattern =~ input
  }

  if next_states.length == 1
    state = next_states[0].next_state
  elsif next_states.empty?
    raise "No match on #{state}: #{input}"
  else
    raise "Too many matches on #{state}: #{input}"
  end

end # while
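
To try the approach without touching any files, here is a self-contained sketch of the same idea run against in-memory data. The table rows, the sample input, and the load_table and run helper names are all assumptions made up for illustration; the real stateTable.txt was not posted.

```ruby
require "stringio"

StateChange = Struct.new(:pattern, :next_state)

# Hypothetical table: "present state", "pattern to match", "next state".
TABLE_TEXT = <<~TABLE
  html   <html>    title
  title  <head>    title
  title  <title>   body
TABLE

# Build { state => [StateChange, ...] }, compiling each pattern once.
def load_table(io)
  table = Hash.new { |h, k| h[k] = [] }
  io.each_line do |line|
    present_state, pattern, next_state = line.split
    next if present_state.nil?   # skip blank lines in the table
    table[present_state] << StateChange.new(Regexp.new(pattern), next_state)
  end
  table
end

# Feed the input line by line and return the final state.
def run(table, input_io, state)
  input_io.each_line do |line|
    line = line.strip
    next if line.empty?          # skip blank input lines
    raise "State undefined: #{state}" unless table.has_key?(state)
    matches = table[state].select { |change| change.pattern =~ line }
    raise "No match on #{state}: #{line}" if matches.empty?
    raise "Too many matches on #{state}: #{line}" if matches.length > 1
    state = matches[0].next_state
  end
  state
end

table = load_table(StringIO.new(TABLE_TEXT))
sample_input = StringIO.new("<html>\n\n<head>\n<title>Demo</title>\n")
final_state = run(table, sample_input, "html")
puts final_state
```

With this made-up table, the patterns read from text and compiled with Regexp.new match the input lines, and the machine ends in the "body" state.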

hth,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/
 
Gavin Kistner

Just an aside - you might be interested in my TagTreeScanner class,
just to look at.
http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html

(I still need to finish and upload the version with the nicer DSL
setup. But at its heart, it's a state machine running regexp against
strings to determine the next state (and build a tree along the way).)
 
