String pattern matching

J

Jim Lewis

Anyone have experience with string pattern matching?
I need a fast way to match variables to strings. Example:

string - variables
============
abcaaab - xyz
abca - xy
eeabcac - vxw

x matches abc
y matches a
z matches aab
w maches ac
v maches ee
 
F

forward

Firstly sort variable expressions by its length

xy = 'abca'
xyz = 'abcaaab'
vxw = 'eeabcac'

Expand all expressions by its values except itself

xy = 'abca'
'abca' z = 'abcaaab'
vxw = 'eeabcac'

Cut all left and right matches

xy = 'abca'
z = 'aab'
vxw = 'eeabcac'

Repeat until you can.

z = 'aab'
xy = 'abca'
vxw = 'eeabcac'

Get first variable expression intersection
- variables: x is longest intersection of xy and vxw
- value : 'abca' starting at position 2 of vxw (longest intersection
of 'abca' and 'eeabcac')
'a' starting at position 5 of vxw

Then try matching:
x='abca' (at position 2 of vxw)
x='abc' (at position 2 of vxw)
x='ab' (at position 2 of vxw)
x='a' (at position 2 of vxw)
x='a' (at position 5 of vxw)
x='' (at position arbitrary position of vxw)
and calculate the others.

Repeat all step until you can.


In your example, there are 13 results.
x='abca' y='' z='aab' v='ee' w='c'
x='abc' y='a' z='aab' v='ee' w='ac'
x='ab' y='ca' z='aab' v='ee' w='cac'
x='a' y='bca' z='aab' v='ee' w='bcac'
x='a' y='bca' z='aab' v='eeabc' w='c'
x='' y='abca' z='aab' v='' w=''
x='' y='abca' z='aab' v='e' w='eabcac'
x='' y='abca' z='aab' v='ee' w='abcac'
x='' y='abca' z='aab' v='eea' w='bcac'
x='' y='abca' z='aab' v='eeab' w='cac'
x='' y='abca' z='aab' v='eeabc' w='ac'
x='' y='abca' z='aab' v='eeabca' w='c'
x='' y='abca' z='aab' v='eeabcac' w=''

with the same original expressions

xy = 'abca'
xyz = 'abcaaab'
vxw = 'eeabcac'

Note that maximum matching is best matching in some kind of problems.

Cutting off empty values, you can get 4 results:
x='abc' y='a' z='aab' v='ee' w='ac'
x='ab' y='ca' z='aab' v='ee' w='cac'
x='a' y='bca' z='aab' v='ee' w='bcac'
x='a' y='bca' z='aab' v='eeabc' w='c'
 
J

Jim Lewis

Thanks for the interesting and detailed analysis. In my case I don't
need all possible answers by rather the first "greedy" match. Seems
like there might be some recursive approach.
 
E

Eddie Corns

Jim Lewis said:
Anyone have experience with string pattern matching?
I need a fast way to match variables to strings. Example:
string - variables
============
abcaaab - xyz
abca - xy
eeabcac - vxw
x matches abc
y matches a
z matches aab
w maches ac
v maches ee

Off topic I know but I've been learning snobol pattern matching recently so I
thought I'd give this one a bash. Here is my effort:

define('foo(str)')
&fullscan = 1
'/abcaaab/abca/eeabcac/' '/' arb $ x arb $ y arb $ z '/' *x *y '/'
+ arb $ v *x arb $ w '/' *foo(v '/' w '/' x '/' y '/' z) :F(end)
foo output = str :(FRETURN)

end

Good eh? Here's the output:

/eeabcac//abca/aab
e/eabcac//abca/aab
ee/abcac//abca/aab
eea/bcac//abca/aab
eeab/cac//abca/aab
eeabc/ac//abca/aab
eeabca/c//abca/aab
eeabcac///abca/aab
ee/bcac/a/bca/aab
eeabc/c/a/bca/aab
ee/cac/ab/ca/aab
ee/ac/abc/a/aab
ee/c/abca//aab

Removing empty entries is quite easy:

'/abcaaab/abca/eeabcac/' '/' arb $ x arb $ y arb $ z '/' *x *y '/'
+ arb $ v *x arb $ w '/' *differ(v) *differ(w) *differ(x)
+ *differ(y) *differ(z) *foo(v '/' w '/' x '/' y '/' z) :F(end)

generates:

ee/bcac/a/bca/aab
eeabc/c/a/bca/aab
ee/cac/ab/ca/aab
ee/ac/abc/a/aab

If I get time I'll try to get this working in Sam Wilmott's Python pattern
matching library.

What fun!

Eddie
 
K

Kent Johnson

Eddie said:
Off topic I know but I've been learning snobol pattern matching recently so I
thought I'd give this one a bash. Here is my effort:

define('foo(str)')
&fullscan = 1
'/abcaaab/abca/eeabcac/' '/' arb $ x arb $ y arb $ z '/' *x *y '/'
+ arb $ v *x arb $ w '/' *foo(v '/' w '/' x '/' y '/' z) :F(end)
foo output = str :(FRETURN)

end
If I get time I'll try to get this working in Sam Wilmott's Python pattern
matching library.

What fun!

Cool! I have to get in on the fun :)

This program uses Sam Wilmott's library to find one solution. It is a
simple translation of your program above. I couldn't figure out how to
get multiple solutions.
http://www.wilmott.ca/python/patternmatching.html

import string
from patterns_b import *

subject = MatchingInput ('/abcaaab/abca/eeabcac/')
Letters = AnyOfP(string.letters)[1:]

subject ^ IsP('/') & Letters >> 'x' & Letters >> 'y' & Letters >> 'z' &
IsP('/') \
& AnotherP('x') & AnotherP('y') & IsP('/') \
& Letters >> 'v' & AnotherP('x') & Letters >> 'w' & IsP('/')

print 'v =', subject['v']
print 'w =', subject['w']
print 'x =', subject['x']
print 'y =', subject['y']
print 'z =', subject['z']


Output is
v = ee
w = ac
x = abc
y = a
z = aab


Kent
 
K

Kent Johnson

Jim said:
Anyone have experience with string pattern matching?
I need a fast way to match variables to strings. Example:

string - variables
============
abcaaab - xyz
abca - xy
eeabcac - vxw

x matches abc
y matches a
z matches aab
w maches ac
v maches ee

You can do this with a regular expression, actually:

testText = '/abcaaab/abca/eeabcac/'
import re

m =
re.search(r'/(?P<x>.+)(?P<y>.+)(?P<z>.+)/(?P=x)(?P=y)/(?P<v>.+)(?P=x)(?P<w>.+)/',
testText)
print m.group('v', 'w', 'x', 'y', 'z')


Output is:
('ee', 'ac', 'abc', 'a', 'aab')

For variety, change the matches to non-greedy (.+?) and get this result:
('ee', 'bcac', 'a', 'bca', 'aab')


Kent
 
E

Eddie Corns

Kent Johnson said:
Eddie Corns wrote:
Cool! I have to get in on the fun :)
This program uses Sam Wilmott's library to find one solution. It is a
simple translation of your program above. I couldn't figure out how to
get multiple solutions.

Nice!

FWIW,
import string
from patterns_b import *

# investigate captured text and force a failure for backtracking.
# A bit clumsy perhaps! passing a lambda would be more elegant.
class DumpP (Pattern):
def Match (self, subject):
print
print 'v =', subject['v']
print 'w =', subject['w']
print 'x =', subject['x']
print 'y =', subject['y']
print 'z =', subject['z']
if 1 == 2: yield None

subject = MatchingInput ('/abcaaab/abca/eeabcac/')
Letters = AnyOfP(string.letters)[1:]

while True:
subject ^ FenceP() & IsP('/') & Letters >> 'x' & Letters >> 'y' & Letters >> 'z' & IsP('/') \
& AnotherP('x') & AnotherP('y') & IsP('/') \
& Letters >> 'v' & AnotherP('x') & Letters >> 'w' & IsP('/') & DumpP()


Not sure if it's supposed to exit in quite the way it does.

With a more compact notation, I think this beats regexs for many uses
especially when dealing with end users.

Note: if anyone else plays with this interesting pattern library, note there
are a couple of small bugs in it.

Eddie
 
J

jimlewis

Thanks to all for the various good approaches. Kent's plain RE approach
seems the most straightforward - did not know that RE can handle this
situation - good to know!
 
K

Kent Johnson

Thanks to all for the various good approaches. Kent's plain RE approach
seems the most straightforward - did not know that RE can handle this
situation - good to know!

Thanks to Eddie Corns also who showed how to express the problem as a
parsing problem.

I am also trying a pyparsing solution but I don't see a way to repeat a
previous element in pyparsing. Paul McGuire, are you listening?

Kent
 
J

Jim Lewis

You can do this with a regular expression...

I tried the plain RE approach and found it no faster than my
direct-coded version. Anyone have any ideas on how to code this problem
elegantly without RE? My code is long and cumbersome - 200 lines! Speed
is my primary concern but low LOC would be nice of course. The sezman
approach above seems a bit complex.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top