Needs help with Matching Logic

Kishore

I am comparatively new to Perl.
I am working on logic to display snippets of the text around a matched
'keyword' from a text file, just as Google does in its search
results.

I have the content of the text file in the variable $file_content.
And I have the 'keyword' in $keyword.

I need to get the string the way Google does when displaying search
results.
When I match the $keyword in the $file_content, I want to also pull 5
words before and 5 words after so I can show that snippet of the file
where the matching of the keyword occurs.

I searched Google Groups for a few days, but couldn't find
anything to help me.

I really appreciate any help I can get.

Thanks!
Kishore
 
Paul Lalli

How about something like:

m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

Using that, $1 is the series of up to five words before the match, $2 is
the match, and $3 is the series of up to five words after the match.

It'd probably have to be tweaked a bit to get exactly what you want, but
it should at least give you a starting point.
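
For anyone following along, here is a minimal sketch of that pattern in use. The helper name snippet and the sample text are made up for illustration; only the regex itself comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the list
# (before, match, after), or an empty list when nothing matches.
sub snippet {
    my ($file_content, $keyword) = @_;
    if ($file_content =~ m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/) {
        return ($1, $2, $3);
    }
    return;
}

# Made-up sample text.
my $sample = 'one two three four five six seven KEY eight nine ten eleven twelve thirteen';
my ($before, $match, $after) = snippet($sample, 'KEY');
print '...', $before, '[', $match, ']', $after, "...\n";
# prints: ...three four five six seven [KEY] eight nine ten eleven twelve...
```

Note that $1 keeps its trailing whitespace and $3 its leading whitespace, so the three pieces can simply be joined back together.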

Paul Lalli
 
Kishore

Paul Lalli said:
how about something like:

m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

Using that, $1 is the series of up to five words before the match, $2 is
the match, and $3 is the series of up to five words after the match.

It works really great.

Thank you very much.

What is the (?:...) for? I don't believe I saw this in the books I have
been referring to so far.

Thanks!
- Kishore.
 
gnari

Kishore said:
It works really great.

What is the (?:...) for? I don't believe I saw this in the books I have
been referring to so far.

(?:...)

Look up 'Extended Patterns' in
perldoc perlre.

gnari
 
Ilmari Karonen

Paul Lalli said:
m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

Using that, $1 is the series of up to five words before the match, $2 is
the match, and $3 is the series of up to five words after the match.

Note that if $keyword is supposed to be a plain string rather than a
regex, you'll need to escape metacharacters in it. An easy way to do
this is:

m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/
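
A small sketch of what the \Q...\E quoting buys you, assuming the keyword is meant literally. The helper name literal_snippet and the sample strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: treats $keyword as a literal string, not a pattern.
sub literal_snippet {
    my ($file_content, $keyword) = @_;
    if ($file_content =~ m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/) {
        return ($1, $2, $3);
    }
    return;
}

# '$' and '.' are regex metacharacters; \Q...\E makes them literal.
my $sample = 'the price rose to $9.99 (approx.) last week';
my ($before, $match, $after) = literal_snippet($sample, '$9.99');
print '...', $before, '[', $match, ']', $after, "...\n";

# Unquoted, the '.' in '9.99' would match any character, so it would
# also hit '9x99'. With quoting it does not:
my @miss = literal_snippet('the code 9x99 is not a price', '9.99');
print "no literal match\n" unless @miss;
```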

Also, this regex can be optimized a bit by noting that the only way $1
can contain fewer than 5 words is if the match occurs within the first
five words of the string. Separating that special case, we get:

m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/

This is noticeably faster if the first occurrence of $keyword isn't
near the beginning, since it saves the regex engine some needless
backtracking.
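
A quick sanity check, as a sketch only, that the split form captures the same leading context as the plain {0,5} form. The sub names and sample strings are made up, and nothing here measures speed.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns $1 (the words before the keyword).
sub before_plain {
    my ($text, $keyword) = @_;
    return $text =~ m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/
        ? $1 : undef;
}

sub before_split {
    my ($text, $keyword) = @_;
    return $text =~ m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/
        ? $1 : undef;
}

# One keyword near the start, one with more than five words before it.
for my $text ('a b KEY c', 'one two three four five six seven KEY eight') {
    my $plain = before_plain($text, 'KEY');
    my $split = before_split($text, 'KEY');
    print "plain=<$plain> split=<$split>\n";
}
```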

Also note that, if you use global matching to extract multiple
snippets from the text, the results can be unexpected if there are
multiple occurrences of $keyword near each other. In particular, if
there are fewer than 5 words between two occurrences, the second one
will be swallowed in the 5 words matched after the first one.

The easiest way to fix that is to use negative look-ahead:

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

Oddly enough, optimizing this regex the same way as before doesn't
seem to help, and seems to tickle a perl bug (probably related to \G
handling?) when used in scalar context.
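
To make the "swallowing" concrete, here is a rough sketch that runs both global patterns in a while loop (the unoptimized forms only). The sub names and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;

# Plain pattern with /g: the trailing context can eat a nearby keyword.
sub snippets_plain {
    my ($text, $keyword) = @_;
    my @found;
    while ($text =~ m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

# Negative look-ahead: the trailing context stops before the next keyword.
sub snippets_lookahead {
    my ($text, $keyword) = @_;
    my @found;
    while ($text =~ m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

# Made-up text with two occurrences only two words apart.
my $sample = 'w1 w2 w3 w4 w5 w6 KEY a b KEY c d e f g h';
print "plain:     $_\n" for snippets_plain($sample, 'KEY');
print "lookahead: $_\n" for snippets_lookahead($sample, 'KEY');
```

On this sample the plain pattern yields one snippet that contains both occurrences, while the look-ahead pattern yields two snippets, the second of which has no leading context because the words before it were already consumed.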


Oh, and you probably want case-insensitive matching, and should
probably allow punctuation around $keyword, something like:

m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i

or (optimized):

m/((?:\w+\W+){5}|^\W*(?:\w+\W+){0,4})(\Q$keyword\E)((?:\W+\w+){0,5})/i

or for global matching:

m/((?:\w+\W+){0,5}?)(\Q$keyword\E)((?:\W+(?!\Q$keyword\E)\w+){0,5})/ig
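
As an illustration of the \w/\W variant, here is a small sketch with punctuation right next to the keyword and a different letter case. The helper name word_snippet and the sample sentence are made up.

```perl
use strict;
use warnings;

# Hypothetical helper using the case-insensitive \w/\W pattern.
sub word_snippet {
    my ($file_content, $keyword) = @_;
    if ($file_content =~ m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i) {
        return ($1, $2, $3);
    }
    return;
}

# The keyword is quoted in the text and capitalised differently.
my $sample = 'Alpha, beta; gamma "Perl" rocks! Really, it does.';
my ($before, $match, $after) = word_snippet($sample, 'perl');
print $before, '[', $match, ']', $after, "\n";
# prints: Alpha, beta; gamma "[Perl]" rocks! Really, it does
```

Since \W+ swallows any run of non-word characters, the quotes and commas between words no longer break the context capture.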
 
Brian McCauley

Ilmari Karonen said:
Note that if $keyword is supposed to be a plain string rather than a
regex, you'll need to escape metacharacters in it. An easy way to do
this is:

m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/

Also note that, if you use global matching to extract multiple
snippets from the text, the results can be unexpected if there are
multiple occurrences of $keyword near each other. In particular, if
there are fewer than 5 words between two occurrences, the second one
will be swallowed in the 5 words matched after the first one.

The easiest way to fix that is to use negative look-ahead:

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

Er, no, it would be easier and more idiomatic to put the third capture
inside a lookahead:

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g


 
Ben Morrow

Kishore said:
What is the (?:...) for? I don't believe I saw this in the books I have
been referring to so far.

The construction is (?: ... ), to be contrasted with ( ... ); it modifies
the parens so that they just group without capturing. See perldoc
perlre or perldoc perlretut.

[As a side note, I would *always* use /x on a regex containing (?:),
because otherwise things get lost:

/( (?: \S+\s+ ){0,5} ) ($keyword) ( (?: \s+\S+ ){0,5} )/x

]
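
A tiny sketch of the grouping-versus-capturing difference described above, using a made-up three-word string. It only counts what a list-context match returns.

```perl
use strict;
use warnings;

my $sample = 'foo bar baz';

# Plain parentheses both group and capture: three capture groups here.
# The inner group keeps only its last repetition.
my @capturing = $sample =~ /((\w+\s+){2})(\w+)/;

# (?: ... ) only groups, so just two capture groups remain.
my @grouping = $sample =~ /((?:\w+\s+){2})(\w+)/;

print scalar(@capturing), ' captures: ', join('|', @capturing), "\n";
print scalar(@grouping),  ' captures: ', join('|', @grouping),  "\n";

# The same non-capturing pattern spread out with /x for readability.
my @spaced = $sample =~ / ( (?: \w+ \s+ ){2} ) ( \w+ ) /x;
```

With (?: ... ) the numbering of $1, $2 and so on stays tied to the pieces you actually want back.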

Ben
 
Ilmari Karonen

Brian McCauley said:
Er, no, it would be easier and more idiomatic to put the third capture
inside a lookahead:

m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g

Those two don't do the same thing. With your version the snippets may
overlap, with mine they can't. Deciding which solution is better is
really up to the OP.
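
To see the difference side by side, here is a sketch that runs both patterns over the same made-up text. The helper collect, the variable names and the sample string are assumptions for the example; the two patterns are the ones quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: collects "$1$2$3" for every global match of $re.
sub collect {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

my $keyword = 'KEY';

# Trailing context is consumed, but stops before the next keyword.
my $consuming = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

# Trailing context is only looked at, so the next match may reuse it.
my $peeking = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/;

my $sample = 'w1 w2 w3 w4 w5 w6 KEY a b KEY c d e f g h';
print "consuming: $_\n" for collect($sample, $consuming);
print "peeking:   $_\n" for collect($sample, $peeking);
```

On this sample the look-ahead capture gives every occurrence its full context at the cost of repeating "a b" in both snippets, while the consuming version never repeats a word.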
 
Kishore

Ilmari Karonen said:
Oh, and you probably want case-insensitive matching, and should
probably allow punctuation around $keyword, something like:

m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i

I was having problems with punctuation.
This code solved the problem.
Thanks very much.
 
