James said:
Torsten Bronger wrote:
I need some help with finding matches in a string that has some
characters which are marked as escaped (in a separate list of
indices). Escaped means that they must not be part of any match.
[...]
You should probably provide examples of what you are trying to do
or you will likely get a lot of irrelevant answers.
Example string: u"Hollo", escaped positions: [4]. Thus, the second
"o" is escaped and must not be found by the regexp searches.
Instead of re.search, I call the function guarded_search(pattern,
text, offset) which takes care of escaped characters. Thus, while
will find the second "o",
guarded_search("o$", string, 0)
Huh? Did you mean 4 instead of zero?
Quite apart from the confusing use of "escape", your requirements are
still as clear as mud. Try writing up docs for your "guarded_search"
function. Supply test cases showing what you expect to match and what
you don't expect to match. Is "offset" the offset in the text? If so,
don't you really want a set of "forbidden" offsets, not just one?
But how to program "guarded_search"?
Actually, it is about changing the semantics of the regexp syntax:
"." doesn't mean anymore "any character except newline" but "any
character except newline and characters marked as escaped".
Make up your mind whether you are "escaping" characters [likely to be
interpreted by many people as position-independent] or "escaping"
positions within the text.
And so
on, for all syntax elements of regular expressions. Escaped
characters must spoil any match, however, the regexp machine should
continue to search for other matches.
Whatever your exact requirement, it would seem unlikely to be so
wildly popularly demanded as to warrant inclusion in the "regexp
machine". You would have to write your own wrapper, something like the
following totally-untested example of one possible implementation of
one possible guess at what you mean:
import re

def guarded_search(pattern, text, forbidden_offsets, overlap=False):
    regex = re.compile(pattern)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if not m:
            return
        start, end = m.span()
        # Yield the match only if it covers no forbidden offset.
        for bad_pos in forbidden_offsets:
            if start <= bad_pos < end:
                break
        else:
            yield m
        if overlap:
            pos = start + 1
        else:
            # max() avoids looping forever on a zero-width match.
            pos = max(end, start + 1)
8<-------
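To pin the requirement down, the "Hollo" example from earlier in the thread can be run through a filter of this kind. What follows is only a sketch, not the function above: guarded_finditer is a hypothetical name, it relies on re.finditer (so matches never overlap), and it assumes that "escaped" means "no match may cover that offset".

```python
import re

def guarded_finditer(pattern, text, forbidden_offsets):
    """Yield matches whose span covers none of the forbidden offsets."""
    forbidden = set(forbidden_offsets)
    for m in re.finditer(pattern, text):
        # Reject the match if any position inside its span is forbidden.
        if not any(i in forbidden for i in range(m.start(), m.end())):
            yield m

# Offset 4 (the second "o") is forbidden.
print([m.span() for m in guarded_finditer(u"o", u"Hollo", [4])])
# [(1, 2)]
print([m.span() for m in guarded_finditer(u"o$", u"Hollo", [4])])
# []
```

Note that this filters whole matches after the fact. It does not change what "." or any other syntax element matches, so a long match that touches a forbidden offset is dropped entirely rather than shortened.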
HTH,
John
Here are two pyparsing-based routines, guardedSearch and
guardedSearchByColumn. The first uses a pyparsing parse action to
reject matches at a given string location, and returns a list of
tuples containing the string location and matched text. The second
uses an enhanced version of guardedSearch that uses the pyparsing
built-ins col and lineno to filter matches by column instead of by raw
string location, and returns a list of tuples of line and column of
the match location, and the matching text. (Note that string
locations are zero-based, while line and column numbers are 1-based.)
-- Paul
from pyparsing import Regex, ParseException, col, lineno

def guardedSearch(pattern, text, forbidden_offsets):
    def offsetValidator(strng, locn, tokens):
        # Raising ParseException in a parse action rejects the match.
        if locn in forbidden_offsets:
            raise ParseException(strng, locn,
                                 "can't match at offset %d" % locn)
    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(tokStart, toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]

print(guardedSearch(u"o", u"Hollo how are you", [4,]))

def guardedSearchByColumn(pattern, text, forbidden_columns):
    def offsetValidator(strng, locn, tokens):
        # col() is 1-based, so forbidden_columns holds 1-based columns.
        if col(locn, strng) in forbidden_columns:
            raise ParseException(strng, locn,
                                 "can't match at offset %d" % locn)
    regex = Regex(pattern).setParseAction(offsetValidator)
    return [(lineno(tokStart, text), col(tokStart, text), toks[0])
            for toks, tokStart, tokEnd in regex.scanString(text)]
text = """\
alksjdflasjf;sa
a;sljflsjlaj
;asjflasfja;sf
aslfj;asfj;dsf
aslf;lajdf;ajsf
aslfj;afsj;sd
"""
print(guardedSearchByColumn(";", text, [1,6,11,]))
Prints:
[(1, 'o'), (7, 'o'), (15, 'o')]
[(1, 13, ';'), (2, 2, ';'), (3, 12, ';'), (5, 5, ';')]
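For comparison, the column-based filter can also be sketched with the standard library alone. This is not part of the pyparsing code above: line_and_col and guarded_search_by_column are hypothetical names, and the sketch assumes, like guardedSearchByColumn, that only the column where a match starts is checked.

```python
import re

def line_and_col(text, offset):
    """Return the 1-based (line, column) of a zero-based offset."""
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 on the first line, which makes the column 1-based.
    col = offset - text.rfind("\n", 0, offset)
    return line, col

def guarded_search_by_column(pattern, text, forbidden_columns):
    """List (line, column, text) for matches starting in an allowed column."""
    forbidden = set(forbidden_columns)
    found = []
    for m in re.finditer(pattern, text):
        line, column = line_and_col(text, m.start())
        if column not in forbidden:
            found.append((line, column, m.group()))
    return found

sample = "ab;cd\n;ef;\n"
print(guarded_search_by_column(";", sample, [1]))
# [(1, 3, ';'), (2, 4, ';')]
```

Unlike the pyparsing version, plain re does not skip leading whitespace before attempting a match, so the two may differ for patterns where that matters.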