matching a sentence, greedy up!

Christian Buck · Aug 10, 2003

Hi,

I'm writing a regexp that matches complete sentences in a German text
and correctly ignores abbreviations. Here is a very simplified version of
it; as soon as it works I could post the complete regexp if anyone is
interested (actually 11 kB):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you see, I use [] for charsets because I don't want to depend on
locales, and speed doesn't matter. (I removed the German chars in the above
example.) I also allow - and _ within a sentence.

Ok, this is what I think it should do:
[A-Z] - start with an uppercase char.
(?: - don't make a group
[^\.\?\!]+ - eat everything that does not look like an end
| - OR
[^a-zA-Z0-9\-_] - accept a non-word character
(?: - followed by ...
[a-zA-Z0-9\-_]\. - a char and a dot like 'i.', '1.' (doesn't work!!!)
| - OR
\d+\. - a number and a dot
| - OR
a\.[\s\-]?A\. - some common abbreviations (only one here)
)){3,} - some times, at least 3
[\.\?\!]+ - this is the end, and should also match '...'
(?!\s[a-z]) - not followed by whitespace and a lowercase char
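
The breakdown above can also be kept next to the pattern itself. The following is only a sketch, assuming the single-line regexp from the post is the intended one: it uses re.VERBOSE so that the comments live inside the pattern. It does not change the matching behaviour, so it still finds the short match described in the question.

```python
import re

# Same pattern as in the post, written with re.VERBOSE so each part
# carries its own comment. Whitespace and '#' comments outside of
# character classes are ignored by the regex compiler in this mode.
re_satz = re.compile(r"""
    [A-Z]                        # start with an uppercase char
    (?:                          # non-capturing group
        [^.?!]+                  # anything that does not look like an end
      |                          # OR
        [^a-zA-Z0-9\-_]          # a non-word character
        (?:                      # followed by ...
            [a-zA-Z0-9\-_]\.     # a char and a dot, like 'i.' or '1.'
          | \d+\.                # a number and a dot
          | a\.[\s\-]?A\.        # one sample abbreviation
        )
    ){3,}                        # repeated at least 3 times
    [.?!]+                       # the end; also matches '...'
    (?!\s[a-z])                  # not followed by whitespace + lowercase
""", re.VERBOSE)

s = 'My text may i. E. look like this: This is the end.'
sentences = re_satz.findall(s)
print(sentences)
```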

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
                     r'(?!\s[a-z])')
mo = re_satz.search(s)
if mo:
    print("found:")
    # No capturing groups, so findall() returns the full matches.
    sentences = re_satz.findall(s)
    for sentence in sentences:
        print("Sentence: " + sentence)
else:
    print("not found :-(")

- snip -

Output:
found:
Sentence: My text may i.
Sentence: This is the end.

Why isn't the above regexp greedier, so that it matches the whole sentence?

thx in advance

Christian

Helmut Jarausch · Aug 11, 2003

Christian said:
Why isn't the above regexp greedier, so that it matches the whole sentence?

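First, you don't need to escape '.', '?' or '!' within a character group []. Only '\', ']', a leading '^' and a '-' between two characters are special there, so the '\-' in your classes is the one escape that matters.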
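
A quick way to check which escapes matter inside [] is to compare the classes directly. This is a small sketch with made-up test strings, not taken from the full regexp: the punctuation classes behave the same with or without backslashes, while an unescaped '-' between two characters turns into a range.

```python
import re

text = 'a.b?c!'

# '.', '?' and '!' are literal inside [], escaped or not.
escaped = re.findall(r'[\.\?\!]', text)
plain = re.findall(r'[.?!]', text)
print(escaped, plain)

# '-' between two characters forms a range, so here the escape matters:
# [9\-_] is the three characters '9', '-', '_', while [9-_] is the
# range from '9' to '_', which includes ':' and the uppercase letters.
with_escape = re.fullmatch(r'[9\-_]', ':')
without_escape = re.fullmatch(r'[9-_]', ':')
print(with_escape is None, without_escape is not None)
```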

The very first part, r'[A-Z](?:[^\.\?\!]+, cannot be greedier since
you exclude the '.'. So it matches up to, but not including, the first dot.
Now, as far as I can see, nothing else fits. So the output is just what
I expected. How do you think you can differentiate between the end of a
sentence and (the first part of) an abbreviation?

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
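
The behaviour discussed in this thread can be reproduced with a short script. It is a sketch under two assumptions: the single-line regexp from the first post is the intended pattern, and the names original and reordered are only illustrative. Since [^.?!]+ cannot cross a dot, the engine backtracks and splits "y text may i" into three pieces to satisfy {3,}, then takes the dot. The lookahead lets the match end there because the next word, "E", is uppercase. A backtracking engine returns the first path that succeeds, not the longest one, so trying the abbreviation alternative first changes the result for this sample. That is not a general answer to the question of how to tell an abbreviation from a sentence end.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# Alternatives in the posted order: free text first, abbreviation second.
original = re.compile(
    r'[A-Z](?:[^.?!]+|'
    r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}'
    r'[.?!]+(?!\s[a-z])')

# Same pieces, but the abbreviation alternative is tried first.
reordered = re.compile(
    r'[A-Z](?:[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)|'
    r'[^.?!]+){3,}'
    r'[.?!]+(?!\s[a-z])')

first = original.search(s)
print(repr(first.group()))

# What follows the first match: whitespace plus an uppercase letter,
# which the negative lookahead (?!\s[a-z]) does not reject.
following = s[first.end():first.end() + 2]
print(repr(following))

print(original.findall(s))
print(reordered.findall(s))
```

With the reordered pattern, " i." and " E." are consumed as abbreviations before the free-text alternative gets a chance, so the match runs to the final dot. A sentence that really ends in a single letter and a dot would still be misread by this variant.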
