Using Groups inside Braces with Regular Expressions

Chris

I'm trying to delimit sentences in a block of text by defining the
end-of-sentence marker as a period followed by a space followed by an
uppercase letter or end-of-string.

I'd imagine the regex for that would look something like:
[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)

However, Python keeps giving me an "unbalanced parenthesis" error for
the [^] part. If this isn't valid regex syntax, how else would I match
a block of text that doesn't match the delimiter pattern?

Thanks,
Chris
 
MRAB

What is the [^(?:[A-Z]|$)] part meant to be doing? Is it meant to be
matching everything up to the end of the sentence?

[...] is a character class, so Python is parsing the character class
as:

[^(?:[A-Z]|$)]
^^^^^^^^^^
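
A minimal sketch of what the parser sees, for anyone who wants to reproduce it. The variable names are only illustrative; the point is that the class closes at the first "]", which leaves the later ")" without a partner.

```python
import re

pattern = r'[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)'

# Compiling the original pattern fails: the character class ends at the
# first ']', so the ')' that follows has no matching '('.
try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    error_message = exc.msg
    print("re.error:", exc)

# The part Python actually treats as the class compiles on its own.
# It matches any single character that is NOT '(', '?', ':', '[' or A-Z.
char_class = re.compile(r'[^(?:[A-Z]')
matches_lower = bool(char_class.fullmatch('a'))   # not excluded
matches_upper = bool(char_class.fullmatch('B'))   # excluded by A-Z
matches_paren = bool(char_class.fullmatch('('))   # excluded literal '('
print(matches_lower, matches_upper, matches_paren)
```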
 
Chris

It was meant to include everything except the end-of-sentence pattern.
However, I just realized that I can simply replace it with ".*?"
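
For reference, a rough sketch of that substitution. The capturing group and the re.DOTALL flag are assumptions added here so the sentence text can be pulled out and may span line breaks; they are not part of the pattern as posted.

```python
import re

# The posted pattern with the invalid [^...] part replaced by a lazy ".*?".
# Assumptions: the group captures the sentence text, and re.DOTALL lets
# "." cross newlines.
sentence = re.compile(r'(.*?)\.\s+(?=[A-Z]|$)', re.DOTALL)

with_trailing_space = sentence.findall('Hello there. How are you. ')
without_trailing_space = sentence.findall('Hello there. How are you.')

print(with_trailing_space)
print(without_trailing_space)
```

Because the pattern still demands \s+ after the period, the last sentence is only found when the text ends in whitespace.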
 
John Machin

Misleading subject.

[] brackets or "square brackets"
{} braces or "curly brackets"
() parentheses or "round brackets"

> I'm trying to delimit sentences in a block of text by defining the
> end-of-sentence marker as a period followed by a space followed by an
> uppercase letter or end-of-string.

... which has at least two problems:

(1) You are insisting on at least one space between the period and the
end-of-string (this can be overcome, see later).
(2) Periods are often dropped in after abbreviations and contractions
e.g. "Mr. Geo. Smith". You will get three "sentences" out of that.

> I'd imagine the regex for that would look something like:
> [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
>
> However, Python keeps giving me an "unbalanced parenthesis" error for
> the [^] part.

It's nice to know that Python is consistent with its error messages.

> If this isn't valid regex syntax,

If? It definitely isn't valid syntax. Brackets delimit a character
class. Either you are trying to cram a somewhat complicated expression
into a character class, or you should be using parentheses. Either way,
it's a bit hard to determine what you really meant that part of the
pattern to achieve.
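
A small illustration of the difference, using made-up test strings: inside brackets, "|" and "$" are ordinary characters, whereas inside parentheses they mean alternation and end-of-string.

```python
import re

# Inside a character class, '|' and '$' are literals: this matches any one
# of an uppercase letter, a pipe, or a dollar sign.
in_class = re.findall(r'[A-Z|$]', 'a|B$c')

# Inside a (non-capturing) group, '|' is alternation and '$' is an anchor:
# this matches 'x' followed by an uppercase letter, or 'x' at end-of-string.
in_group = re.findall(r'x(?:[A-Z]|$)', 'xA xb x')

print(in_class)
print(in_group)
```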

> how else would I match
> a block of text that doesn't match the delimiter pattern?

Start from the top down:
A sentence is:
    anything (with some qualifications)
    followed by (but not including):
        a period
        followed by
        either
            1 or more whitespace characters, then a capital letter
        or
            0 or more whitespace characters, then end-of-string

So something like this might do the trick:
>>> import re
>>> sep = re.compile(r'\.(?:\s+(?=[A-Z])|\s*(?=\Z))')
>>> sep.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4', '']
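
Note that "Mr" still comes out as its own "sentence", which is problem (2) above. One possible workaround, offered only as a sketch: keep the same separator (written here with re.VERBOSE so each piece lines up with the breakdown) and skip any period that directly follows a known abbreviation. The ABBREVIATIONS set and the split_sentences helper are hypothetical names invented for this example, and a real list would need to be much longer.

```python
import re

# Same separator as above, spelled out with re.VERBOSE.
SEP = re.compile(r"""
    \.                 # a period
    (?:
        \s+ (?=[A-Z])  # 1 or more whitespace chars, then a capital (not consumed)
      |
        \s* (?=\Z)     # or 0 or more whitespace chars, then end-of-string
    )
""", re.VERBOSE)

# Hypothetical, deliberately tiny abbreviation list (an assumption).
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "Geo"}


def split_sentences(text):
    """Split on SEP, ignoring periods that follow a known abbreviation."""
    sentences = []
    start = 0
    for match in SEP.finditer(text):
        words = text[start:match.start()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # treat this period as part of the abbreviation
        sentences.append(text[start:match.start()])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # leftover text with no final separator
    return sentences


plain = SEP.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4. ')
aware = split_sentences('Hello. Mr. Geo. Smith is here. ')
print(plain)
print(aware)
```

Unlike sep.split, this helper does not return a trailing empty string when the text ends with a separator.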
 
