python/regex question... hope someone can help

charonzen · Dec 9, 2007

I have a list of strings. These strings are previously selected
bigrams with underscores between them ('and_the', 'nothing_given', and
so on). I need to write a regex that will read another text string
that this list was derived from and replace selections in this text
string with those from my list. So in my text string, '... and the...
' becomes ' ... and_the...'. I can't figure out how to manipulate

re.sub(r'([a-z]*) ([a-z]*)', r'(????)', textstring)

Any suggestions?

Thank you if you can help!

John Machin · Dec 9, 2007

The usual suggestion is: Don't bother with regexes when simple string
methods will do the job.

>>> def ch_replace(alist, text):
...     for bigram in alist:
...         original = bigram.replace('_', ' ')
...         text = text.replace(original, bigram)
...     return text
...
>>> print(ch_replace(
...     ['quick_brown', 'lazy_dogs', 'brown_fox'],
...     'The quick brown fox jumped over the lazy dogs.'
...     ))
The quick_brown_fox jumped over the lazy_dogs.
>>> print(ch_replace(['red_herring'], 'He prepared herring fillets.'))
He prepared_herring fillets.

Another suggestion is to ensure that the job specification is not
overly simplified. How did you parse the text into "words" in the
prior exercise that produced the list of bigrams? Won't you need to
use the same parsing method in the current exercise of tagging the
bigrams with an underscore?

Cheers,
John

John Machin · Dec 9, 2007

The following *may* come close to doing what your revised spec
requires:

import re
def ch_replace2(alist, text):
    for bigram in alist:
        pattern = r'\b' + bigram.replace('_', ' ') + r'\b'
        text = re.sub(pattern, bigram, text)
    return text

Cheers,
John
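
For readers following along, here is a minimal sketch of how the word-boundary version behaves on the two examples from the earlier reply. The name tag_bigrams is hypothetical, and the re.escape call and the function replacement are additions not present in the posted code; they assume the bigram tokens might contain characters that are special to the re module.

```python
import re

def tag_bigrams(bigrams, text):
    """Replace 'x y' with 'x_y' for each listed bigram, whole words only."""
    for bigram in bigrams:
        # re.escape guards against tokens containing regex metacharacters
        # (an assumption; plain alphabetic tokens would not need it).
        pattern = r'\b' + re.escape(bigram.replace('_', ' ')) + r'\b'
        # A function replacement inserts the bigram literally, without
        # backslash-escape processing of the replacement string.
        text = re.sub(pattern, lambda match, b=bigram: b, text)
    return text

print(tag_bigrams(['red_herring'], 'He prepared herring fillets.'))
print(tag_bigrams(['quick_brown', 'lazy_dogs', 'brown_fox'],
                  'The quick brown fox jumped over the lazy dogs.'))
```

The first call leaves 'prepared herring' alone, because there is no word boundary inside 'prepared'. In the second call 'brown fox' is no longer tagged once 'quick_brown' has been formed: the underscore counts as a word character, so \b does not match before 'brown'. The result therefore depends on the order of the list.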

charonzen · Dec 9, 2007

Thank you John, that definitely puts things in perspective! I'm very
new to both Python and text parsing, and I often feel that I can't see
the forest for the trees. If you're asking, I'm working on a project
that utilizes Church's mutual information score. I tokenize my text,
split it into a list, derive some unigram and bigram dictionaries, and
then calculate a pmi dictionary based on x,y from the bigrams and
unigrams. The bigrams that pass my threshold then get put into my
list of x_y strings, and you know the rest. By modifying the original
text file, I can view 'x_y', z pairs as x,y and iterate it until I
have some collocations that are worth playing with. So I think that
covers the question of using the same parsing method. I'm sure there
are more pythonic ways to do it, but I'm on deadline.

Thanks again!

Brandon
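
The pipeline described above (tokenize, count unigrams and bigrams, score each pair, keep those above a threshold) might look roughly like the sketch below. This is not the poster's actual code: the function name pmi_bigrams, the base-2 logarithm, and the way probabilities are estimated from raw counts are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, threshold):
    """Return 'x_y' strings for adjacent pairs whose PMI meets the threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    selected = []
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )
        score = math.log2(p_xy / (p_x * p_y))
        if score >= threshold:
            selected.append(x + '_' + y)
    return selected

tokens = 'a b a b c d'.split()
print(pmi_bigrams(tokens, 2.0))
```

Pairs built from rare words tend to get high scores from very few observations, so a minimum-count filter is often combined with the threshold.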

Gabriel Genellina · Dec 10, 2007

Looks like you should work with the list of tokens, collapsing consecutive
elements, not with the original text. Should be easier, and faster because
you don't regenerate the text and tokenize it again and again.
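
One way to read that suggestion, as a sketch only: walk the token list once and merge a token with its successor whenever the joined form is in the selected list. The name collapse_bigrams and the greedy left-to-right policy are assumptions, not something specified in the thread.

```python
def collapse_bigrams(tokens, bigrams):
    """Merge adjacent tokens into 'x_y' when that string is a selected bigram."""
    wanted = set(bigrams)  # set membership is faster than scanning a list
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + '_' + tokens[i + 1]
            if joined in wanted:
                out.append(joined)
                i += 2  # both tokens are consumed by the merged item
                continue
        out.append(tokens[i])
        i += 1
    return out

tokens = 'the quick brown fox jumped over the lazy dogs'.split()
print(collapse_bigrams(tokens, ['quick_brown', 'lazy_dogs', 'brown_fox']))
```

Because matching happens on whole tokens, the 'prepared herring' problem cannot occur, and the collapsed list can be fed straight back into the next counting pass without re-tokenizing.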
