Need a specific sort of string modification. Can someone help?

N

Nick Mellor

Hi Sia,

Find a multi-digit method in this version:

from string import maketrans
from itertools import takewhile

def is_digit(s): return s.isdigit()

class redux:

def __init__(self):
intab = '+-'
outtab = ' '
self.trantab = maketrans(intab, outtab)


def reduce_plusminus(self, s):
list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
for r
in s.translate(self.trantab).split()]
return ''.join(list_form)

def reduce_plusminus_multi_digit(self, s):
spl = s.translate(self.trantab).split()
digits = [list(takewhile(is_digit, r))
for r
in spl]
numbers = [int(''.join(r)) if r else 0
for r
in digits]
skips = [len(dig) + num for dig, num in zip(digits, numbers)]
return ''.join([s[r:] for r, s in zip(skips, spl)])

if __name__ == "__main__":
p = redux()
print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")
print 'multi-digit...'
print p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG")
print p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")


HTH,

Nick
 
N

Nick Mellor

Note that the multi-line version above tolerates missing digits: if the number is missing after the '+/-' it doesn't skip any letters.

Brief explanation of the multi-digit version:

+/- are converted to spaces and used to split the string into sections. Thesplit process effectively swallows the +/- characters.

The complication of multi-digits is that you need to skip the (possibly multiple) digits, which adds another stage to the calculation. In:

+3ACG. -> .

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In:

-11ACGACGACGACG. -> G.

You skip 2 + 11 characters, 2 digits in "12" and 11 letters following. And incidentally in:

+ACG. -> ACG.

there's no digit, so you skip 0 digits + 0 letters.

Having split on +/- using .translate() and .split() I use takewhile to separate the zero or more digits from the following letters. If takewhile doesn't find any digits at the start of the sequence, it returns the empty list []. ''.join(list) swallows empty lists so dropwhile and ''.join() cover theno-digit case between them. If a lack of digits is a data error then it would be easy to test for-- just look for an empty list in 'digits'.

I was pleasantly surprised to find that using list comprehensions, zip, join (all highly optimised in Python) and several intermediate lists still works at a fairly decent speed, despite using more stages to handle multi-digits. But it is about 4x slower than the less flexible 1-digit version on my hardware (about 25,000 per second.)

Nick

Hi Sia,



Find a multi-digit method in this version:



from string import maketrans

from itertools import takewhile



def is_digit(s): return s.isdigit()



class redux:



def __init__(self):

intab = '+-'

outtab = ' '

self.trantab = maketrans(intab, outtab)





def reduce_plusminus(self, s):

list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r

for r

in s.translate(self.trantab).split()]

return ''.join(list_form)



def reduce_plusminus_multi_digit(self, s):

spl = s.translate(self.trantab).split()

digits = [list(takewhile(is_digit, r))

for r

in spl]

numbers = [int(''.join(r)) if r else 0

for r

in digits]

skips = [len(dig) + num for dig, num in zip(digits, numbers)]

return ''.join([s[r:] for r, s in zip(skips, spl)])



if __name__ == "__main__":

p = redux()

print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")

print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")

print 'multi-digit...'

print p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG")

print p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")





HTH,



Nick



I have strings such as:











The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:











How can I do that?

Thanks.
 
N

Nick Mellor

Oops!

"You skip 2 + 11 characters, 2 digits in "12" and 11 letters following. And incidentally in: "

should read:

"You skip 2 + 11 characters, 2 digits in "11" and 11 letters following. And incidentally in: "

N
 
N

Nick Mellor

Hi Sia,

Thanks for the problem! I hope you find these examples understandable.

Below, find an inflexible but fairly fast single-digit method and a slower (but still respectable) multi-digit method that copes with entirely absent digits after +/- and multi-digit skips such as 12 or 37 or 186 following letters rather than just 3 or 5 or 9. Not knowing your problem domain I'm notsure if these larger skips are likely or even possible, nor whether it's likely that there will be no digit.

Neither method flags the edge case where there are not enough letters to skip before the next +/-. They just swallow that chunk of the string entirelywithout error. Is that a problem?

Python 2.7 or later.

from string import maketrans
from itertools import takewhile

# function used by takewhile to detect digits
def is_digit(s): return s.isdigit()

class redux:

def __init__(self):
intab = '+-'
outtab = ' '
self.trantab = maketrans(intab, outtab)


# simple-minded, fast, 1-digit algorithm
def reduce_plusminus(self, s):
list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
for r
in s.translate(self.trantab).split()]
return ''.join(list_form)

# multi-digit algorithm
def reduce_plusminus_multi_digit(self, s):
chunks = s.translate(self.trantab).split()
digits = [list(takewhile(is_digit, r))
for r
in chunks]
# zero case ('else 0') is for missing digit(s)
numbers = [int(''.join(r)) if r else 0
for r
in digits]
# how far to skip (number and skipped letters)
skips = [len(dig) + num
for dig, num
in zip(digits, numbers)]
return ''.join([sequence[skipover:]
for skipover, sequence
in zip(skips, chunks)])

if __name__ == "__main__":
p = redux()
print (p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG"))
print (p.reduce_plusminus("tA.-5AG.-2AG,-2ag"))
print ('multi-digit...')
print (p.reduce_plusminus_multi_digit(".+3ACG.+5CAACG.+3ACG.+3ACG"))
print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG"))
print (p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+ACG.+3ACG"))

for n in range(100000): p.reduce_plusminus_multi_digit(".+12ACGACGACGACG.+5CAACG.+3ACG.+3ACG")


Note that the multi-line version above tolerates missing digits: if the number is missing after the '+/-' it doesn't skip any letters.

Explanation:

String slicing is good in Python, but list comprehensions are often the most powerful and efficient way to transform data. If you've come from anotherlanguage or you're a newb to programming, it's well worth getting grips with list comprehensions [a for b in lst].

First, +/- are converted to spaces. Splitting the string into a list, breaking at spaces, effectively swallows the +/- characters and leaves us with chunks of string from which to drop the specified number of letters. Having dropped the letters, we convert the transformed list back into a string.

The single-digit version just assumes that the number is always present andalways single-digit. If you allow multi-digits be ready to skip those multiple digits, as well as skipping letters, adding another stage to the calculation. E.g. in:

+3ACG. -> .

you skip 1 + 3 characters, 1 for the digit, 3 for the following letters as specified by the digit 3. In:

-11ACGACGACGACG. -> G.

You skip 2 + 11 characters, 2 characters in "11" and 11 characters following. And incidentally in:

+ACG. -> ACG.

there's no digit, so you skip 0 digits + 0 letters. If a lack of digits is a data error then it would be easy to test for-- just look for an empty list in the 'digits' list.

For each chunk (stuff between +/-) I use takewhile to separate the zero or more initial digits from the following letters. itertools.takewhile is a nice tool for matching stuff until the first time a condition ceases to hold.In this case I'm matching digits up until the first non-digit.

If takewhile doesn't find any digits at the start of the sequence, it returns the empty list []. ''.join(list) swallows empty lists so takewhile and ''.join() do behind-the-scenes leg-work for me in detecting the case of absent digit(s).

For my money speed isn't bad for these two methods (again, I don't know your domain.) On my (ancient) hardware the 1-digit version does over 100,000/sfor your longer string, the multi-digit about 35,000/s.

HTH,

Nick
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,142
Messages
2,570,819
Members
47,367
Latest member
mahdiharooniir

Latest Threads

Top