efficiently splitting up strings based on substrings

P

per

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings.
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5

Then my_string might be:

my_string = a + b + c

i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string. I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.

to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x). But I don't know how to partition x into
blocks depending
on the substrings. What I want to get out in this case is: "00",
"111". If x were "001111122",
I'd want to get out "00","11111", "22".

is there an easy way to do this? i can't simply split x on a, b, or c
because these might
not be contained in x. I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.

i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem. I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.

any help on this would be greatly appreciated. thanks.
 
R

Rhodri James

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings.
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5

Then my_string might be:

my_string = a + b + c

i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string. I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.

to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x). But I don't know how to partition x into
blocks depending
on the substrings. What I want to get out in this case is: "00",
"111". If x were "001111122",
I'd want to get out "00","11111", "22".

is there an easy way to do this? i can't simply split x on a, b, or c
because these might
not be contained in x. I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.

i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem. I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.

I'm not sure I understand your question exactly. You seem to imply
that the order of the substrings of x is consistent. If that's the
case, this ought to help:
('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's
no great problem.
 
P

per

I'm not sure I understand your question exactly.  You seem to imply
that the order of the substrings of x is consistent.  If that's the
case, this ought to help:


('00', '11111', '22')>>> y = "00111"

('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's
no great problem.

The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.

a = 1030405, b = 1babcf, c = fUUIUP

then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.

your solution works if that weren't a possibility, so what you wrote
is definitely the kind of solution i am looking for. i am just not
sure how to solve it in the general case where the substrings might be
similar to each other (but not similar enough that you can't tell
where the substring came from).
 
P

per

Right.  This looks approximately nothing like what I thought your
problem was.  Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?

I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.
 
R

Rhodri James

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

I see, I think. Let me explain it back to you, just to be sure.

You have a string x, and three component strings a, b and c. x is
a substring of the concatenation of a, b and c (i.e. a+b+c). You
want to find out how x overlaps a, b and c.

Assuming I've understood this right, you're overthinking the problem.
All you need to do is find the start of x in a+b+c, then do some
calculations based on the string lengths and slice appropriately.
I'd scribble some example code, but it's nearly 1am and I'd be sure
to commit fence-post errors at this time of night.
 
7

7stud

Right.  This looks approximately nothing like what I thought your
problem was.  Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?
I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.

it's exactly the same problem, except there are no constraints on the
strings.  so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c.  if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.


a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU
 
7

7stud

it's exactly the same problem, except there are no constraints on the
strings.  so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.
so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP
given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c.  if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.
i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

....or maybe this is easier to follow:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

--output:--
4051babcffUUIUP
4051babcffU
----------

b_start = new_abc.find(b)
c_start = new_abc.find(c)

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU
 
7

7stud

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings..
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5
Then my_string might be:
my_string = a + b + c
i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string.  I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.
to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x).  But I don't know how to partition x into
blocks depending
on the substrings.  What I want to get out in this case is: "00",
"111".  If x were "001111122",
I'd want to get out "00","11111", "22".
is there an easy way to do this?  i can't simply split x on a, b, or c
because these might
not be contained in x.  I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.
i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem.  I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.
I'm not sure I understand your question exactly.  You seem to imply
that the order of the substrings of x is consistent.  If that's the
case, this ought to help:
import re
x = "001111122"
m = re.match(r"(0*)(1*)(2*)", x)
m.groups()
('00', '11111', '22')>>> y = "00111"
m = re.match(r"(0*)(1*)(2*)", y)
m.groups()
('00', '111', '')
You'll have to filter out the empty groups for yourself, but that's
no great problem.
The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.
a = 1030405, b = 1babcf, c = fUUIUP
then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.
Right.  This looks approximately nothing like what I thought your
problem was.  Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?
I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.
--
Rhodri James *-* Wildebeest Herder to the Masses
it's exactly the same problem, except there are no constraints on the
strings.  so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.
so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP
given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c.  if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.
i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.
a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"
data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start
print data[:b_start]
print data[b_start:c_start]
print data[c_start:]
--output:--
405
1babcf
fU

...or maybe this is easier to follow:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

--output:--
4051babcffUUIUP
4051babcffU
----------

b_start = new_abc.find(b)
c_start = new_abc.find(c)

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

Nope. My solutions have problems with:

data = "cffU"

To handle that data, it gets messier:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "cffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10


b_start = new_abc.find(b)
if b_start == -1:
b_start = 0

c_start = new_abc.find(c)
if c_start == -1:
c_start = 0

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]


If data is not a substring of abc, then that last line will select the
whole data string.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,699
Latest member
AnneRosen

Latest Threads

Top