efficiently splitting up strings based on substrings

per · Sep 5, 2009

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings.
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5

Then my_string might be:

my_string = a + b + c

i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string. I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.

to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x). But I don't know how to partition x into
blocks depending
on the substrings. What I want to get out in this case is: "00",
"111". If x were "001111122",
I'd want to get out "00","11111", "22".

is there an easy way to do this? i can't simply split x on a, b, or c
because these might
not be contained in x. I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.

i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem. I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.

any help on this would be greatly appreciated. thanks.

Rhodri James · Sep 5, 2009

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings.
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5

Then my_string might be:

my_string = a + b + c

i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string. I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.

to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x). But I don't know how to partition x into
blocks depending
on the substrings. What I want to get out in this case is: "00",
"111". If x were "001111122",
I'd want to get out "00","11111", "22".

is there an easy way to do this? i can't simply split x on a, b, or c
because these might
not be contained in x. I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.

i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem. I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.

I'm not sure I understand your question exactly. You seem to imply
that the order of the substrings of x is consistent. If that's the
case, this ought to help:
('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's
no great problem.

per · Sep 5, 2009

I'm not sure I understand your question exactly. You seem to imply
that the order of the substrings of x is consistent. If that's the
case, this ought to help:

('00', '11111', '22')>>> y = "00111"

('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's
no great problem.

The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.

a = 1030405, b = 1babcf, c = fUUIUP

then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.

your solution works if that weren't a possibility, so what you wrote
is definitely the kind of solution i am looking for. i am just not
sure how to solve it in the general case where the substrings might be
similar to each other (but not similar enough that you can't tell
where the substring came from).

per · Sep 6, 2009

Right. This looks approximately nothing like what I thought your
problem was. Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?

I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

Rhodri James · Sep 6, 2009

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

I see, I think. Let me explain it back to you, just to be sure.

You have a string x, and three component strings a, b and c. x is
a substring of the concatenation of a, b and c (i.e. a+b+c). You
want to find out how x overlaps a, b and c.

Assuming I've understood this right, you're overthinking the problem.
All you need to do is find the start of x in a+b+c, then do some
calculations based on the string lengths and slice appropriately.
I'd scribble some example code, but it's nearly 1am and I'd be sure
to commit fence-post errors at this time of night.

7stud · Sep 6, 2009

Right. This looks approximately nothing like what I thought your
problem was. Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?

Click to expand...

I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.

Click to expand...

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

7stud · Sep 6, 2009

it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

Click to expand...

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

Click to expand...

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

Click to expand...

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

Click to expand...

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

....or maybe this is easier to follow:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

--output:--
4051babcffUUIUP
4051babcffU
----------

b_start = new_abc.find(b)
c_start = new_abc.find(c)

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

7stud · Sep 6, 2009

I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings..
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5
Then my_string might be:
my_string = a + b + c
i am looking for an efficient way to solve the following problem.
suppose i have a short
string x that is a substring of my_string. I want to "split" the
string x into blocks based on
what substrings (i.e. a, b, or c) chunks of s fall into.
to illustrate this, suppose x = "00111". Then I can detect where x
starts in my_string
using my_string.find(x). But I don't know how to partition x into
blocks depending
on the substrings. What I want to get out in this case is: "00",
"111". If x were "001111122",
I'd want to get out "00","11111", "22".
is there an easy way to do this? i can't simply split x on a, b, or c
because these might
not be contained in x. I want to avoid doing something inefficient
like looking at all substrings
of my_string etc.
i wouldn't mind using regular expressions for this but i cannot think
of an easy regular
expression for this problem. I looked at the string module in the
library but did not see
anything that seemd related but i might have missed it.
I'm not sure I understand your question exactly. You seem to imply
that the order of the substrings of x is consistent. If that's the
case, this ought to help:
import re
x = "001111122"
m = re.match(r"(0*)(1*)(2*)", x)
m.groups()
('00', '11111', '22')>>> y = "00111"
m = re.match(r"(0*)(1*)(2*)", y)
m.groups()
('00', '111', '')
You'll have to filter out the empty groups for yourself, but that's
no great problem.
The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.
a = 1030405, b = 1babcf, c = fUUIUP
then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.
Right. This looks approximately nothing like what I thought your
problem was. Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?
I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.
--
Rhodri James *-* Wildebeest Herder to the Masses
it's exactly the same problem, except there are no constraints on the
strings. so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.
so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP
given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c. if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.
i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.

Click to expand...

Click to expand...

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

Click to expand...

data_start = abc.find(data)
b_start = abc.find(b) - data_start
c_start = abc.find(c) - data_start

Click to expand...

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

Click to expand...

--output:--
405
1babcf
fU

Click to expand...

...or maybe this is easier to follow:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "4051babcffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

--output:--
4051babcffUUIUP
4051babcffU
----------

b_start = new_abc.find(b)
c_start = new_abc.find(c)

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

--output:--
405
1babcf
fU

Nope. My solutions have problems with:

data = "cffU"

To handle that data, it gets messier:

a = "1030405"
b = "1babcf"
c = "fUUIUP"
abc = "10304051babcffUUIUP"
data = "cffU"

data_start = abc.find(data)
new_abc = abc[data_start:]
print new_abc
print data
print "-" * 10

b_start = new_abc.find(b)
if b_start == -1:
b_start = 0

c_start = new_abc.find(c)
if c_start == -1:
c_start = 0

print data[:b_start]
print data[b_start:c_start]
print data[c_start:]

If data is not a substring of abc, then that last line will select the
whole data string.

Splitting a string into substrings of equal size	18	Aug 15, 2009
splitting file/content into lines based on regex termination	0	Nov 7, 2013
Padding strings for a clean visual print out...	5	Dec 23, 2023
How to use SQLite (sqlite3) more efficiently	0	Jun 5, 2014
Counting Palindrome Substrings ?	1	Feb 5, 2011
Python pyPDF4 code to bookmark pdf based upon date text	1	Jan 18, 2023
extracting substrings based on 'fuzzy' match	2	Nov 21, 2008
Filter table rows based on multiple checkboxes value	2	Jan 13, 2023

efficiently splitting up strings based on substrings

per

Rhodri James

per

per

Rhodri James

7stud

7stud

7stud

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads