Splitting a String with a Regex

stevengarcia · Apr 27, 2006

I have multiple root XML documents in a String that looks like

"<?xml...><response .../><?xml...><response .../><?xml...><response
..../>"

There are three valid XML documents above. Unfortunately, I have all of
them in one String, so (as far as I can tell) XML parsing with dom4j
will not give me three Document objects.

I am trying to write a method that will split the above String into
three separate strings that are all valid XML, and can be parsed by an
XML parser. First I tried String.split()...but there is no good
delimiter. Then I tried writing a regular expression, and I think
regexes will work here, but I'm not proficient at this advanced topic.

The other thing is that the real XML has carriage returns, line feeds,
and other random characters between each XML document. The XML within
each document is guaranteed to be valid, however.

Is a regex a good way to do this? Your help would be appreciated.

stevengarcia · Apr 27, 2006

I guess I could use a StringTokenizer, and the token would be "<?xml",
and also tell the StringTokenizer to return the delimiter along with
each token.

That should work.

Frank Seidinger · Apr 27, 2006

I have multiple root XML documents in a String that looks like

"<?xml...><response .../><?xml...><response .../><?xml...><response
.../>"

There are three valid XML documents above, unfortunately I have all of
them in one String so (as far as I can tell) XML parsing with dom4j
will not give me three Document objects.

Did you try to parse the first document with either a DOM or a SAX parser?
All XML parsers read from input streams and don't care whether the
document is split over several lines or comes in one line.

Therefore creating an input stream from a string using
StringBufferInputStream and feeding this stream to a parser should consume
as many characters as needed to parse the first valid xml document.

Using the same input stream again for the parser should get you the next
document. You can repeat this, until your string is completely consumed.

I am trying to write a method that will split the above String into
three separate strings that are all valid XML, and can be parsed by an
XML parser. First I tried String.split()...but there is no good
delimiter. Then I tried writing a regular expression, and I think
regex's will work here, but I'm not proficient at this advanced topic.

For that, you can simply use the indexOf(String str) method of the String
class itself. With indexOf("<?xml"), for example, you can find the index
where your first document starts.

With indexOf("<?xml", firstIndex + 1) you will find the start of the second
document. The text between firstIndex and secondIndex is the content of
your first document.

stevengarcia · Apr 27, 2006

I guess I could use a StringTokenizer, and the token would be "<?xml",
and also tell the StringTokenizer to return the delimiter along with
each token.

That should work.

Nope, it doesn't. StringTokenizer treats each character in the delimiter
string as a separate delimiter. I want "<?xml" to be one delimiter as a
whole, not five individual characters.

Maybe it's back to regex.
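
To illustrate the StringTokenizer behaviour described above, here is a minimal sketch. The class name TokenizerDemo and the sample input are made up for the illustration; only java.util.StringTokenizer itself is from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original thread.
class TokenizerDemo {
    // Collects the tokens StringTokenizer produces for the given delimiter string.
    static List<String> tokens(String input, String delims) {
        List<String> out = new ArrayList<>();
        // false = do not return the delimiters themselves as tokens
        StringTokenizer st = new StringTokenizer(input, delims, false);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each of the characters <, ?, x, m, l acts as a delimiter on its own,
        // so the input is cut at every one of them, not only at "<?xml".
        System.out.println(tokens("<?xml v?><r/>", "<?xml"));
    }
}
```

The declaration is chopped into fragments because the second constructor argument is a set of delimiter characters rather than a delimiter string.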

Danno · Apr 27, 2006

Try:

    String s = "<?xml...><response ..../><?xml...><response.../><?xml...><response.../>";
    String[] tokens = s.split("<\\?xml[.]*>");
    for (String token : tokens) {
        System.out.println(token);
    }

Just a guess, I haven't tried it, so there may be errors.

stevengarcia · Apr 27, 2006

Frank said:
Did you try to parse the first document with either a dom or a sax parser?
All XML parsers read from input streams and don't care whether the
document is split over several lines or comes in one line.

Therefore creating an input stream from a string using
StringBufferInputStream and feeding this stream to a parser should consume
as many characters as needed to parse the first valid xml document.

Using the same input stream again for the parser should get you the next
document. You can repeat this, until your string is completely consumed.

I got excited by this idea, so I tried it. It didn't work, as I got
the following exception

The processing instruction target matching "[xX][mM][lL]" is not
allowed.

which I think means you can't have more than one <?xml declaration in a
document.

Good suggestion though.
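
For reference, the failure can be reproduced with a small sketch. It uses the JDK's built-in JAXP DocumentBuilder rather than dom4j, which is an assumption made to keep the example dependency-free; the class name ConcatenatedParseDemo and the sample strings are made up. The exact wording of the error message depends on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original thread.
class ConcatenatedParseDemo {
    // Returns null if the text parses as one XML document,
    // otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String one = "<?xml version=\"1.0\"?><a/>";
        // A single document parses without error.
        System.out.println(parseError(one));
        // Two documents back to back are rejected: the second declaration
        // appears after the root element, where it is not allowed.
        System.out.println(parseError(one + "<?xml version=\"1.0\"?><b/>"));
    }
}
```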

Smilodon · Apr 28, 2006

Would you please try this one?

public class MultiXMLSplit {
    private static final String xmlStr =
        "<?xml><root>hello1</root><?xml><root>hello2</root><?xml><root>hello3</root>";

    public static void main(String[] args) {
        int index1 = xmlStr.indexOf("<?xml");
        int index2;
        while (index1 != -1 && index1 < xmlStr.length() - 1) {
            index2 = xmlStr.indexOf("<?xml", index1 + 1);
            if (index2 != -1 && index2 < xmlStr.length()) {
                System.out.println(xmlStr.substring(index1, index2));
            } else break;
            index1 = index2;
        }
        // Deal with the last xml doc
        if (index1 != -1 && index1 < xmlStr.length() - 1)
            System.out.println(xmlStr.substring(index1));
    }
}

Maybe you should add more code to trim the whitespace at the head of each
XML document text. As far as I know, if an XML document text starts with
whitespace before the <?xml declaration, the XML parser will not parse it
correctly. You will get error messages like this:

The processing instruction target matching "[xX][mM][lL]" is not
allowed.
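
A minimal sketch of how the split, trim, and parse steps could fit together. It again assumes the JDK's built-in JAXP parser instead of dom4j, and the class SplitTrimParse, its method names, and the sample input are invented for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original thread.
class SplitTrimParse {
    // Cuts the text at every "<?xml" and trims each piece.
    static List<String> split(String all) {
        List<String> parts = new ArrayList<>();
        int start = all.indexOf("<?xml");
        while (start != -1) {
            int next = all.indexOf("<?xml", start + 1);
            String part = (next == -1) ? all.substring(start)
                                       : all.substring(start, next);
            parts.add(part.trim());
            start = next;
        }
        return parts;
    }

    // Parses one piece and returns the name of its root element.
    static String rootName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("not well-formed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String all = "\r\n <?xml version=\"1.0\"?><a>1</a>\r\n"
                   + "<?xml version=\"1.0\"?><b>2</b>\n";
        for (String part : split(all)) {
            System.out.println(rootName(part) + ": " + part);
        }
    }
}
```

Because each piece starts exactly at "<?xml", anything in front of the declaration is already excluded; the trim() call only removes the line breaks left at the end of each piece.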

Oliver Wong · May 3, 2006

Danno said:
Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there maybe errors.

Probably won't work. XML is a context-free language, not a regular
language.

- Oliver

Jussi Piitulainen · May 3, 2006

Oliver said:
Danno said:

Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there maybe errors.

Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.

Oliver Wong · May 3, 2006

Jussi Piitulainen said:
Oliver said:

Danno said:

Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there maybe errors.

Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.

Oops, I had thought that the regular expression Danno wrote was meant to
match the content of the strings themselves, rather than the delimiters.
So actually, Danno's code will probably work, as long as the "[.]*" part
isn't greedy, along with the other qualifications you gave.

- Oliver

Jussi Piitulainen · May 4, 2006

Oliver said:
Jussi said:

Oliver said:

Danno wrote: ....
String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>"); ....
Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.

Oops, I had thought that the regular expression Danno wrote was
to get the content of the strings themselves, rather than the
delimiters. So actually, Danno's code may probably work, as long as
the "[.]*" part isn't greedy, along with the other qualifications
you gave.

Yes, the pattern in .split() is just the delimiter.

Greed is one fault. Character class brackets are another: the pattern
"[.]*" matches any number of dots only, while ".*" matches any number
of almost any characters. Both faults are easily fixed.
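
A small sketch of how the three delimiter patterns behave in String.split(). The class name SplitPatternDemo and the shortened sample input are made up for the illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class SplitPatternDemo {
    // Shortened stand-in for the real input: two declarations, two documents.
    static final String INPUT = "<?xml v?><r 1/><?xml v?><r 2/>";

    public static void main(String[] args) {
        // "[.]*" matches literal dots only, so the delimiter never matches
        // and split() returns the whole input as a single element.
        System.out.println(Arrays.toString(INPUT.split("<\\?xml[.]*>")));

        // Greedy ".*" runs to the last '>' and swallows the entire input,
        // leaving nothing between the delimiters.
        System.out.println(Arrays.toString(INPUT.split("<[?]xml.*>")));

        // Reluctant ".*?" stops at the first '>', so each declaration is
        // one delimiter. The declarations themselves are dropped, and the
        // delimiter at index 0 yields an empty first element.
        System.out.println(Arrays.toString(INPUT.split("<[?]xml.*?>")));
    }
}
```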

The method does not return the actual delimiters, so the text that was
matched by ".*?" would be lost. If all the other conditions are right,
then "(<[?]xml.*?)((?=<[?]xml)|\\z)" should match exactly the wanted
parts of the document: from "<?xml" up to another "<?xml" or the end
of all input. Let me see. I shorten the tags a bit to keep the line
lengths under control:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Split {
    public static void main(String[] args) {
        // Group 1 is one document; group 2 is the zero-width lookahead
        // for the next "<?x" (or the end of input), so it is always empty.
        Matcher m = Pattern
            .compile("(<[?]x.*?)((?=<[?]x)|\\z)")
            .matcher("<?x 1?><r 1/><?x 2?><r 2/><?x 3?><r 3/>");
        while (m.find()) {
            System.out.println("(" + m.group(1) + ")(" + m.group(2) + ")");
        }
    }
}

Ok, it appears to work - if all the conditions about the input are
true.

Danno · May 4, 2006

Holy shit, you are LORD of the REGEX. That's awesome, I am swiping
this.
